Naive Bayes

Buse Köseoğlu
Mar 3, 2021


Bayes' Theorem

In probability theory, the conditional probability of an event A given an event B (that is, the probability of A when B is known to have occurred) is, in general, different from the conditional probability of B given A. There is, however, a precise relationship between these two opposite conditional probabilities, and this relationship is called Bayes' Theorem, after the British statistician Thomas Bayes (1702–1761), who first described it. (Wikipedia)
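In symbols, the theorem relates the two conditional probabilities as follows:

P(A|B) = P(B|A) · P(A) / P(B)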

Naive Bayes Classifier

It is based on Bayes' theorem. It can work well even with unbalanced datasets or with relatively little training data. To classify an element, it computes the probability of each possible class and assigns the class with the highest probability.
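Concretely, for a sample with features x1, …, xn, the classifier picks the class y that maximizes the product of the class prior and the feature likelihoods (which Naive Bayes assumes to be independent of each other):

ŷ = argmax over y of P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)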

Naive Bayes cannot handle data it has never seen before. In other words, if a value appears in the test set but never appeared in the training set, the classification fails. This is called the "zero probability" problem. For example, suppose the mail arriving at a spam classifier is "lunch money money". Since the model has never seen the word "lunch" in the training dataset, it takes the probability of "lunch" being spam as 0. No matter how strongly the word "money" indicates spam, Naive Bayes multiplies the probabilities together, so the whole product is multiplied by 0 and the model cannot recognize the mail as spam. To avoid this, developers usually add a minimum value (usually 1) to all counts, so that no multiplication by 0 takes place. This technique is known as Laplace (additive) smoothing.

Naive Bayes also ignores any ordering in the data: the order of the words in the spam example above is irrelevant to the algorithm.
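Here is a minimal sketch of that smoothing in practice, using scikit-learn's MultinomialNB on a made-up toy corpus (the texts and labels below are purely illustrative; alpha=1.0 is the Laplace smoothing described above). The bag-of-words counts also discard word order, as just mentioned.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration: "money" only ever appears in spam.
texts = ["money money win prize", "meeting schedule report"]
labels = ["spam", "ham"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

# alpha=1.0 adds 1 to every word count per class (Laplace smoothing),
# so a word unseen in a class no longer forces that class's probability to 0.
model = MultinomialNB(alpha=1.0)
model.fit(counts, labels)

# "lunch" was never seen in training; with smoothing the mail is still
# classified using the remaining evidence ("money").
print(model.predict(vectorizer.transform(["lunch money money"])))  # ['spam']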

Use Cases of the Naive Bayes Classifier

• In real-time projects, because it works fast,

• In multi-class classification problems,

• In text classification,

• In spam mail filtering,

• In sentiment analysis.

There are three types of Naive Bayes; a short code sketch of each follows the list:

1. Gaussian Naive Bayes: Assumes that continuous variables follow a normal (Gaussian) distribution.

2. Multinomial Naive Bayes: Used for discrete count data, for example word counts in text classification.

3. Bernoulli Naive Bayes: Used when the feature vectors are binary (0–1).
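A brief sketch of the three variants in scikit-learn; the tiny arrays below are invented only to show which kind of features each variant expects.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])

X_continuous = np.array([[1.2], [3.4], [0.8], [2.9]])  # real-valued features
GaussianNB().fit(X_continuous, y)

X_counts = np.array([[2, 0], [0, 3], [1, 0], [0, 2]])  # e.g. word counts
MultinomialNB().fit(X_counts, y)

X_binary = (X_counts > 0).astype(int)  # presence/absence features
BernoulliNB().fit(X_binary, y)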

Application with Python

In this article, we will build an application using the HR Analytics: Job Change of Data Scientists dataset on Kaggle. You can access the Kaggle notebook I prepared here.

This dataset consists of 14 variables, and the task is to build a model for the dependent variable "target". A "target" value of 0 means the employee is not looking for a new job, and 1 means the employee is looking for a new job.

First of all, we load the necessary libraries.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV

Then we examine the dataset with the help of the head() function.
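The loading step itself is in the notebook; here is a minimal sketch, assuming the dataset's training CSV is named aug_train.csv (the file name and path are an assumption, check your Kaggle input directory):

# Hypothetical loading step; the CSV file name/path is an assumption.
df_train = pd.read_csv("aug_train.csv")
df_train.head()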

Next, we examine the variables in the dataset; you can find this section in the notebook. When all of this is done, we preprocess the data to prepare it for the model.

def preprocessing(df):
    df.drop(["enrollee_id"], axis=1, inplace=True)
    df.dropna(inplace=True)
    le = LabelEncoder()
    cols_to_le = ["gender", "city", "relevent_experience", "enrolled_university",
                  "education_level", "major_discipline", "experience",
                  "company_size", "company_type", "last_new_job"]
    for col in cols_to_le:
        df[col] = le.fit_transform(df[col])

preprocessing(df_train)
df_train.head()

preprocessing() takes a DataFrame as a parameter. First, it drops the "enrollee_id" column, which is not needed, from the dataset. In the drop operation, the axis=1 parameter specifies that the operation is column-based, and inplace=True means the DataFrame is modified in place rather than a modified copy being returned. Then the missing values in the dataset are dropped with dropna(). This dataset contains many categorical variables, and we cannot feed categorical data to the model directly; we first need to convert it to numerical values. We use LabelEncoder for this. Thus, all our variables become numerical.

Before giving the dataset to the model, we assign all variables except "target" to X and the "target" variable to y. Then we split the data into train and test sets. X_train and y_train hold the independent and dependent variables used to train the model, while X_test and y_test are used to evaluate it. test_size specifies what fraction of the data (here 20%) will be used for testing. random_state ensures we get the same split every time we run the program. stratify, in turn, ensures that the classes in y are distributed in a balanced way between the two splits.

X = df_train.drop("target", axis=1)
y = df_train["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)

Now that all the preparation is complete, we can feed the data to the model.

gb = GaussianNB()
gb.fit(X_train, y_train)
print(classification_report(y_test, gb.predict(X_test)))

Here we first create a GaussianNB() object and train it on the training data. Finally, we compare the predicted and actual values with classification_report.

When we examine the resulting report, we see that the accuracy of our model is 81%.
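As a possible next step, the GridSearchCV we imported earlier could be used to tune GaussianNB's var_smoothing hyperparameter. This search is not part of the original notebook, just a sketch:

# Hypothetical tuning step, not in the original notebook.
params = {"var_smoothing": np.logspace(-9, -1, 9)}
search = GridSearchCV(GaussianNB(), params, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)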
