Machine Learning with Python: Introduction Naive Bayes Classifier

Naive Bayes Classifier


Thomas Bayes: Conditional Probability

In machine learning, a naive Bayes classifier is a simple probabilistic classifier, which is based on applying Bayes' theorem. The feature model used by a naive Bayes classifier makes strong independence assumptions. This means that the existence of a particular feature of a class is assumed to be independent of, i.e. unrelated to, the existence of every other feature.

Definition of independent events:

Two events E and F are independent, if both E and F have positive probability and if P(E|F) = P(E) and P(F|E) = P(F)
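The definition can be checked by plain counting. The following is a minimal sketch, assuming a made-up experiment of rolling two fair dice; the helper names `prob` and `cond_prob` and the events `E`, `F` and `G` are chosen for this illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice
# (an illustrative assumption, not part of the original text).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event), where event is a predicate over single outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event, given):
    """P(event | given), counted inside the reduced sample space 'given'."""
    reduced = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in reduced if event(o)), len(reduced))

E = lambda o: o[0] == 6           # first die shows a six
F = lambda o: o[1] % 2 == 0       # second die shows an even number
G = lambda o: o[0] + o[1] >= 10   # the sum is at least ten

# E and F are independent: conditioning changes neither probability.
print(prob(E), cond_prob(E, F), prob(F), cond_prob(F, E))
# E and G are not independent: knowing G makes a six much more likely.
print(prob(E), cond_prob(E, G))
```

The first line prints matching pairs, the second line prints two different values.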

As we have stated in our definition, the Naive Bayes Classifier is based on Bayes' theorem. Bayes' theorem in turn is based on conditional probability, which we will define now:

Conditional Probability

$P(A|B)$ stands for "the conditional probability of A given B", or "the probability of A under the condition B", i.e. the probability of some event A under the assumption that the event B took place. When in a random experiment the event B is known to have occurred, the possible outcomes of the experiment are reduced to B, and hence the probability of the occurrence of A is changed from the unconditional probability into the conditional probability given B. The joint probability is the probability of two events in conjunction. That is, it is the probability of both events together. There are three notations for the joint probability of A and B. It can be written as $P(A ∩ B)$, $P(AB)$ or $P(A, B)$.

The conditional probability is defined by

$$P(A|B) = \frac{P( A ∩ B)}{P(B)}$$

Examples for Conditional Probability

German Swiss Speaker

There are about 8.4 million people living in Switzerland. About 64 % of them speak German. There are about 7500 million people on earth.

If some aliens randomly beam up an earthling, what are the chances that he is a German speaking Swiss?

We have the events

S: being Swiss

GS: German Speaking

The probability for a randomly chosen person to be Swiss:

$$P(S) = \frac{8.4}{7500} = 0.00112 $$

If we know that somebody is Swiss, the probability of speaking German is 0.64. This corresponds to the conditional probability

$$P(GS | S) = 0.64$$

So the probability of the earthling being Swiss and speaking German can be calculated by the formula:

$$P(GS | S) = \frac{P(GS ∩ S)}{P(S)}$$

inserting the values from above gives us:

$$0.64 = \frac{P(GS ∩ S)}{0.00112}$$


$$P(GS ∩ S) = 0.0007168$$

So our aliens end up with a chance of 0.07168 % of getting a German speaking Swiss person.
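The calculation can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above; the variable names are chosen for illustration and the figures are the ones quoted in the text.

```python
# Figures quoted in the text, in millions of people.
swiss_population = 8.4
world_population = 7500

p_gs_given_s = 0.64                          # P(GS | S)
p_s = swiss_population / world_population    # P(S)

# Rearranging P(GS | S) = P(GS ∩ S) / P(S) gives the joint probability.
p_gs_and_s = p_gs_given_s * p_s

print(f"P(S)      = {p_s:.5f}")
print(f"P(GS ∩ S) = {p_gs_and_s:.7f}, i.e. {p_gs_and_s:.5%}")
```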

False Positives and False Negatives

A medical research lab proposes a screening to test a large group of people for a disease. An argument against such screenings is the problem of false positive screening results.

Suppose 0.1 % of the group suffer from the disease, and the rest is well:

$$P(\text{"sick"}) = 0.1 \% = 0.001$$

and $$P(\text{"well"}) = 99.9 \% = 0.999$$

The following is true for a screening test:

If you have the disease, the test will be positive 99% of the time, and if you don't have it, the test will be negative 99% of the time:

P("test positive" | "well") = 1 %


P("test negative" | "well") = 99 %.

Finally, suppose that when the test is applied to a person having the disease, there is a 1% chance of a false negative result (and 99% chance of getting a true positive result), i.e.

P("test negative" | "sick") = 1 %


P("test positive" | "sick") = 99 %

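Applied to a group of 100,000 people, these rates give 100 sick and 99,900 well persons. Of the sick persons, 99 test positive and 1 tests negative. Of the well persons, 999 test positive and 98,901 test negative. So there are 999 False Positives and 1 False Negative.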


In many cases even medical professionals take the statement "if you have this sickness, the test will be positive 99 % of the time, and if you don't have it, the test will be negative 99 % of the time" to mean that a positive result is almost certainly correct. This is not the case. Out of the 1098 cases that report positive results only 99 (9 %) are correct and 999 are false positives (91 %), i.e. if a person gets a positive test result, the probability that he or she actually has the disease is just about 9 %. P("sick" | "test positive") = 99 / 1098 = 9.02 %
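The counts behind this result can be recomputed with exact fractions. The sketch below assumes a group of 100,000 people, which is the group size that reproduces the 1098 positive results mentioned in the text; all variable names are illustrative.

```python
from fractions import Fraction

group_size = 100_000                      # assumed size of the screened group
p_sick = Fraction(1, 1000)                # 0.1 % of the group have the disease
p_pos_given_sick = Fraction(99, 100)      # P("test positive" | "sick")
p_pos_given_well = Fraction(1, 100)       # P("test positive" | "well")

sick = group_size * p_sick
well = group_size - sick

true_positives = sick * p_pos_given_sick
false_negatives = sick - true_positives
false_positives = well * p_pos_given_well
true_negatives = well - false_positives

# P("sick" | "test positive"): share of correct results among all positives.
p_sick_given_pos = true_positives / (true_positives + false_positives)

print("true positives: ", true_positives)
print("false positives:", false_positives)
print("false negatives:", false_negatives)
print(f"P(sick | positive) = {float(p_sick_given_pos):.2%}")
```

Because the disease is rare, the false positives from the large healthy group outnumber the true positives from the small sick group by about ten to one.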

Bayes' Theorem

In the example above we worked with the conditional probability $P(GS | S)$, which was the probability that a person speaks German, if he or she is known to be Swiss. It is defined by the following equation:

$$P(GS | S) = \frac{P(GS, S)}{P(S)}$$

What about calculating the probability $P(S | GS)$, i.e. the probability that somebody is Swiss under the assumption that the person speaks German?

The equation looks like this:

$$P(S| GS) = \frac{P(GS, S)}{P(GS)}$$

Let's isolate $P(GS, S)$ in both equations:

$$P(GS, S) = P(GS | S) P(S)$$

$$P(GS, S) = P(S | GS) P(GS)$$

As the left sides are equal, the right sides have to be equal as well:

$$P(GS | S) P(S) = P(S | GS) P(GS)$$

This equation can be transformed into:

$$P(S | GS) = \frac{P(GS | S) P(S)}{P(GS)}$$

The result corresponds to Bayes' theorem.

To solve our problem, i.e. to find the probability that a person is Swiss if we know that he or she speaks German, all we have to do is calculate the right side. We already know from our previous exercise that

$$P(GS | S) = 0.64$$


$$P(S) = 0.00112 $$

There are about 101 million native German speakers in the world, so we know that

$$P(GS) = \frac{101}{7500} = 0.0134667 $$

Finally, we can calculate $P(S | GS)$ by substituting the values in our equation:

$$P(S | GS) = \frac{P(GS | S) P(S)}{ P(GS)} = \frac{0.64 * 0.00112}{0.0134667} = 0.0532276$$
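The same substitution can be written as a small helper function. This is a minimal sketch; the function name `bayes` and its parameter names are chosen for this illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) computed from P(B | A), P(A) and P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Figures from the text, populations in millions.
p_s = 8.4 / 7500        # P(S): being Swiss
p_gs = 101 / 7500       # P(GS): being a native German speaker
p_gs_given_s = 0.64     # P(GS | S)

p_s_given_gs = bayes(p_gs_given_s, p_s, p_gs)
print(f"P(S | GS) = {p_s_given_gs:.4f}")
```

So roughly 5 % of all native German speakers are Swiss, although Switzerland accounts for only about 0.1 % of the world population.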

In its general form, Bayes' theorem reads:

$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

$P(A|B)$ is the conditional probability of $A$, given $B$ (posterior probability), $P(B)$ is the prior probability of $B$ and $P(A)$ the prior probability of $A$. $P(B|A)$ is the conditional probability of $B$ given $A$, called the likelihood.

An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

Naive Bayes Classifier

Introductory Exercise

Main Railway station in Hamburg

Let's set out on a journey by train to create our first very simple Naive Bayes Classifier. Let us assume we are in the city of Hamburg and we want to travel to Munich. We will have to change trains in Frankfurt am Main. We know from previous train journeys that our train from Hamburg might be delayed and that we will then not catch our connecting train in Frankfurt. The probability that we will not be in time for our connecting train depends on how long our delay is. The connecting train will not wait for more than five minutes. Sometimes the other train is delayed as well.

The following lists 'in_time' (the train from Hamburg arrived in time to catch the connecting train to Munich) and 'too_late' (connecting train is missed) contain data showing the situation over some weeks. The first component of each tuple shows the minutes the train was late and the second component shows the number of times this occurred.

# tuples are (minutes late, number of times this occurred)
in_time = [(0, 22), (1, 19), (2, 17), (3, 18),
           (4, 16), (5, 15), (6, 9), (7, 7),
           (8, 4), (9, 3), (10, 3), (11, 2)]
too_late = [(6, 6), (7, 9), (8, 12), (9, 17), 
            (10, 18), (11, 15), (12,16), (13, 7),
            (14, 8), (15, 5)]
# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt
# split the tuples into delays (X) and counts (Y)
X, Y = zip(*in_time)
X2, Y2 = zip(*too_late)
# draw both distributions into the same bar chart
bar_width = 0.9
plt.bar(X, Y, bar_width,  color="blue", alpha=0.75, label="in time")
bar_width = 0.8
plt.bar(X2, Y2, bar_width,  color="red", alpha=0.75, label="too late")
plt.legend(loc='upper right')
plt.show()

From this data we can deduce that the probability of catching the connecting train if we are one minute late is 1, because we experienced 19 successful cases and no misses, i.e. there is no tuple with 1 as the first component in 'too_late'.

We will denote the event "train arrived in time to catch the connecting train" with $S$ (success) and the 'unlucky' event "train arrived too late to catch the connecting train" with $M$ (miss)

We can now define the probability "catching the train given that we are 1 minute late" formally:

$$P(S | 1) = 19 / 19 = 1$$

We used the fact that the tuple $(1, 19)$ is in 'in_time' and there is no tuple with the first component 1 in 'too_late'

It's getting critical for catching the connecting train to Munich, if we are 6 minutes late. Yet, the chances are still 60 %:

$$P(S | 6) = 9 / (9 + 6) = 0.6$$

Accordingly, the probability for missing the train knowing that we are 6 minutes late is:

$$P(M | 6) = 6 / (9 + 6) = 0.4$$

We can write a 'classifier' function, which will give the probability for catching the connecting train:

in_time_dict = dict(in_time)
too_late_dict = dict(too_late)
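def catch_the_train(delay):
    # number of times the connecting train was caught with this delay
    s = in_time_dict.get(delay, 0)
    if s == 0:
        return 0
    else:
        # number of times the connecting train was missed with this delay
        m = too_late_dict.get(delay, 0)
        return s / (s + m)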
for minutes in range(-1, 13):
    print(minutes, catch_the_train(minutes))
-1 0
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.6
7 0.4375
8 0.25
9 0.15
10 0.14285714285714285
11 0.11764705882352941
12 0
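The probabilities above can be turned into a decision between the two classes $S$ and $M$. The following sketch is a variation on `catch_the_train`, not the function from the text: the name `classify_delay` and the `threshold` parameter are assumptions made for this illustration, and unlike the original it returns `None` for delays that were never observed instead of 0.

```python
# Observations copied from the text: (minutes late, number of times).
in_time = [(0, 22), (1, 19), (2, 17), (3, 18),
           (4, 16), (5, 15), (6, 9), (7, 7),
           (8, 4), (9, 3), (10, 3), (11, 2)]
too_late = [(6, 6), (7, 9), (8, 12), (9, 17),
            (10, 18), (11, 15), (12, 16), (13, 7),
            (14, 8), (15, 5)]

in_time_dict = dict(in_time)
too_late_dict = dict(too_late)

def classify_delay(delay, threshold=0.5):
    """Return 'S' (success), 'M' (miss) or None if the delay was never observed."""
    s = in_time_dict.get(delay, 0)
    m = too_late_dict.get(delay, 0)
    if s + m == 0:
        return None                    # no data at all for this delay
    p_success = s / (s + m)            # relative frequency of catching the train
    return "S" if p_success >= threshold else "M"

for delay in (1, 6, 7, 13, 20):
    print(delay, classify_delay(delay))
```

Separating "never observed" from "always missed" keeps a delay of 20 minutes, for which there is no data, apart from a delay of 13 minutes, for which every recorded journey ended in a miss.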

A Naive Bayes Classifier Example

Getting the Data Ready

We will use a file called 'person_data.txt'. It contains data on 100 random persons, male and female, with body heights, weights and gender tags.

import numpy as np
genders = ["male", "female"]
persons = []
with open("data/person_data.txt") as fh:
    for line in fh:
        # each line holds: first name, last name, height, weight, gender
        # (whitespace-separated fields are assumed here)
        persons.append(line.strip().split())
firstnames = {}
heights = {}
for gender in genders:
    firstnames[gender] = [ x[0] for x in persons if x[4]==gender]
    heights[gender] = [ x[2] for x in persons if x[4]==gender]
    # the heights are read as strings and converted to integers
    heights[gender] = np.array(heights[gender], int)
for gender in ("female", "male"):
    print(gender + ":")
    # show the first ten entries of each list
    print(firstnames[gender][:10])
    print(heights[gender][:10])
female:
['Stephanie', 'Cynthia', 'Katherine', 'Elizabeth', 'Carol', 'Christina', 'Beverly', 'Sharon', 'Denise', 'Rebecca']
[149 174 183 138 145 161 179 162 148 196]
male:
['Randy', 'Jessie', 'David', 'Stephen', 'Jerry', 'Billy', 'Earl', 'Todd', 'Martin', 'Kenneth']
[184 175 187 192 204 180 184 174 177 200]

Warning: There might be some confusion between a Python class and a Naive Bayes class. We try to avoid it by saying explicitly what is meant, whenever possible!

Designing a Feature class

We will now define a Python class "Feature" for the features, which we will use for classification later.

The Feature class needs a label, e.g. "heights" or "firstnames". If the feature values are numerical we may want to "bin" them to reduce the number of possible feature values. The heights from our persons have a huge range and we have only 50 measured values for our Naive Bayes classes "male" and "female". We will bin them into ranges "130 to 134", "135 to 139", "140 to 144" and so on by setting bin_width to 5. There is no way of binning the first names, so bin_width will be set to None.

The method frequency returns the number of occurrences for a certain feature value or a binned range.
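The binning rule can be tried out in isolation before it is wrapped in a class. The sketch below maps each height to the lower edge of its bin with floor division and counts the bins with a `Counter`. The helper name `bin_start` is an assumption for this illustration, and the sample consists of the ten female heights printed above.

```python
from collections import Counter

def bin_start(value, bin_width):
    """Return the lower edge of the bin that contains value."""
    return (value // bin_width) * bin_width

# The first ten female heights from the output above.
sample_heights = [149, 174, 183, 138, 145, 161, 179, 162, 148, 196]

bin_width = 5
counts = Counter(bin_start(h, bin_width) for h in sample_heights)

for start in sorted(counts):
    # each bin covers the values start .. start + bin_width - 1
    print(f"{start} to {start + bin_width - 1}: {counts[start]}")

# A bin without any observation has the frequency 0.
print(counts.get(bin_start(152, bin_width), 0))
```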

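from collections import Counter
import numpy as np
class Feature:
    def __init__(self, data, name=None, bin_width=None):
        self.name = name
        self.bin_width = bin_width
        if bin_width:
            # numerical data: count the values per bin of width bin_width
            self.min, self.max = min(data), max(data)
            # np.arange excludes its stop value and np.histogram needs the
            # right edge of the last bin, so the stop value lies two bin
            # widths above the start of the bin containing the maximum
            bins = np.arange((self.min // bin_width) * bin_width, 
                                (self.max // bin_width) * bin_width + 2 * bin_width,
                                bin_width)
            freq, bins = np.histogram(data, bins)
            # map the lower edge of each bin to its frequency
            self.freq_dict = dict(zip(bins, freq))
            self.freq_sum = sum(freq)
        else:
            # categorical data: count every distinct value
            self.freq_dict = dict(Counter(data))
            self.freq_sum = sum(self.freq_dict.values())
    def frequency(self, value):
        if self.bin_width:
            # replace the value by the lower edge of its bin
            value = (value // self.bin_width) * self.bin_width
        if value in self.freq_dict:
            return self.freq_dict[value]
        else:
            return 0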

We will now create two instances of the class Feature for the height values of the person data set. One Feature instance contains the heights for the Naive Bayes class "male" and one the heights for the class "female":

fts = {}
for gender in genders:
    fts[gender] = Feature(heights[gender], name=gender, bin_width=5)
    print(gender, fts[gender].freq_dict)
male {160: 5, 195: 2, 180: 5, 165: 4, 200: 3, 185: 8, 170: 6, 155: 1, 190: 8, 175: 7}
female {160: 8, 130: 1, 165: 11, 135: 1, 170: 7, 140: 0, 175: 2, 145: 3, 180: 4, 150: 5, 185: 0, 155: 7}

Bar Chart of Frequency Distribution

We printed out the frequencies of our bins, but it is a lot better to see these values depicted in a bar chart. We will do this with the following code:

for gender in genders:
    frequencies = list(fts[gender].freq_dict.items())
    frequencies.sort(key=lambda x: x[1])
    # split the (bin, frequency) pairs into x and y values
    X, Y = zip(*frequencies)
    color = "blue" if gender=="male" else "red"
    bar_width = 4 if gender=="male" else 3
    plt.bar(X, Y, bar_width, color=color, alpha=0.75, label=gender)
plt.legend(loc='upper right')
plt.show()

We now have to design a Naive Bayes class in Python. We will call it NBclass. An NBclass contains one or more Feature instances. The name of the NBclass will be stored in self.name.

class NBclass:
    def __init__(self, name, *features):
        self.features = features
        self.name = name
    def probability_value_given_feature(self, 
                                        feature_value, 
                                        feature):
        """
        p_value_given_feature returns the probability p 
        for a feature_value 'value' of the feature to occur
        corresponds to P(d_i | c_j)
        where d_i is a feature variable of the feature i
        and c_j is this Naive Bayes class
        """
        if feature.freq_sum == 0:
            return 0
        else:
            return feature.frequency(feature_value) / feature.freq_sum

In the following code, we will create NBclasses with one feature, i.e. the height feature. We will use the Feature classes of fts, which we have previously created:

cls = {}
for gender in genders:
    cls[gender] = NBclass(gender, fts[gender])

The final step for creating a simple Naive Bayes classifier consists in writing a class 'Classifier', which will use our classes 'NBclass' and 'Feature'.

class Classifier:
    def __init__(self, *nbclasses):
        self.nbclasses = nbclasses
    def prob(self, *d, best_only=True):
        nbclasses = self.nbclasses
        probability_list = []
        for nbclass in nbclasses:            
            ftrs = nbclass.features
            # multiply the probabilities of all feature values for this class
            prob = 1
            for i in range(len(ftrs)):
                prob *= nbclass.probability_value_given_feature(d[i], ftrs[i])
            probability_list.append( (prob, nbclass.name) )
        prob_values = [f[0] for f in probability_list]
        prob_sum = sum(prob_values)
        if prob_sum==0:
            # no class has seen these values: all classes are equally likely
            number_classes = len(self.nbclasses)
            pl = []
            for prob_element in probability_list:
                pl.append( ((1 / number_classes), prob_element[1]))
            probability_list = pl
        else:
            # normalize, so that the probabilities add up to 1
            probability_list = [ (p[0] / prob_sum, p[1])  for p in probability_list]
        if best_only:
            return max(probability_list)
        else:
            return probability_list

We will create a classifier with one feature class 'height'. We check it with values between 130 and 220 cm.

c = Classifier(cls["male"], cls["female"])
for i in range(130, 220, 5):
    print(i, c.prob(i, best_only=False))
130 [(0.0, 'male'), (1.0, 'female')]
135 [(0.0, 'male'), (1.0, 'female')]
140 [(0.5, 'male'), (0.5, 'female')]
145 [(0.0, 'male'), (1.0, 'female')]
150 [(0.0, 'male'), (1.0, 'female')]
155 [(0.125, 'male'), (0.875, 'female')]
160 [(0.38461538461538469, 'male'), (0.61538461538461542, 'female')]
165 [(0.26666666666666666, 'male'), (0.73333333333333328, 'female')]
170 [(0.46153846153846162, 'male'), (0.53846153846153855, 'female')]
175 [(0.77777777777777779, 'male'), (0.22222222222222224, 'female')]
180 [(0.55555555555555558, 'male'), (0.44444444444444448, 'female')]
185 [(1.0, 'male'), (0.0, 'female')]
190 [(1.0, 'male'), (0.0, 'female')]
195 [(1.0, 'male'), (0.0, 'female')]
200 [(1.0, 'male'), (0.0, 'female')]
205 [(0.5, 'male'), (0.5, 'female')]
210 [(0.5, 'male'), (0.5, 'female')]
215 [(0.5, 'male'), (0.5, 'female')]

There are no persons, neither male nor female, in our training set with a body height between 140 and 144 cm. That is the reason why our classifier can't base its result on learned data and therefore comes back with a fifty-fifty result.

We can also train a classifier with our firstnames:

fts = {}
cls = {}
for gender in genders:
    fts_names = Feature(firstnames[gender], name=gender)
    cls[gender] = NBclass(gender, fts_names)
c = Classifier(cls["male"], cls["female"])
testnames = ['Edgar', 'Benjamin', 'Fred', 'Albert', 'Laura', 
             'Maria', 'Paula', 'Sharon', 'Jessie']
for name in testnames:
    print(name, c.prob(name))
Edgar (0.5, 'male')
Benjamin (1.0, 'male')
Fred (1.0, 'male')
Albert (1.0, 'male')
Laura (1.0, 'female')
Maria (1.0, 'female')
Paula (1.0, 'female')
Sharon (1.0, 'female')
Jessie (0.6666666666666667, 'female')

The name "Jessie" is ambiguous. There are about 66 boys per 100 girls with this name. We can learn from the previous classification results that the probability for the name "Jessie" being "female" is about two-thirds, which is calculated from our data set "persons":

[person for person in persons if person[0] == "Jessie"]
We received the following output:
[['Jessie', 'Morgan', '175', '67.0', 'male'],
 ['Jessie', 'Bell', '165', '65', 'female'],
 ['Jessie', 'Washington', '159', '56', 'female'],
 ['Jessie', 'Davis', '174', '45', 'female'],
 ['Jessie', 'Johnson', '165', '30.0', 'male'],
 ['Jessie', 'Thomas', '168', '69', 'female']]

Jessie Washington is only 159 cm tall. If we have a look at the results of our Classifier, trained with heights, we see that the likelihood for a person 159 cm tall of being "female" is 0.875. So what about an unknown person called "Jessie" and being 159 cm tall? Is this person female or male?

To answer this question, we will train a Naive Bayes classifier with two feature classes, i.e. heights and firstnames:

cls = {}
for gender in genders:
    fts_heights = Feature(heights[gender], name="heights", bin_width=5)
    fts_names = Feature(firstnames[gender], name="names")
    cls[gender] = NBclass(gender, fts_names, fts_heights)
c = Classifier(cls["male"], cls["female"])
for d in [("Maria", 140), ("Anthony", 200), ("Anthony", 153), 
          ("Jessie", 188) , ("Jessie", 159), ("Jessie", 160) ]:
    print(d, c.prob(*d, best_only=False))
('Maria', 140) [(0.5, 'male'), (0.5, 'female')]
('Anthony', 200) [(1.0, 'male'), (0.0, 'female')]
('Anthony', 153) [(0.5, 'male'), (0.5, 'female')]
('Jessie', 188) [(1.0, 'male'), (0.0, 'female')]
('Jessie', 159) [(0.066666666666666666, 'male'), (0.93333333333333335, 'female')]
('Jessie', 160) [(0.23809523809523817, 'male'), (0.76190476190476197, 'female')]
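The result for ("Jessie", 159) can be retraced by hand from the counts shown earlier: two male and four female persons are called Jessie, and the bin from 155 to 159 cm contains one male and seven female persons. The sketch below assumes that the totals of each feature are the same for both classes, so that they cancel during normalization; the variable names are illustrative.

```python
from fractions import Fraction

# Counts read off the outputs shown earlier in the text.
name_counts = {"male": 2, "female": 4}      # persons called "Jessie"
height_counts = {"male": 1, "female": 7}    # persons in the bin 155 to 159 cm

# Assumption: both classes have the same totals per feature, so dividing by
# them would not change the normalized result and the counts can be used directly.
scores = {g: name_counts[g] * height_counts[g] for g in name_counts}

total = sum(scores.values())
posterior = {g: Fraction(score, total) for g, score in scores.items()}

for gender, p in posterior.items():
    print(gender, p, float(p))
```

The name alone gives "female" a probability of two-thirds. Multiplying in the height evidence raises it to 14/15, which agrees with the 0.933 printed by the classifier.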

The Underlying Theory

Our classifier from the previous example is based on the Bayes theorem:

$$P(c_j | d) = \frac{P(d | c_j) P(c_j)}{P(d)}$$


In our first examples we used only one feature, i.e. either the height or the first name.

It's possible to define a Bayes Classifier with multiple features, e.g. $d = (d_1, d_2, ..., d_n)$

Assuming that the features are independent of each other given the class, which is the "naive" assumption, we get the following formula:

$$P(c_j | d) = \frac{1}{P(d)} P(c_j) \prod_{i=1}^{n} P( d_i | c_j)$$

$\frac{1}{P(d)}$ depends only on the values of $d_1, d_2, ... d_n$. As the values of the feature variables are known, it is a constant which is the same for every class $c_j$.
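The formula translates almost literally into code. The following is a minimal sketch, not the Classifier class from above: the function name `naive_bayes_posterior`, the class labels "a" and "b" and all numbers are made up for this illustration. Note that the Classifier class multiplies only the feature probabilities, which amounts to giving every class the same prior $P(c_j)$.

```python
from math import prod

def naive_bayes_posterior(priors, likelihoods):
    """
    priors:      {class: P(c_j)}
    likelihoods: {class: [P(d_1 | c_j), ..., P(d_n | c_j)]}
    Returns {class: P(c_j | d)}.
    """
    # numerator of the formula: P(c_j) times the product of all P(d_i | c_j)
    unnormalized = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    # the sum over all classes plays the role of P(d)
    evidence = sum(unnormalized.values())
    if evidence == 0:
        # no class explains the data: fall back to equal probabilities
        return {c: 1 / len(priors) for c in priors}
    return {c: value / evidence for c, value in unnormalized.items()}

# Hypothetical numbers for two classes and two features.
priors = {"a": 0.3, "b": 0.7}
likelihoods = {"a": [0.5, 0.2], "b": [0.1, 0.4]}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```

Because $P(d)$ is the same for every class, it can be recovered as the sum of the unnormalized values, so it never has to be estimated separately.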
