37. Expectation Maximization and Gaussian Mixture Models (GMM)

By Tobias Schlagenhauf. Last modified: 19 Feb 2025.

On this page ➤

Context and Key Concepts

The Gaussian Mixture Models (GMM) algorithm is an unsupervised learning algorithm since we do not know any values of a target feature. Further, the GMM is categorized into the clustering algorithms, since it can be used to find clusters in the data. Key concepts you should have heard about are:

Multivariate Gaussian Distribution
Covariance Matrix
Mean vector of multivariate data

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

See our Python training courses

See our Machine Learning with Python training courses

What are Gaussian Mixture models

GMM artwork which shows GMMs on smth. like mist

We want to use Gaussian Mixture models to find clusters in a dataset from which we know (or assume to know) the number of clusters enclosed in this dataset, but we do not know where these clusters are as well as how they are shaped. Finding these clusters is the task of GMM and since we don't have any information instead of the number of clusters, the GMM is an unsupervised approach. To accomplish that, we try to fit a mixture of gaussians to our dataset. That is, we try to find a number of gaussian distributions which can be used to describe the shape of our dataset. A critical point for the understanding is that these gaussian shaped clusters must not be circular shaped as for instance in the KNN approach but can have all shapes a multivariate Gaussian distribution can take. That is, a circle can only change in its diameter whilst a GMM model can (because of its covariance matrix) model all ellipsoid shapes as well. See the following illustration for an example in the two dimensional space.

Illustrates the difference between a Gaussian Mixture model approach and a KNN approach where the KNN approach can only model circles whilst the GMM is able to model all types of ellipsoids

What I have omitted in this illustration is that the position in space of KNN and GMM models is defined by their mean vector. Hence the mean vector gives the space whilst the diameter respectively the covariance matrix defines the shape of KNN and GMM models.

So if we consider an arbitrary dataset like the following:

Illustrates a GMM dataset

How precise can we fit a KNN model to this kind of dataset, if we assume that there are two clusters in the dataset? Well, not so precise since we have overlapping areas where the KNN model is not accurate. This is due to the fact that the KNN clusters are circular shaped whilst the data is of ellipsoid shape. It may even happen that the KNN totally fails as illustrated in the following figure.

Illustrates a dataset to which a KNN model is fitted but which is of ellipsoid shape and hence the KNN model failed to model the data

If we would fit ellipsoids to the data, as we do with the GMM approach, we would be able to model the dataset well, as illustrated in the following figure.

Illustrates a dataset to which a GMM model is fitted and which model the data well

Another weak point of KNN in its original form is that each point is allocated to one cluster, that is, each point either belongs to cluster one or two in our example. So assume, we add some more datapoints in between the two clusters in our illustration above. As you can see, we can still assume that there are two clusters, but in the space between the two clusters are some points where it is not totally clear to which cluster they belong. Tackling this dataset with an classical KNN approach would lead to the result, that each datapoint is allocated to cluster one or cluster two respectively and therewith the KNN algorithm would find a hard cut-off border between the two clusters. Though, as you can see, this is probably not correct for all datapoints since we rather would say that for instance datapoint 1 has a probability of 60% to belong to cluster one and a probability of 40% to belong to cluster two. Hence we want to assign probabilities to the datapoints.

Illlustrates the same dataset as above but with the difference, that some datapoints are added in the space between the two clusters such that they are no longer clearly allocatable to one cluster.

In such a case, a classical KNN approach is rather useless and we need something let's say more flexible or smth. which adds more likelihood to our clustering. Fortunately,the GMM is such a model. Since we do not simply try to model the data with circles but add gaussians to our data this allows us to allocate to each point a likelihood to belong to each of the gaussians. It is clear, and we know, that the closer a datapoint is to one gaussian, the higher is the probability that this point actually belongs to this gaussian and the less is the probability that this point belongs to the other gaussian. Therefore, consider the following illustration where we have added a GMM to the above data and highlighted point 2. This point is much more likely to belong to cluster/gaussian one (C1) than to cluster/gaussian two (C2). Hence, if we would calculate the probability for this point for each cluster we would get smth. like: With a probability of 99% This point belongs to cluster one, and with a probability of 1% to cluster two.

Illlustrates the same dataset as above but with the difference, that some datapoints are added in the space between the two clusters such that they are no longer clearly allocatable.

So let's quickly summarize and recapitulate in which cases we want to use a GMM over a classical KNN approach. If we have data where we assume that the clusters are not defined by simple circles but by more complex, ellipsoid shapes, we prefer the GMM approach over the KNN approach. Additionally, if we want to have soft cut-off borders and therewith probabilities, that is, if we want to know the probability of a datapoint to belong to each of our clusters, we prefer the GMM over the KNN approach. Hence, if there arise the two buzz words probabilities and non-circular during our model selection discussion, we should strongly check the use of the GMM.

So now that we know that we should check the usage of the GMM approach if we want to allocate probabilities to our clusterings or if there are non-circular clusters, we should take a look at how we can build a GMM model. This is derived in the next section of this tutorial. So much for that: We follow a approach called Expectation Maximization (EM).

Maths behind Gaussian Mixture Models (GMM)

To understand the maths behind the GMM concept I strongly recommend to watch the video of Prof. Alexander Ihler about Gaussian Mixture Models and EM. This video gives a perfect insight into what is going on during the calculations of a GMM and I want to build the following steps on top of that video. After you have red the above section and watched this video you will understand the following pseudocode.

So we know that we have to run the E-Step and the M-Step iteratively and maximize the log likelihood function until it converges. Though, we will go into more detail about what is going on during these two steps and how we can compute this in python for one and more dimensional datasets.

I will quickly show the E, M steps here.

Decide how many sources/clusters (c) you want to fit to your data
Initialize the parameters mean $\mu_c$, covariance $\Sigma_c$, and fraction_per_class $\pi_c$ per cluster c

$\underline{E-Step}$

Calculate for each datapoint $x_i$ the probability $r_{ic}$ that datapoint $x_i$ belongs to cluster c with:

$$r_{ic} = \frac{\pi_c N(\boldsymbol{x_i} \ | \ \boldsymbol{\mu_c},\boldsymbol{\Sigma_c})}{\Sigma_{k=1}^K \pi_k N(\boldsymbol{x_i \ | \ \boldsymbol{\mu_k},\boldsymbol{\Sigma_k}})}$$ where $N(\boldsymbol{x} \ | \ \boldsymbol{\mu},\boldsymbol{\Sigma})$ describes the mulitvariate Gaussian with:

$$N(\boldsymbol{x_i},\boldsymbol{\mu_c},\boldsymbol{\Sigma_c}) \ = \ \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma_c|^{\frac{1}{2}}}exp(-\frac{1}{2}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T\boldsymbol{\Sigma_c^{-1}}(\boldsymbol{x_i}-\boldsymbol{\mu_c}))$$

$r_{ic}$ gives us for each datapoint $x_i$ the measure of: $\frac{Probability \ that \ x_i \ belongs \ to \ class \ c}{Probability \ of \ x_i \ over \ all \ classes}$ hence if $x_i$ is very close to one gaussian c, it will get a high $r_{ic}$ value for this gaussian and relatively low values otherwise.

$\underline{M-Step}$

For each cluster c: Calculate the total weight $m_c$ (loosely speaking the fraction of points allocated to cluster c) and update $\pi_c$, $\mu_c$, and $\Sigma_c$ using $r_{ic}$ with:

$$m_c \ = \ \Sigma_i r_{ic}$$

$$\pi_c \ = \ \frac{m_c}{m}$$

$$\boldsymbol{\mu_c} \ = \ \frac{1}{m_c}\Sigma_i r_{ic} \boldsymbol{x_i} $$

$$\boldsymbol{\Sigma_c} \ = \ \frac{1}{m_c}\Sigma_i r_{ic}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T(\boldsymbol{x_i}-\boldsymbol{\mu_c})$$

Mind that you have to use the updated means in this last formula.

Iteratively repeat the E and M step until the log-likelihood function of our model converges where the log likelihood is computed with:

$$ln \ p(\boldsymbol{X} \ | \ \boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) \ = \ \Sigma_{i=1}^N \ ln(\Sigma_{k=1}^K \pi_k N(\boldsymbol{x_i} \ | \ \boldsymbol{\mu_k},\boldsymbol{\Sigma_k}))$$

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Upcoming online Courses

Efficient Data Analysis with Pandas

Python and Machine Learning Course

Machine Learning from Data Preparation to Deep Learning

See our Python training courses

See our Machine Learning with Python training courses

GMM in Python from scratch

To understand how we can implement the above in Python, we best go through the single steps, step by step. Therefore, we best start with the following situation:

Illustrates simply some one dimensional points on the x axis in a x,y coordinate system which can be clustered in three clusters

What can you say about this data? Well, we may see that there are kind of three data clusters. Further, we know that our goal is to automatically fit gaussians (in this case it should be three) to this dataset. Now first of all, lets draw three randomly drawn gaussians on top of that data and see if this brings us any further.

Illustrates simply some one dimensional points on the x axis in a x,y coordinate system which can be clustered in three clusters. On top of that data we have drawn three randomly choses gaussians.

What do we have now? Well, we have the datapoints and we have three randomly chosen gaussian models on top of that datapoints. Remember that we want to have three gaussian models fitted to our three data-clusters. So how can we cause these three randomly chosen guassians to fit the data automatically? Well, here we use an approach called Expectation-Maximization (EM). This approach can, in principal, be used for many different models but it turns out that it is especially popular for the fitting of a bunch of Gaussians to data. I won't go into detail about the principal EM algorithm itself and will only talk about its application for GMM. If you want to read more about it I recommend the chapter about General Statement of EM Algorithm in Mitchel (1997) pp.194. But don't panic, in principal it works always the same.

Ok, now we know that we want to use something called Expectation Maximization. This term consists of two parts: Expectation and Maximzation. Well, how can we combine the data and above randomly drawn gaussians with the first term Expectation? Lets try to simply calculate the probability for each datapoint in our dataset for each gaussian, that it the probability that this datapoint belongs to each of the three gaussians. Since we have to store these probabilities somewhere, we introduce a new variable and call this variable $r$. We use $r$ for convenience purposes to kind of have a container where we can store the probability that datapoint $x_i$ belongs to gaussian $c$. We denote this probability with $r_{ic}$. What we get as result is an nxK array where n denotes the number of datapoints in $x$ and K denotes the number of clusters/gaussians. Hm let's try this for our data and see what we get.

import numpy as np
from scipy.stats import norm
np.random.seed(0)

X = np.linspace(-5,5,num=20)
X0 = X*np.random.rand(len(X))+10 # Create data cluster 1
X1 = X*np.random.rand(len(X))-10 # Create data cluster 2
X2 = X*np.random.rand(len(X)) # Create data cluster 3
X_tot = np.stack((X0,X1,X2)).flatten() # Combine the clusters to get the random datapoints from above



"""Create the array r with dimensionality nxK"""
r = np.zeros((len(X_tot),3))  
print('Dimensionality','=',np.shape(r))

"""Instantiate the random gaussians"""

gauss_1 = norm(loc=-5,scale=5) 
gauss_2 = norm(loc=8,scale=3)
gauss_3 = norm(loc=1.5,scale=1)

"""
Probability for each datapoint x_i to belong to gaussian g 
"""
for c,g in zip(range(3),[gauss_1,gauss_2,gauss_3]):
    r[:,c] = g.pdf(X_tot) # Write the probability that x belongs to gaussian c in column c. 
                          # Therewith we get a 60x3 array filled with the probability that each x_i belongs to one of the gaussians
"""
Normalize the probabilities such that each row of r sums to 1
"""
for i in range(len(r)):
    r[i] = r[i]/np.sum(r,axis=1)[i] 

"""In the last calculation we normalized the probabilites r_ic. So each row i in r gives us the probability for x_i 
to belong to one gaussian (one column per gaussian). Since we want to know the probability that x_i belongs 
to gaussian g, we have to do smth. like a simple calculation of percentage where we want to know how likely it is in % that
x_i belongs to gaussian g. To realize this, we must dive the probability of each r_ic by the total probability r_i (this is done by 
summing up each row in r and divide each value r_ic by sum(np.sum(r,axis=1)[r_i] )). To get this,
look at the above plot and pick an arbitrary datapoint. Pick one gaussian and imagine the probability that this datapoint
belongs to this gaussian. This value will normally be small since the point is relatively far away right? So what is
the percentage that this point belongs to the chosen gaussian? --> Correct, the probability that this datapoint belongs to this 
gaussian divided by the sum of the probabilites for this datapoint for all three gaussians."""
    
print(r)
print(np.sum(r,axis=1)) # As we can see, as result each row sums up to one, just as we want it.

OUTPUT:

Dimensionality = (60, 3)
[[2.97644006e-02 9.70235407e-01 1.91912550e-07]
 [3.85713024e-02 9.61426220e-01 2.47747304e-06]
 [2.44002651e-02 9.75599713e-01 2.16252823e-08]
 [1.86909096e-02 9.81309090e-01 8.07574590e-10]
 [1.37640773e-02 9.86235923e-01 9.93606589e-12]
 [1.58674083e-02 9.84132592e-01 8.42447356e-11]
 [1.14191259e-02 9.88580874e-01 4.48947365e-13]
 [1.34349421e-02 9.86565058e-01 6.78305927e-12]
 [1.11995848e-02 9.88800415e-01 3.18533028e-13]
 [8.57645259e-03 9.91423547e-01 1.74498648e-15]
 [7.64696969e-03 9.92353030e-01 1.33051021e-16]
 [7.10275112e-03 9.92897249e-01 2.22285146e-17]
 [6.36154765e-03 9.93638452e-01 1.22221112e-18]
 [4.82376290e-03 9.95176237e-01 1.55549544e-22]
 [7.75866904e-03 9.92241331e-01 1.86665135e-16]
 [7.52759691e-03 9.92472403e-01 9.17205413e-17]
 [8.04550643e-03 9.91954494e-01 4.28205323e-16]
 [3.51864573e-03 9.96481354e-01 9.60903037e-30]
 [3.42631418e-03 9.96573686e-01 1.06921949e-30]
 [3.14390460e-03 9.96856095e-01 3.91217273e-35]
 [1.00000000e+00 2.67245688e-12 1.56443629e-57]
 [1.00000000e+00 4.26082753e-11 9.73970426e-49]
 [9.99999999e-01 1.40098281e-09 3.68939866e-38]
 [1.00000000e+00 2.65579518e-10 4.05324196e-43]
 [9.99999977e-01 2.25030673e-08 3.11711096e-30]
 [9.99999997e-01 2.52018974e-09 1.91287930e-36]
 [9.99999974e-01 2.59528826e-08 7.72534540e-30]
 [9.99999996e-01 4.22823192e-09 5.97494463e-35]
 [9.99999980e-01 1.98158593e-08 1.38414545e-30]
 [9.99999966e-01 3.43722391e-08 4.57504394e-29]
 [9.99999953e-01 4.74290492e-08 3.45975850e-28]
 [9.99999876e-01 1.24093364e-07 1.31878573e-25]
 [9.99999878e-01 1.21709730e-07 1.17161878e-25]
 [9.99999735e-01 2.65048706e-07 1.28402556e-23]
 [9.99999955e-01 4.53370639e-08 2.60841891e-28]
 [9.99999067e-01 9.33220139e-07 2.02379180e-20]
 [9.99998448e-01 1.55216175e-06 3.63693167e-19]
 [9.99997285e-01 2.71542629e-06 8.18923788e-18]
 [9.99955648e-01 4.43516655e-05 1.59283752e-11]
 [9.99987200e-01 1.28004505e-05 3.20565446e-14]
 [9.64689131e-01 9.53405294e-03 2.57768163e-02]
 [9.77001731e-01 7.96383733e-03 1.50344317e-02]
 [9.96373670e-01 2.97775078e-03 6.48579562e-04]
 [3.43634425e-01 2.15201653e-02 6.34845409e-01]
 [9.75390877e-01 8.19866977e-03 1.64104537e-02]
 [9.37822997e-01 1.19363656e-02 5.02406373e-02]
 [4.27396946e-01 2.18816340e-02 5.50721420e-01]
 [3.28570544e-01 2.14190231e-02 6.50010433e-01]
 [3.62198108e-01 2.16303800e-02 6.16171512e-01]
 [2.99837196e-01 2.11991858e-02 6.78963618e-01]
 [2.21768797e-01 2.04809383e-02 7.57750265e-01]
 [1.76497129e-01 2.01127714e-02 8.03390100e-01]
 [8.23252013e-02 2.50758227e-02 8.92598976e-01]
 [2.11943183e-01 2.03894641e-02 7.67667353e-01]
 [1.50351209e-01 2.00499057e-02 8.29598885e-01]
 [1.54779991e-01 2.00449518e-02 8.25175057e-01]
 [7.92109803e-02 5.93118654e-02 8.61477154e-01]
 [9.71905134e-02 2.18698473e-02 8.80939639e-01]
 [7.60625670e-02 4.95831879e-02 8.74354245e-01]
 [8.53513721e-02 2.40396004e-02 8.90609028e-01]]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

So you see that we got an array $r$ where each row contains the probability that $x_i$ belongs to any of the gaussians $g$. Hence for each $x_i$ we get three probabilities, one for each gaussian $g$. Ok, so good for now. We now have three probabilities for each $x_i$ and that's fine. Recapitulate our initial goal: We want to fit as many gaussians to the data as we expect clusters in the dataset. Now, probably it would be the case that one cluster consists of more datapoints as another one and therewith the probability for each $x_i$ to belong to this "large" cluster is much greater than belonging to one of the others. How can we address this issue in our above code? Well, we simply code-in this probability by multiplying the probability for each $r_ic$ with the fraction of points we assume to belong to this cluster c. We denote this variable with $\pi_c$.

For illustration purposes, look at the following figure:

$Illustrates the fraction of points pi for each cluster$

So as you can see, the points are approximately equally distributed over the two clusters and hence each $\mu_c$ is $\approx$ $0.5$. If the fractions where more unequally distributed like for instance if only two datapoints would belong to cluster 1, we would have more unbalanced $\mu$'s. The fractions must some to one. So let's implement these weighted classes in our code above. Since we do not know the actual values for our $mu$'s we have to set arbitrary values as well. We will set both $\mu_c$ to 0.5 here and hence we don't get any other results as above since the points are assumed to be equally distributed over the two clusters c.

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import numpy as np
from scipy.stats import norm
np.random.seed(0)

X = np.linspace(-5,5,num=20)
X0 = X*np.random.rand(len(X))+10 # Create data cluster 1
X1 = X*np.random.rand(len(X))-10 # Create data cluster 2
X2 = X*np.random.rand(len(X)) # Create data cluster 3
X_tot = np.stack((X0,X1,X2)).flatten() # Combine the clusters to get the random datapoints from above



"""Create the array r with dimensionality nxK"""
r = np.zeros((len(X_tot),3))  
print('Dimensionality','=',np.shape(r))

"""Instantiate the random gaussians"""

gauss_1 = norm(loc=-5,scale=5) 
gauss_2 = norm(loc=8,scale=3)
gauss_3 = norm(loc=1.5,scale=1)

"""Instantiate the random pi_c"""
pi = np.array([1/3,1/3,1/3]) # We expect to have three clusters 


"""
Probability for each datapoint x_i to belong to gaussian g 
"""
for c,g,p in zip(range(3),[gauss_1,gauss_2,gauss_3],pi):
    r[:,c] = p*g.pdf(X_tot) # Write the probability that x belongs to gaussian c in column c. 
                          # Therewith we get a 60x3 array filled with the probability that each x_i belongs to one of the gaussians
"""
Normalize the probabilities such that each row of r sums to 1 and weight it by pi_c == the fraction of points belonging to 
cluster c
"""
for i in range(len(r)):
    r[i] = r[i]/(np.sum(pi)*np.sum(r,axis=1)[i])

"""In the last calculation we normalized the probabilites r_ic. So each row i in r gives us the probability for x_i 
to belong to one gaussian (one column per gaussian). Since we want to know the probability that x_i belongs 
to gaussian g, we have to do smth. like a simple calculation of percentage where we want to know how likely it is in % that
x_i belongs to gaussian g. To realize this we must dive the probability of each r_ic by the total probability r_i (this is done by 
summing up each row in r and divide each value r_ic by sum(np.sum(r,axis=1)[r_i] )). To get this,
look at the above plot and pick an arbitrary datapoint. Pick one gaussian and imagine the probability that this datapoint
belongs to this gaussian. This value will normally be small since the point is relatively far away right? So what is
the percentage that this point belongs to the chosen gaussian? --> Correct, the probability that this datapoint belongs to this 
gaussian divided by the sum of the probabilites for this datapoint and all three gaussians. Since we don't know how many
point belong to each cluster c and threwith to each gaussian c, we have to make assumptions and in this case simply said that we 
assume that the points are equally distributed over the three clusters."""
    
print(r)
print(np.sum(r,axis=1)) # As we can see, as result each row sums up to one, just as we want it.

OUTPUT:

Dimensionality = (60, 3)
[[2.97644006e-02 9.70235407e-01 1.91912550e-07]
 [3.85713024e-02 9.61426220e-01 2.47747304e-06]
 [2.44002651e-02 9.75599713e-01 2.16252823e-08]
 [1.86909096e-02 9.81309090e-01 8.07574590e-10]
 [1.37640773e-02 9.86235923e-01 9.93606589e-12]
 [1.58674083e-02 9.84132592e-01 8.42447356e-11]
 [1.14191259e-02 9.88580874e-01 4.48947365e-13]
 [1.34349421e-02 9.86565058e-01 6.78305927e-12]
 [1.11995848e-02 9.88800415e-01 3.18533028e-13]
 [8.57645259e-03 9.91423547e-01 1.74498648e-15]
 [7.64696969e-03 9.92353030e-01 1.33051021e-16]
 [7.10275112e-03 9.92897249e-01 2.22285146e-17]
 [6.36154765e-03 9.93638452e-01 1.22221112e-18]
 [4.82376290e-03 9.95176237e-01 1.55549544e-22]
 [7.75866904e-03 9.92241331e-01 1.86665135e-16]
 [7.52759691e-03 9.92472403e-01 9.17205413e-17]
 [8.04550643e-03 9.91954494e-01 4.28205323e-16]
 [3.51864573e-03 9.96481354e-01 9.60903037e-30]
 [3.42631418e-03 9.96573686e-01 1.06921949e-30]
 [3.14390460e-03 9.96856095e-01 3.91217273e-35]
 [1.00000000e+00 2.67245688e-12 1.56443629e-57]
 [1.00000000e+00 4.26082753e-11 9.73970426e-49]
 [9.99999999e-01 1.40098281e-09 3.68939866e-38]
 [1.00000000e+00 2.65579518e-10 4.05324196e-43]
 [9.99999977e-01 2.25030673e-08 3.11711096e-30]
 [9.99999997e-01 2.52018974e-09 1.91287930e-36]
 [9.99999974e-01 2.59528826e-08 7.72534540e-30]
 [9.99999996e-01 4.22823192e-09 5.97494463e-35]
 [9.99999980e-01 1.98158593e-08 1.38414545e-30]
 [9.99999966e-01 3.43722391e-08 4.57504394e-29]
 [9.99999953e-01 4.74290492e-08 3.45975850e-28]
 [9.99999876e-01 1.24093364e-07 1.31878573e-25]
 [9.99999878e-01 1.21709730e-07 1.17161878e-25]
 [9.99999735e-01 2.65048706e-07 1.28402556e-23]
 [9.99999955e-01 4.53370639e-08 2.60841891e-28]
 [9.99999067e-01 9.33220139e-07 2.02379180e-20]
 [9.99998448e-01 1.55216175e-06 3.63693167e-19]
 [9.99997285e-01 2.71542629e-06 8.18923788e-18]
 [9.99955648e-01 4.43516655e-05 1.59283752e-11]
 [9.99987200e-01 1.28004505e-05 3.20565446e-14]
 [9.64689131e-01 9.53405294e-03 2.57768163e-02]
 [9.77001731e-01 7.96383733e-03 1.50344317e-02]
 [9.96373670e-01 2.97775078e-03 6.48579562e-04]
 [3.43634425e-01 2.15201653e-02 6.34845409e-01]
 [9.75390877e-01 8.19866977e-03 1.64104537e-02]
 [9.37822997e-01 1.19363656e-02 5.02406373e-02]
 [4.27396946e-01 2.18816340e-02 5.50721420e-01]
 [3.28570544e-01 2.14190231e-02 6.50010433e-01]
 [3.62198108e-01 2.16303800e-02 6.16171512e-01]
 [2.99837196e-01 2.11991858e-02 6.78963618e-01]
 [2.21768797e-01 2.04809383e-02 7.57750265e-01]
 [1.76497129e-01 2.01127714e-02 8.03390100e-01]
 [8.23252013e-02 2.50758227e-02 8.92598976e-01]
 [2.11943183e-01 2.03894641e-02 7.67667353e-01]
 [1.50351209e-01 2.00499057e-02 8.29598885e-01]
 [1.54779991e-01 2.00449518e-02 8.25175057e-01]
 [7.92109803e-02 5.93118654e-02 8.61477154e-01]
 [9.71905134e-02 2.18698473e-02 8.80939639e-01]
 [7.60625670e-02 4.95831879e-02 8.74354245e-01]
 [8.53513721e-02 2.40396004e-02 8.90609028e-01]]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

So this was the Expectation step. Let's quickly recapitulate what we have done and let's visualize the above (with different colors due to illustration purposes).

"""Plot the data"""

fig = plt.figure(figsize=(10,10))
ax0 = fig.add_subplot(111)


for i in range(len(r)):
    ax0.scatter(X_tot[i],0,c=np.array([r[i][0],r[i][1],r[i][2]]),s=100) # We have defined the first column as red, the second as
                                                                        # green and the third as blue



for g,c in zip([gauss_1.pdf(np.linspace(-15,15)),gauss_2.pdf(np.linspace(-15,15)),gauss_3.pdf(np.linspace(-15,15))],['r','g','b']):
    ax0.plot(np.linspace(-15,15),g,c=c,zorder=0)

    
    
plt.show()

expectation-maximization-and-gaussian-mixture-models-gmm: Graph 0

So have fitted three arbitrarily chosen gaussian models to our dataset. Therefore we have introduced a new variable which we called $r$ and in which we have stored the probability for each point $x_i$ to belong to gaussian $g$ or to cluster c, respectively. Next we have plotted the $x_i$ points and colored according to their probabilities for the three clusters. You can see that the points which have a very high probability to belong to one specific gaussian, has the color of this gaussian while the points which are between two gaussians have a color which is a mixture of the colors of the corresponding gaussians.

So in a more mathematical notation and for multidimensional cases (here the single mean value $\mu$ for the calculation of each gaussian changes to a mean vector $\boldsymbol{\mu}$ with one entry per dimension and the single variance value $\sigma^2$ per gaussian changes to a nxn covariance matrix $\boldsymbol{\Sigma}$ where n is the number of dimensions in the dataset.) the Expectation step of the EM algorithm looks like:

$\underline{E-Step}$

Decide how many sources/clusters (c) you want to fit to your data --> Mind that each cluster c is represented by gaussian g
Initialize the parameters mean $\mu_c$, covariance $\Sigma_c$, and fraction_per_class $\pi_c$ per cluster c
Calculate for each datapoint $x_i$ the probability $r_{ic}$ that datapoint $x_i$ belongs to cluster c with:

$$r_{ic} = \frac{\pi_c N(\boldsymbol{x_i} \ | \ \boldsymbol{\mu_c},\boldsymbol{\Sigma_c})}{\Sigma_{k=1}^K \pi_k N(\boldsymbol{x_i \ | \ \boldsymbol{\mu_k},\boldsymbol{\Sigma_k}})}$$ where $N(\boldsymbol{x} \ | \ \boldsymbol{\mu},\boldsymbol{\Sigma})$ describes the mulitvariate Gaussian with:

$$N(\boldsymbol{x_i},\boldsymbol{\mu_c},\boldsymbol{\Sigma_c}) \ = \ \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma_c|^{\frac{1}{2}}}exp(-\frac{1}{2}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T\boldsymbol{\Sigma_c^{-1}}(\boldsymbol{x_i}-\boldsymbol{\mu_c}))$$

$r_{ic}$ gives us for each datapoint $x_i$ the measure of: $\frac{Probability \ that \ x_i \ belongs \ to \ class \ c}{Probability \ of \ x_i \ over \ all \ classes}$ hence if $x_i$ is very close to one gaussian g, it will get a high $r_{ic}$ value for this gaussian and relatively low values otherwise.

So why did this help us? Well, we now know the probability for each point to belong to each gaussian.
What can we do with this information? Well, with this information we can calculate a new mean as well as a new variance (in 1D) or covariance matrix in > 1D datasets. What will be the result of that? Well, this would change the location of each gaussian in the direction of the "real" mean and would re-shape each gaussian using a value for the variance which is closer to the "real" variance. This procedure is called the Maximization step of the EM algorithm. The Maximization step looks as follows:

$\underline{M-Step}$

For each cluster c: Calculate the total weight $m_c$ (loosely speaking the fraction of points allocated to cluster c) and update $\pi_c$, $\mu_c$, and $\Sigma_c$ using $r_{ic}$ with:

$$m_c \ = \ \Sigma_i r_{ic}$$

$$\pi_c \ = \ \frac{m_c}{m}$$

$$\boldsymbol{\mu_c} \ = \ \frac{1}{m_c}\Sigma_i r_{ic} \boldsymbol{x_i} $$

$$\boldsymbol{\Sigma_c} \ = \ \Sigma_i r_{ic}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T(\boldsymbol{x_i}-\boldsymbol{\mu_c})$$

Mind that you have to use the updated means in this last formula.

So here $m_c$ is simply the fraction of points allocated to cluster c. To understand this, assume that in our $r$ we don't have probabilities between 0 and 1 but simply 0 or 1. That is, a 1 if $x_i$ belongs to $c$ and a 0 otherwise (So each row would contain one 1 and two 0 in our example above). In this case $m_c$ would be simply the number of 1s per column, that is, the number of $x_i$ allocated to each cluster $c$. But since we have probabilities, we do not simply have ones and zeros but the principal is the same --> We sum up the probabilities $r_ic$ over all i to get $m_c$. Then $\pi_c$ is simply the fraction of points which belong to cluster c as above. We then calculate the new mean (vector) $\boldsymbol{\mu_c}$ by summing up the product of each value $x_i$ and the corresponding probability that this point belongs to cluster c ($r_{ic}$) and divide this sum by "the number of points" in c ($m_c$). Remember that if we had zeros and ones, this would be a completely normal mean calculation but since we have probabilities, we divide by the sum of these probabilities per cluster c. Also the new covariance matrix ($\boldsymbol{\Sigma_c}$) is updated by calculating the covariance matrix per class c ($\boldsymbol{\Sigma_c}$) weighted by the probability that point $x_i$ belongs to cluster c. We could also write $\boldsymbol{\Sigma_c}=m_c(\boldsymbol{x}-\boldsymbol{\mu_c})^T((\boldsymbol{x}-\boldsymbol{\mu_c}))$.
so let's look at our plot if we do the above updates, that is run the first EM iteration .

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import numpy as np
from scipy.stats import norm
np.random.seed(0)

X = np.linspace(-5,5,num=20)
X0 = X*np.random.rand(len(X))+10 # Create data cluster 1
X1 = X*np.random.rand(len(X))-10 # Create data cluster 2
X2 = X*np.random.rand(len(X)) # Create data cluster 3
X_tot = np.stack((X0,X1,X2)).flatten() # Combine the clusters to get the random datapoints from above

"""
E-Step
"""

"""Create the array r with dimensionality nxK"""
r = np.zeros((len(X_tot),3))  

"""Instantiate the random gaussians"""

gauss_1 = norm(loc=-5,scale=5) 
gauss_2 = norm(loc=8,scale=3)
gauss_3 = norm(loc=1.5,scale=1)

"""Instantiate the random mu_c"""
m = np.array([1/3,1/3,1/3]) # We expect to have three clusters 

pi = m/np.sum(m)


"""
Probability for each datapoint x_i to belong to gaussian g 
"""
for c,g,p in zip(range(3),[gauss_1,gauss_2,gauss_3],pi):
    r[:,c] = p*g.pdf(X_tot) # Write the probability that x belongs to gaussian c in column c. 
                          # Therewith we get a 60x3 array filled with the probability that each x_i belongs to one of the gaussians
"""
Normalize the probabilities such that each row of r sums to 1 and weight it by mu_c == the fraction of points belonging to 
cluster c
"""
for i in range(len(r)):
    r[i] = r[i]/(np.sum(pi)*np.sum(r,axis=1)[i])


"""M-Step"""


"""calculate m_c"""
m_c = []
for c in range(len(r[0])):
    m = np.sum(r[:,c])
    m_c.append(m) # For each cluster c, calculate the m_c and add it to the list m_c
    
"""calculate pi_c"""
pi_c = []
for m in m_c:
    pi_c.append(m/np.sum(m_c)) # For each cluster c, calculate the fraction of points pi_c which belongs to cluster c

"""calculate mu_c"""
mu_c = np.sum(X_tot.reshape(len(X_tot),1)*r,axis=0)/m_c


"""calculate var_c"""
var_c = []

for c in range(len(r[0])):
    var_c.append((1/m_c[c])*np.dot(((np.array(r[:,c]).reshape(60,1))*(X_tot.reshape(len(X_tot),1)-mu_c[c])).T,(X_tot.reshape(len(X_tot),1)-mu_c[c])))
  

    
    
"""Update the gaussians"""

gauss_1 = norm(loc=mu_c[0],scale=var_c[0]) 
gauss_2 = norm(loc=mu_c[1],scale=var_c[1])
gauss_3 = norm(loc=mu_c[2],scale=var_c[2])



"""Plot the data"""

fig = plt.figure(figsize=(10,10))
ax0 = fig.add_subplot(111)


for i in range(len(r)):
    ax0.scatter(X_tot[i],0,c=np.array([r[i][0],r[i][1],r[i][2]]),s=100) 


"""Plot the gaussians"""


for g,c in zip([gauss_1.pdf(np.sort(X_tot).reshape(60,1)),gauss_2.pdf(np.sort(X_tot).reshape(60,1)),gauss_3.pdf(np.sort(X_tot).reshape(60,1))],['r','g','b']):
    ax0.plot(np.sort(X_tot),g,c=c)

    

    
plt.show()

expectation-maximization-and-gaussian-mixture-models-gmm 2: Graph 1

So as you can see the occurrence of our gaussians changed dramatically after the first EM iteration. Let's update $r$ and illustrate the coloring of the points.

"""update r"""

"""
Probability for each datapoint x_i to belong to gaussian g 
"""
# Mind that we use the new pi_c here
for c,g,p in zip(range(3),[gauss_1,gauss_2,gauss_3],pi):
    r[:,c] = p*g.pdf(X_tot) 
"""
Normalize the probabilities such that each row of r sums to 1 and weight it by mu_c == the fraction of points belonging to 
cluster c
"""
for i in range(len(r)):
    r[i] = r[i]/(np.sum(pi_c)*np.sum(r,axis=1)[i])


"""Plot the data"""

fig = plt.figure(figsize=(10,10))
ax0 = fig.add_subplot(111)



for i in range(len(r)):
    ax0.scatter(X_tot[i],0,c=np.array([r[i][0],r[i][1],r[i][2]]),s=100) # We have defined the first column as red, the second as
                                                                        # green and the third as blue

        
"""Plot the gaussians"""

for g,c in zip([gauss_1.pdf(np.sort(X_tot).reshape(60,1)),gauss_2.pdf(np.sort(X_tot).reshape(60,1)),gauss_3.pdf(np.sort(X_tot).reshape(60,1))],['r','g','b']):
    ax0.plot(np.sort(X_tot),g,c=c)
        
        
        
plt.show()

expectation-maximization-and-gaussian-mixture-models-gmm 3: Graph 2

As you can see, the colors of the datapoints changed due to the adjustment of $r$. This is logical since also the means and the variances of the gaussians changed and therewith the allocation probabilities changed as well. Though, after this first run of our EM algorithm, the results does not look better than our initial, arbitrary starting results isn't it? Lets see what happens if we run the steps above multiple times. This is done by simply looping through the EM steps after we have done out first initializations of $\boldsymbol{\mu_c}$, $\sigma_c^2$ and $\mu_c$. We run the EM for 10 loops and plot the result in each loop. You can observe the progress for each EM loop below.

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import numpy as np
from scipy.stats import norm
np.random.seed(0)

X = np.linspace(-5,5,num=20)
X0 = X*np.random.rand(len(X))+15 # Create data cluster 1
X1 = X*np.random.rand(len(X))-15 # Create data cluster 2
X2 = X*np.random.rand(len(X)) # Create data cluster 3
X_tot = np.stack((X0,X1,X2)).flatten() # Combine the clusters to get the random datapoints from above

class GM1D:

    def __init__(self,X,iterations):
        self.iterations = iterations
        self.X = X
        self.mu = None
        self.pi = None
        self.var = None
  

    def run(self):
        

        """
        Instantiate the random mu, pi and var
        """
        self.mu = [-8,8,5]
        self.pi = [1/3,1/3,1/3]
        self.var = [5,3,1]
        
        
        
        """
        E-Step
        """
        
        for iter in range(self.iterations):

            """Create the array r with dimensionality nxK"""
            r = np.zeros((len(X_tot),3))  

  
            """
            Probability for each datapoint x_i to belong to gaussian g 
            """
            for c,g,p in zip(range(3),[norm(loc=self.mu[0],scale=self.var[0]),
                                       norm(loc=self.mu[1],scale=self.var[1]),
                                       norm(loc=self.mu[2],scale=self.var[2])],self.pi):
                r[:,c] = p*g.pdf(X_tot) # Write the probability that x belongs to gaussian c in column c. 
                                      # Therewith we get a 60x3 array filled with the probability that each x_i belongs to one of the gaussians
            """
            Normalize the probabilities such that each row of r sums to 1 and weight it by mu_c == the fraction of points belonging to 
            cluster c
            """
            for i in range(len(r)):
                r[i] = r[i]/(np.sum(pi)*np.sum(r,axis=1)[i])


            """Plot the data"""

            fig = plt.figure(figsize=(10,10))
            ax0 = fig.add_subplot(111)


            for i in range(len(r)):
                ax0.scatter(self.X[i],0,c=np.array([r[i][0],r[i][1],r[i][2]]),s=100) 


            """Plot the gaussians"""


            for g,c in zip([norm(loc=self.mu[0],scale=self.var[0]).pdf(np.linspace(-20,20,num=60)),
                            norm(loc=self.mu[1],scale=self.var[1]).pdf(np.linspace(-20,20,num=60)),
                            norm(loc=self.mu[2],scale=self.var[2]).pdf(np.linspace(-20,20,num=60))],['r','g','b']):
                ax0.plot(np.linspace(-20,20,num=60),g,c=c)
            
            
            
            """M-Step"""
    

            """calculate m_c"""
            m_c = []
            for c in range(len(r[0])):
                m = np.sum(r[:,c])
                m_c.append(m) # For each cluster c, calculate the m_c and add it to the list m_c

            """calculate pi_c"""
            for k in range(len(m_c)):
                self.pi[k] = (m_c[k]/np.sum(m_c)) # For each cluster c, calculate the fraction of points pi_c which belongs to cluster c

            """calculate mu_c"""
            self.mu = np.sum(self.X.reshape(len(self.X),1)*r,axis=0)/m_c


            """calculate var_c"""
            var_c = []

            for c in range(len(r[0])):
                var_c.append((1/m_c[c])*np.dot(((np.array(r[:,c]).reshape(60,1))*(self.X.reshape(len(self.X),1)-self.mu[c])).T,(self.X.reshape(len(self.X),1)-self.mu[c])))



            plt.show()
    

GM1D = GM1D(X_tot,10)
GM1D.run()

expectation-maximization-and-gaussian-mixture-models-gmm 4: Graph 3

expectation-maximization-and-gaussian-mixture-models-gmm 5: Graph 4

expectation-maximization-and-gaussian-mixture-models-gmm 6: Graph 5

expectation-maximization-and-gaussian-mixture-models-gmm 7: Graph 6

expectation-maximization-and-gaussian-mixture-models-gmm 8: Graph 7

expectation-maximization-and-gaussian-mixture-models-gmm 9: Graph 8

expectation-maximization-and-gaussian-mixture-models-gmm 10: Graph 9

expectation-maximization-and-gaussian-mixture-models-gmm 11: Graph 10

expectation-maximization-and-gaussian-mixture-models-gmm 12: Graph 11

expectation-maximization-and-gaussian-mixture-models-gmm 13: Graph 12

As you can see, our three randomly initialized gaussians have fitted the data. Beautiful, isn't it? The last step would now be to plot the log likelihood function to show that this function converges as the number of iterations becomes large and therewith there will be no improvement in our GMM and we can stop the algorithm to iterate. Since I have introduced this in the multidimensional case below I will skip this step here. But there isn't any magical, just compute the value of the loglikelihood as described in the pseudocode above for each iteration, save these values in a list and plot the values after the iterations. As said, I have implemented this step below and you will see how we can compute it in Python.

So we have now seen that, and how, the GMM works for the one dimensional case. But how can we accomplish this for datasets with more than one dimension? Well, it turns out that there is no reason to be afraid since once you have understand the one dimensional case, everything else is just an adaption and I still have shown in the pseudocode above, the formulas you need for the multidimensional case. So the difference to the one dimensional case is that our datasets do no longer consist of one column but have multiple columns and therewith our above $x_i$ is no longer a scalar but a vector ($\boldsymbol{x_i}$) consisting of as many elements as there are columns in the dataset. Since there are multiple columns, also the mean per class is no longer a scalar but $\mu_c$ but a vector $\boldsymbol{\mu_c}$ consisting of as many elements as there are columns in the dataset. Also, the variance is no longer a scalar for each cluster c ($\sigma^2$) but becomes a covariance matrix $\boldsymbol{\Sigma_c}$ of dimensionality nxn where n is the number of columns (dimensions) in the dataset. The calculations retain the same!

So let's derive the multi dimensional case in Python. I have added comments at all critical steps to help you to understand the code. Additionally, I have wrote the code in such a way that you can adjust how many sources (==clusters) you want to fit and how many iterations you want to run the model. By calling the EMM function with different values for number_of_sources and iterations. The actual fitting of the GMM is done in the run() function. I have also introduced a predict() function which allows us to predict the probabilities of membership for a new, unseen datapoint to belong to the fitted gaussians (clusters). So in principal, the below code is split in two parts: The run() part where we train the GMM and iteratively run through the E and M steps, and the predict() part where we predict the probability for a new datapoint. I recommend to go through the code line by line and maybe plot the result of a line with smth. like plot(result of line 44) if you are unsure what is going on -This procedure has helped the author many times-.

I have to make a final side note: I have introduced a variable called self.reg_cov. This variable is smth. we need to prevent singularity issues during the calculations of the covariance matrices. This is a mathematical problem which could arise during the calculation of the covariance matrix and hence is not critical for the understanding of the GMM itself. Though, it turns out that if we run into a singular covariance matrix, we get an error. To prevent this, we introduce the mentioned variable. For those interested in why we get a singularity matrix and what we can do against it, I will add the section "Singularity issues during the calculations of GMM" at the end of this chapter.

GMM in Python from scratch - multi dimensional case

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
from sklearn.datasets.samples_generator import make_blobs
import numpy as np
from scipy.stats import multivariate_normal


# 0. Create dataset
X,Y = make_blobs(cluster_std=1.5,random_state=20,n_samples=500,centers=3)

# Stratch dataset to get ellipsoid data
X = np.dot(X,np.random.RandomState(0).randn(2,2))


class GMM:

    def __init__(self,X,number_of_sources,iterations):
        self.iterations = iterations
        self.number_of_sources = number_of_sources
        self.X = X
        self.mu = None
        self.pi = None
        self.cov = None
        self.XY = None
        
    

    """Define a function which runs for iterations, iterations"""
    def run(self):
        self.reg_cov = 1e-6*np.identity(len(self.X[0]))
        x,y = np.meshgrid(np.sort(self.X[:,0]),np.sort(self.X[:,1]))
        self.XY = np.array([x.flatten(),y.flatten()]).T
           
                    
        """ 1. Set the initial mu, covariance and pi values"""
        self.mu = np.random.randint(min(self.X[:,0]),max(self.X[:,0]),size=(self.number_of_sources,len(self.X[0]))) # This is a nxm matrix since we assume n sources (n Gaussians) where each has m dimensions
        self.cov = np.zeros((self.number_of_sources,len(X[0]),len(X[0]))) # We need a nxmxm covariance matrix for each source since we have m features --> We create symmetric covariance matrices with ones on the digonal
        for dim in range(len(self.cov)):
            np.fill_diagonal(self.cov[dim],5)


        self.pi = np.ones(self.number_of_sources)/self.number_of_sources # Are "Fractions"
        log_likelihoods = [] # In this list we store the log likehoods per iteration and plot them in the end to check if
                             # if we have converged
            
        """Plot the initial state"""    
        fig = plt.figure(figsize=(10,10))
        ax0 = fig.add_subplot(111)
        ax0.scatter(self.X[:,0],self.X[:,1])
        ax0.set_title('Initial state')
        for m,c in zip(self.mu,self.cov):
            c += self.reg_cov
            multi_normal = multivariate_normal(mean=m,cov=c)
            ax0.contour(np.sort(self.X[:,0]),np.sort(self.X[:,1]),multi_normal.pdf(self.XY).reshape(len(self.X),len(self.X)),colors='black',alpha=0.3)
            ax0.scatter(m[0],m[1],c='grey',zorder=10,s=100)
        
        for i in range(self.iterations):               

            """E Step"""
            r_ic = np.zeros((len(self.X),len(self.cov)))

            for m,co,p,r in zip(self.mu,self.cov,self.pi,range(len(r_ic[0]))):
                co+=self.reg_cov
                mn = multivariate_normal(mean=m,cov=co)
                r_ic[:,r] = p*mn.pdf(self.X)/np.sum([pi_c*multivariate_normal(mean=mu_c,cov=cov_c).pdf(X) for pi_c,mu_c,cov_c in zip(self.pi,self.mu,self.cov+self.reg_cov)],axis=0)

            """
            The above calculation of r_ic is not that obvious why I want to quickly derive what we have done above.
            First of all the nominator:
            We calculate for each source c which is defined by m,co and p for every instance x_i, the multivariate_normal.pdf() value.
            For each loop this gives us a 100x1 matrix (This value divided by the denominator is then assigned to r_ic[:,r] which is in 
            the end a 100x3 matrix).
            Second the denominator:
            What we do here is, we calculate the multivariate_normal.pdf() for every instance x_i for every source c which is defined by
            pi_c, mu_c, and cov_c and write this into a list. This gives us a 3x100 matrix where we have 100 entrances per source c.
            Now the formula wants us to add up the pdf() values given by the 3 sources for each x_i. Hence we sum up this list over axis=0.
            This gives us then a list with 100 entries.
            What we have now is FOR EACH LOOP a list with 100 entries in the nominator and a list with 100 entries in the denominator
            where each element is the pdf per class c for each instance x_i (nominator) respectively the summed pdf's of classes c for each 
            instance x_i. Consequently we can now divide the nominator by the denominator and have as result a list with 100 elements which we
            can then assign to r_ic[:,r] --> One row r per source c. In the end after we have done this for all three sources (three loops)
            and run from r==0 to r==2 we get a matrix with dimensionallity 100x3 which is exactly what we want.
            If we check the entries of r_ic we see that there mostly one element which is much larger than the other two. This is because
            every instance x_i is much closer to one of the three gaussians (that is, much more likely to come from this gaussian) than
            it is to the other two. That is practically speaing, r_ic gives us the fraction of the probability that x_i belongs to class
            c over the probability that x_i belonges to any of the classes c (Probability that x_i occurs given the 3 Gaussians).
            """

            """M Step"""

            # Calculate the new mean vector and new covariance matrices, based on the probable membership of the single x_i to classes c --> r_ic
            self.mu = []
            self.cov = []
            self.pi = []
            log_likelihood = []

            for c in range(len(r_ic[0])):
                m_c = np.sum(r_ic[:,c],axis=0)
                mu_c = (1/m_c)*np.sum(self.X*r_ic[:,c].reshape(len(self.X),1),axis=0)
                self.mu.append(mu_c)

                # Calculate the covariance matrix per source based on the new mean
                self.cov.append(((1/m_c)*np.dot((np.array(r_ic[:,c]).reshape(len(self.X),1)*(self.X-mu_c)).T,(self.X-mu_c)))+self.reg_cov)
                # Calculate pi_new which is the "fraction of points" respectively the fraction of the probability assigned to each source 
                self.pi.append(m_c/np.sum(r_ic)) # Here np.sum(r_ic) gives as result the number of instances. This is logical since we know 
                                                # that the columns of each row of r_ic adds up to 1. Since we add up all elements, we sum up all
                                                # columns per row which gives 1 and then all rows which gives then the number of instances (rows) 
                                                # in X --> Since pi_new contains the fractions of datapoints, assigned to the sources c,
                                                # The elements in pi_new must add up to 1

            
            
            """Log likelihood"""
            log_likelihoods.append(np.log(np.sum([k*multivariate_normal(self.mu[i],self.cov[j]).pdf(X) for k,i,j in zip(self.pi,range(len(self.mu)),range(len(self.cov)))])))

            

            """
            This process of E step followed by a M step is now iterated a number of n times. In the second step for instance,
            we use the calculated pi_new, mu_new and cov_new to calculate the new r_ic which are then used in the second M step
            to calculat the mu_new2 and cov_new2 and so on....
            """

        fig2 = plt.figure(figsize=(10,10))
        ax1 = fig2.add_subplot(111) 
        ax1.set_title('Log-Likelihood')
        ax1.plot(range(0,self.iterations,1),log_likelihoods)
        #plt.show()
    
    """Predict the membership of an unseen, new datapoint"""
    def predict(self,Y):
        # PLot the point onto the fittet gaussians
        fig3 = plt.figure(figsize=(10,10))
        ax2 = fig3.add_subplot(111)
        ax2.scatter(self.X[:,0],self.X[:,1])
        for m,c in zip(self.mu,self.cov):
            multi_normal = multivariate_normal(mean=m,cov=c)
            ax2.contour(np.sort(self.X[:,0]),np.sort(self.X[:,1]),multi_normal.pdf(self.XY).reshape(len(self.X),len(self.X)),colors='black',alpha=0.3)
            ax2.scatter(m[0],m[1],c='grey',zorder=10,s=100)
            ax2.set_title('Final state')
            for y in Y:
                ax2.scatter(y[0],y[1],c='orange',zorder=10,s=100)
        prediction = []        
        for m,c in zip(self.mu,self.cov):  
            #print(c)
            prediction.append(multivariate_normal(mean=m,cov=c).pdf(Y)/np.sum([multivariate_normal(mean=mean,cov=cov).pdf(Y) for mean,cov in zip(self.mu,self.cov)]))
        #plt.show()
        return prediction
         
    
    
GMM = GMM(X,3,50)     
GMM.run()
GMM.predict([[0.5,0.5]])

OUTPUT:

[3.5799079955839772e-06, 0.00013180910068621356, 0.9998646109913182]

expectation-maximization-and-gaussian-mixture-models-gmm 14: Graph 13

expectation-maximization-and-gaussian-mixture-models-gmm 15: Graph 14

expectation-maximization-and-gaussian-mixture-models-gmm 16: Graph 15

So now we have seen that we can create an arbitrary dataset, fit a GMM to this data which is first finding gaussian distributed clusters (sources) in this dataset and second allows us to predict the membership probability of an unseen datapoint to these sources.

What can we do with this model at the end of the day? Well, imagine you get a dataset like the above where someone tells you: "Hey I have a dataset and I know that there are 5 clusters. Unfortunately, I don't know which label belongs to which cluster, and hence I have a unlabeled dataset. Can you help me to find the clusters?". You can answer: "Yeah, I can by using a GMM approach!". Your opposite is delightful. A few days later the same person knocks on your door and says: "Hey I want to thank you one more time for you help. I want to let you know that I now have a new datapoint for for which I know it's target value. Can you do smth. useful with it?" you answer: "Well, I think I can. By using our created GMM model on this new datapoint, I can calculate the probability of membership of this datapoint to belong to each of the clusters. If we are lucky and our calculations return a very high probability for this datapoint for one cluster we can assume that all the datapoints belonging to this cluster have the same target value as this datapoint. Therewith, we can label all the unlabeled datapoints of this cluster (given that the clusters are tightly clustered -to be sure-). Therewith we can make a unlabeled dataset a (partly) labeled dataset and use some kind of supervised learning algorithms in the next step. Cool, isn't it?

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

See our Python training courses

See our Machine Learning with Python training courses

GMM using sklearn

So now we will create a GMM Model using the prepackaged sklearn.mixture.GaussianMixture method. As we can see, the actual set up of the algorithm, that is the instantiation as well as the calling of the fit() method does take us only one line of code. Cool isn't it?

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
from sklearn.datasets.samples_generator import make_blobs
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# 0. Create dataset
X,Y = make_blobs(cluster_std=0.5,random_state=20,n_samples=1000,centers=5)

# Stratch dataset to get ellipsoid data
X = np.dot(X,np.random.RandomState(0).randn(2,2))

x,y = np.meshgrid(np.sort(X[:,0]),np.sort(X[:,1]))
XY = np.array([x.flatten(),y.flatten()]).T

GMM = GaussianMixture(n_components=5).fit(X) # Instantiate and fit the model
print('Converged:',GMM.converged_) # Check if the model has converged
means = GMM.means_ 
covariances = GMM.covariances_


# Predict
Y = np.array([[0.5],[0.5]])
prediction = GMM.predict_proba(Y.T)
print(prediction)

# Plot   
fig = plt.figure(figsize=(10,10))
ax0 = fig.add_subplot(111)
ax0.scatter(X[:,0],X[:,1])
ax0.scatter(Y[0,:],Y[1,:],c='orange',zorder=10,s=100)
for m,c in zip(means,covariances):
    multi_normal = multivariate_normal(mean=m,cov=c)
    ax0.contour(np.sort(X[:,0]),np.sort(X[:,1]),multi_normal.pdf(XY).reshape(len(X),len(X)),colors='black',alpha=0.3)
    ax0.scatter(m[0],m[1],c='grey',zorder=10,s=100)
    
plt.show()

OUTPUT:

Converged: True
[[9.36305075e-82 1.94756664e-93 4.00098007e-33 5.02664415e-44
  1.00000000e+00]]

expectation-maximization-and-gaussian-mixture-models-gmm 17: Graph 16

So as you can see, we get very nice results. Congratulations, Done!

Singularity issues during the calculations of GMM

This section will give an insight into what is happening that leads to a singular covariance matrix during the fitting of an GMM to a dataset, why this is happening, and what we can do to prevent that.

Therefore we best start by recapitulating the steps during the fitting of a Gaussian Mixture Model to a dataset.

Decide how many sources/clusters (c) you want to fit to your data
Initialize the parameters mean $\mu_c$, covariance $\Sigma_c$, and fraction_per_class $\pi_c$ per cluster c

$\underline{E-Step}$

Calculate for each datapoint $x_i$ the probability $r_{ic}$ that datapoint $x_i$ belongs to cluster c with:

$$r_{ic} = \frac{\pi_c N(\boldsymbol{x_i} \ | \ \boldsymbol{\mu_c},\boldsymbol{\Sigma_c})}{\Sigma_{k=1}^K \pi_k N(\boldsymbol{x_i \ | \ \boldsymbol{\mu_k},\boldsymbol{\Sigma_k}})}$$ where $N(\boldsymbol{x} \ | \ \boldsymbol{\mu},\boldsymbol{\Sigma})$ describes the mulitvariate Gaussian with:

$$N(\boldsymbol{x_i},\boldsymbol{\mu_c},\boldsymbol{\Sigma_c}) \ = \ \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma_c|^{\frac{1}{2}}}exp(-\frac{1}{2}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T\boldsymbol{\Sigma_c^{-1}}(\boldsymbol{x_i}-\boldsymbol{\mu_c}))$$

$r_{ic}$ gives us for each datapoint $x_i$ the measure of: $\frac{Probability \ that \ x_i \ belongs \ to \ class \ c}{Probability \ of \ x_i \ over \ all \ classes}$ hence if $x_i$ is very close to one gaussian c, it will get a high $r_{ic}$ value for this gaussian and relatively low values otherwise.

$\underline{M-Step}$

For each cluster c: Calculate the total weight $m_c$ (loosely speaking the fraction of points allocated to cluster c) and update $\pi_c$, $\mu_c$, and $\Sigma_c$ using $r_{ic}$ with:

$$m_c \ = \ \Sigma_i r_ic$$

$$\pi_c \ = \ \frac{m_c}{m}$$

$$\boldsymbol{\mu_c} \ = \ \frac{1}{m_c}\Sigma_i r_{ic} \boldsymbol{x_i} $$

$$\boldsymbol{\Sigma_c} \ = \ \frac{1}{m_c}\Sigma_i r_{ic}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T(\boldsymbol{x_i}-\boldsymbol{\mu_c})$$
Mind that you have to use the updated means in this last formula.

Iteratively repeat the E and M step until the log-likelihood function of our model converges where the log likelihood is computed with:

$$ln \ p(\boldsymbol{X} \ | \ \boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) \ = \ \Sigma_{i=1}^N \ ln(\Sigma_{k=1}^K \pi_k N(\boldsymbol{x_i} \ | \ \boldsymbol{\mu_k},\boldsymbol{\Sigma_k}))$$

So now we have derived the single steps during the calculation we have to consider what it mean for a matrix to be singular. A matrix is singular if it is not invertible. A matrix is invertible if there is a matrix $X$ such that $AX = XA = I$. If this is not given, the matrix is said to be singular. That is, a matrix like:

\begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}

is not invertible and following singular. It is also plausible, that if we assume that the above matrix is matrix $A$ there could not be a matrix $X$ which gives dotted with this matrix the identity matrix $I$ (Simply take this zero matrix and dot-product it with any other 2x2 matrix and you will see that you will always get the zero matrix). But why is this a problem for us? Well, consider the formula for the multivariate normal above. There you would find $\boldsymbol{\Sigma_c^{-1}}$ which is the invertible of the covariance matrix. Since a singular matrix is not invertible, this will throw us an error during the computation.

So now that we know how a singular, not invertible matrix looks like and why this is important to us during the GMM calculations, how could we ran into this issue? First of all, we get this $\boldsymbol{0}$ covariance matrix above if the Multivariate Gaussian falls into one point during the iteration between the E and M step. This could happen if we have for instance a dataset to which we want to fit 3 gaussians but which actually consists only of two classes (clusters) such that loosely speaking, two of these three gaussians catch their own cluster while the last gaussian only manages it to catch one single point on which it sits. We will see how this looks like below. But step by step: Assume you have a two dimensional dataset which consist of two clusters but you don't know that and want to fit three gaussian models to it, that is c = 3. You initialize your parameters in the E step and plot the gaussians on top of your data which looks smth. like (maybe you can see the two relatively scattered clusters on the bottom left and top right):

Illustrates multivariate Gaussians

Having initialized the parameter, you iteratively do the E, T steps. During this procedure the three Gaussians are kind of wandering around and searching for their optimal place. If you observe the model parameters, that is $\mu_c$ and $\pi_c$ you will observe that they converge, that it after some number of iterations they will no longer change and therewith the corresponding Gaussian has found its place in space. In the case where you have a singularity matrix you encounter smth. like:

Illustrates multivatriate gaussians

Where I have circled the third gaussian model with red. So you see, that this Gaussian sits on one single datapoint while the two others claim the rest. Here I have to notice that to be able to draw the figure like that I already have used covariance-regularization which is a method to prevent singularity matrices and is described below.

Ok , but now we still do not know why and how we encounter a singularity matrix. Therefore we have to look at the calculations of the $r_{ic}$ and the $cov$ during the E and M steps. If you look at the $r_{ic}$ formula again:

$$r_{ic} = \frac{\pi_c N(\boldsymbol{x_i} \ | \ \boldsymbol{\mu_c},\boldsymbol{\Sigma_c})}{\Sigma_{k=1}^K \pi_k N(\boldsymbol{x_i \ | \ \boldsymbol{\mu_k},\boldsymbol{\Sigma_k}})}$$ you see that there the $r_{ic}$'s would have large values if they are very likely under cluster c and low values otherwise. To make this more apparent consider the case where we have two relatively spread gaussians and one very tight gaussian and we compute the $r_{ic}$ for each datapoint $x_i$ as illustrated in the figure:

Illustrates multivatriate gaussians

So go through the datapoints from left to right and imagine you would write down the probability for each $x_i$ that it belongs to the red, blue and yellow gaussian. What you can see is that for most of the $x_i$ the probability that it belongs to the yellow gaussian is very little. In the case above where the third gaussian sits onto one single datapoint, $r_{ic}$ is only larger than zero for this one datapoint while it is zero for every other $x_i$. (collapses onto this datapoint --> This happens if all other points are more likely part of gaussian one or two and hence this is the only point which remains for gaussian three --> The reason why this happens can be found in the interaction between the dataset itself in the initializaion of the gaussians. That is, if we had chosen other initial values for the gaussians, we would have seen another picture and the third gaussian maybe would not collapse). This is sufficient if you further and further spikes this gaussian. The $r_{ic}$ table then looks smth. like:

Illustrates multivatriate gaussians

As you can see, the $r_{ic}$ of the third column, that is for the third gaussian are zero instead of this one row. If we look up which datapoint is represented here we get the datapoint: [ 23.38566343 8.07067598]. Ok, but why do we get a singularity matrix in this case? Well, and this is our last step, therefore we have to once more consider the calculation of the covariance matrix which is:

$$\boldsymbol{\Sigma_c} \ = \ \Sigma_i r_{ic}(\boldsymbol{x_i}-\boldsymbol{\mu_c})^T(\boldsymbol{x_i}-\boldsymbol{\mu_c})$$ we have seen that all $r_{ic}$ are zero instead for the one $x_i$ with [23.38566343 8.07067598]. Now the formula wants us to calculate $(\boldsymbol{x_i}-\boldsymbol{\mu_c})$. If we look at the $\boldsymbol{\mu_c}$ for this third gaussian we get [23.38566343 8.07067598]. Oh, but wait, that exactly the same as $x_i$ and that's what Bishop wrote with:"Suppose that one of the components of the mixture model, let us say the $j$ th component, has its mean $\boldsymbol{\mu_j}$ exactly equal to one of the data points so that $\boldsymbol{\mu_j} = \boldsymbol{x_n}$ for some value of n" (Bishop, 2006, p.434). So what will happen? Well, this term will be zero and hence this datapoint was the only chance for the covariance-matrix not to get zero (since this datapoint was the only one where $r_{ic}$>0), it now gets zero and looks like:

\begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}

Consequently as said above, this is a singular matrix and will lead to an error during the calculations of the multivariate gaussian. So how can we prevent such a situation. Well, we have seen that the covariance matrix is singular if it is the $\boldsymbol{0}$ matrix. Hence to prevent singularity we simply have to prevent that the covariance matrix becomes a $\boldsymbol{0}$ matrix. This is done by adding a very little value (in sklearn's GaussianMixture this value is set to 1e-6) to the digonal of the covariance matrix. There are also other ways to prevent singularity such as noticing when a gaussian collapses and setting its mean and/or covariance matrix to a new, arbitrarily high value(s). This covariance regularization is also implemented in the code below with which you get the described results. Maybe you have to run the code several times to get a singular covariance matrix since, as said. this must not happen each time but also depends on the initial set up of the gaussians.

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

See our Python training courses

See our Machine Learning with Python training courses

References

VanderPlas,J. (2017). Python Data Science Handbook. Sebastopol: O'Reilly, pp.465
Mitchel,T. (1997). Machine Learning. McGraw-Hill: New York, pp.191
Bishop,P. (2006). Pattern Recognition and Machine Learning. Berlin, Heidelberg: Springer, pp.430
https://www.youtube.com/watch?v=qMTuMa86NzU
https://www.youtube.com/watch?v=iQoXFmbXRJA
https://www.youtube.com/watch?v=zL_MHtT56S0
https://www.youtube.com/watch?v=BWXd5dOkuTo

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Upcoming online Courses

Efficient Data Analysis with Pandas

Python and Machine Learning Course

Machine Learning from Data Preparation to Deep Learning

See our Python training courses

See our Machine Learning with Python training courses