1. Preliminary Topics
2. Gaussian Mixture Model (GMM) and Expectation-Maximization (EM) Algorithm
3. EM Algorithm for Univariate GMM
4. EM Algorithm for Multivariate GMM
5. Summary
1. Preliminary Topics
1.1 Gaussian Distribution
The Gaussian distribution is very widely used to fit random data. The probability density of a one-dimensional random variable $x$ is:

$$ N(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

where

- $\mu$ is the mean or expectation of the distribution,
- $\sigma$ is the standard deviation, and $\sigma^{2}$ is the variance.
More generally, when the data is $d$-dimensional, it can be fit by a multivariate Gaussian model. The probability density is:

$$ N(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) $$

where
- $x$ is a d-by-N matrix, representing N samples of d-dimensional random data,
- $\mu$ is a d-by-1 vector, representing the mean of each dimension,
- $\Sigma$ is a d-by-d matrix, representing the covariance matrix.
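As a quick illustration, both densities above can be evaluated directly with NumPy. This is a minimal sketch; the function names are my own, not from the text:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density for one d-dimensional sample x."""
    d = len(mu)
    diff = np.asarray(x) - np.asarray(mu)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)
```

For example, the standard normal density at $x=0$ is $1/\sqrt{2\pi} \approx 0.3989$, and a 2-D standard normal at the origin gives $1/(2\pi)$.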
1.2 Jensen’s Inequality
Here are statements of Jensen's inequality in the context of probability theory. These will be used to simplify the target function in the EM derivation.

Theorem. For a convex function $f$ and a random variable $x$:

$$ E[f(x)] \ge f(E[x]) $$

Theorem. For a concave function $f$ and a random variable $x$:

$$ E[f(x)] \le f(E[x]) $$

In either case, equality holds when $x$ is constant (almost surely).
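The two inequalities can be sanity-checked numerically. A small Monte Carlo sketch; the distributions chosen here are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Convex f(x) = x^2:  E[f(x)] >= f(E[x])
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
lhs_convex, rhs_convex = np.mean(x ** 2), np.mean(x) ** 2

# Concave f(x) = log(x), checked on positive samples:  E[f(x)] <= f(E[x])
y = rng.exponential(scale=1.0, size=100_000)
lhs_concave, rhs_concave = np.mean(np.log(y)), np.log(np.mean(y))
```

Here `lhs_convex` is about $\mu^2 + \sigma^2 = 5$ against `rhs_convex` of about $1$, so the convex inequality holds with a wide margin.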
1.3 Matrix Derivatives
In order to solve for the parameters of a Gaussian mixture model, we need some rules for derivatives with respect to a vector or a matrix. Here are the equations used later, cited from The Matrix Cookbook (with $W$ symmetric and $X$ invertible):

$$ \frac{\partial (x-s)^T W (x-s)}{\partial x} = 2W(x-s) $$

$$ \frac{\partial \ln \det(X)}{\partial X^{-1}} = -X^T $$

$$ \frac{\partial a^T X a}{\partial X} = aa^T $$
2. Gaussian Mixture Model (GMM) and Expectation-Maximization (EM) Algorithm
2.1 GMM
A complex real-world data set normally consists of a mixture of multiple stochastic processes, so a single Gaussian distribution cannot fit it. Instead, a Gaussian mixture model is used to describe a combination of $K$ Gaussian distributions.
Suppose we have a training set of $N$ independent data points $x = \{x_1, x_2, \dots, x_i, \dots, x_N\}$, and the values show multiple peaks. We can model this data set by a Gaussian mixture model:

$$ p(x_i \mid \Theta) = \sum_{k=1}^{K} \alpha_k \textit{N}(x_i \mid \mu_k, \sigma_k) $$
When each data sample $x_i$ is $d$-dimensional and the data set $x$ appears to scatter into multiple clusters, the data can be modeled by a multivariate Gaussian mixture model:

$$ p(x_i \mid \Theta) = \sum_{k=1}^{K} \alpha_k \textit{N}(x_i \mid \mu_k, \Sigma_k) $$
In these models, $\Theta$ denotes all parameters, and $\alpha_k$ is the prior probability of the $k^{th}$ Gaussian component, with

$$ \sum_{k=1}^{K} \alpha_k = 1, \qquad \alpha_k \ge 0 $$
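The generative view of a GMM can be made concrete by sampling from it: first draw a component index $k$ with probability $\alpha_k$, then draw from that component's Gaussian. A sketch (function name is mine):

```python
import numpy as np

def sample_gmm(alphas, mus, sigmas, n, seed=0):
    """Draw n samples from a univariate GMM: pick component k with
    probability alpha_k, then sample from N(mu_k, sigma_k^2)."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(alphas), size=n, p=alphas)          # component index per sample
    return rng.normal(np.asarray(mus)[ks], np.asarray(sigmas)[ks])

# Example: a 2-component mixture with weights 0.3 / 0.7
samples = sample_gmm([0.3, 0.7], [-5.0, 5.0], [1.0, 1.0], n=100_000)
```

A histogram of `samples` shows the two peaks described in the text; the overall mean is $0.3 \cdot (-5) + 0.7 \cdot 5 = 2$.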
2.2 EM
The expectation-maximization (EM) algorithm is an iterative unsupervised training algorithm. The task is formulated as:
- We have a training set of $N$ independent data points $x$.
- We either know, or have a good guess, that the data set is a mixture of $K$ Gaussian distributions.
- The task is to estimate the GMM parameters: $K$ sets of ($\alpha_{k}$, $\mu_k$, $\sigma_k$), or ($\alpha_{k}$, $\mu_k$, $\Sigma_{k}$).
Likelihood Function:
For estimation problems based on a data set of independent samples, maximum-likelihood estimation (MLE) is a widely used and straightforward method.
The probability of $N$ independent observations is the product of the probability of each observation. This is called the likelihood function:

$$ L(\Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta) $$
MLE is to estimate the parameters $\Theta$ by maximizing the likelihood function.
When MLE is applied directly, the likelihood functions of univariate and multivariate Gaussian mixture models are very complicated.
Estimating the $K$ sets of Gaussian parameters directly and explicitly is difficult. The EM algorithm simplifies the likelihood function of the GMM and provides an iterative way to optimize the estimation. Here we briefly describe the EM algorithm for GMM parameter estimation.
First, the likelihood function of a GMM can be simplified by taking the log-likelihood:

$$ \ell(\Theta) = \sum_{i=1}^{N} \log p(x_i \mid \Theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \alpha_k \textit{N}(x_i \mid \mu_k, \sigma_k) $$

A formula in summation form is easier for separating independent data samples and taking derivatives with respect to the parameters.
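The log-likelihood of a univariate GMM can be computed in a few vectorized lines. A sketch under the notation above (the function name is mine):

```python
import numpy as np

def gmm_log_likelihood(x, alphas, mus, sigmas):
    """sum_i log sum_k alpha_k N(x_i | mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)[:, None]                  # shape (N, 1)
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    # Per-component densities, shape (N, K), by broadcasting
    dens = np.exp(-((x - mus) ** 2) / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
    return float(np.sum(np.log(dens @ np.asarray(alphas))))
```

With a single component ($\alpha_1 = 1$, $\mu = 0$, $\sigma = 1$) and one sample $x = 0$, this returns $\log(1/\sqrt{2\pi}) = -\tfrac{1}{2}\log(2\pi)$.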
Latent Parameters:
There is no efficient way to explicitly maximize the log-likelihood function of the GMM above. The EM algorithm introduces a latent variable $z \in \{1, 2, \dots, k, \dots, K\}$, used to describe the probability that a given training sample $x_i$ belongs to cluster $z$, given the full GMM parameters: $p(z = k \mid x_i, \Theta)$.
Introduce the latent variable $z$ into the probability distribution of $x_i$:

$$ p(x_i \mid \Theta) = \sum_{k=1}^{K} p(x_i \mid z=k, \mu_k, \sigma_k)\, p(z=k) $$
Compared with $p(x \mid \Theta) = \sum_{k}\alpha_{k}\textit{N}(x \mid \mu_k, \sigma_k)$,
we can conclude that $\alpha_k$ is the prior probability $p(z=k)$, and the conditional probability of $x$ given $z=k$ is the $k^{th}$ Gaussian model: $p(x \mid z=k, \mu_k, \sigma_k) = \textit{N}(x \mid \mu_k, \sigma_k)$.
Now the latent variable can be introduced into the log-likelihood function. Note that a redundant term $p(z \mid x_{i}, \mu_k, \sigma_k)$ is added in order to match the form of Jensen's inequality:

$$ \ell(\Theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} p(z=k \mid x_i, \mu_k, \sigma_k)\, \frac{p(x_i \mid z=k, \mu_k, \sigma_k)\, p(z=k)}{p(z=k \mid x_i, \mu_k, \sigma_k)} $$
Simplify the Likelihood function:
However, the summation inside a log function makes it difficult to maximize. Here, recall Jensen's inequality for the concave function $\log$: $\log E[u] \ge E[\log u]$.
Let $u$ represent $\frac{p(x_{i} \mid z=k, \mu_k, \sigma_k)\, p(z=k)}{p(z=k \mid x_{i},\mu_k, \sigma_k)}$ to match Jensen's inequality.
We get

$$ \sum_{i=1}^{N} \log \sum_{k=1}^{K} p(z=k \mid x_i, \mu_k, \sigma_k)\, u \;\ge\; \sum_{i=1}^{N} \sum_{k=1}^{K} p(z=k \mid x_i, \mu_k, \sigma_k) \log u $$

Therefore,

$$ \ell(\Theta) \ge \sum_{i=1}^{N} \sum_{k=1}^{K} p(z=k \mid x_i, \mu_k, \sigma_k)\, \log \frac{\alpha_k\, \textit{N}(x_i \mid \mu_k, \sigma_k)}{p(z=k \mid x_i, \mu_k, \sigma_k)} $$
The posterior probability can be derived by Bayes' law:

$$ p(z=k \mid x_i, \mu_k, \sigma_k) = \frac{\alpha_k \textit{N}(x_i \mid \mu_k, \sigma_k)}{\sum_{j=1}^{K} \alpha_j \textit{N}(x_i \mid \mu_j, \sigma_j)} $$

Define

$$ \omega_{i,k} = p(z=k \mid x_i, \mu_k, \sigma_k) $$

Then

$$ \ell(\Theta) \ge \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{i,k} \log \frac{\alpha_k \textit{N}(x_i \mid \mu_k, \sigma_k)}{\omega_{i,k}} $$
This equation defines a lower bound for the log-likelihood function. Therefore, an iterative target function for the EM algorithm is defined:

$$ Q(\Theta, \Theta^{t}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{i,k}^{t} \log \frac{\alpha_k \textit{N}(x_i \mid \mu_k, \sigma_k)}{\omega_{i,k}^{t}} $$
After $t$ iterations, we have obtained $\Theta^t$, and hence the latent $\omega_{i,k}^t$. Apply the latest latent parameters in $Q(\Theta,\Theta^{t})$, and then update $\Theta^{t+1}$ by maximization.
Iterative Optimization:
First the parameters $\Theta$ are initialized, and then $\omega$ and $\Theta$ are updated iteratively:

1. After iteration $t$, a set of parameters $\Theta^t$ has been obtained.
2. Calculate the latent parameters $\omega_{i,k}^t$ by applying $\Theta^t$ to the GMM. This is called the expectation step (E-step).
3. Apply the latest latent parameters $\omega_{i,k}^t$ to the target function, which was derived by simplifying the log-likelihood function with Jensen's inequality.
4. With $\omega_{i,k}^t$, maximize the target log-likelihood function to update the GMM parameters $\Theta^{t+1}$. This is called the maximization step (M-step).
3. EM Algorithm for Univariate GMM
The complete form of the EM target function for a univariate GMM is

$$ Q(\Theta, \Theta^{t}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{i,k}^{t} \log \frac{\alpha_k \textit{N}(x_i \mid \mu_k, \sigma_k)}{\omega_{i,k}^{t}} $$
3.1 E-Step:
The E-step estimates the latent parameters of each training sample on the $K$ Gaussian models. Hence the latent parameter $\omega$ is an N-by-K matrix.
On every iteration, $\omega_{i,k}$ is calculated from the latest Gaussian parameters $(\alpha_k, \mu_k, \sigma_k)$:

$$ \omega_{i,k}^t = \frac{\alpha_{k}^t \textit{N}(x_{i} \mid \mu_k^t, \sigma_k^t)}{\sum_{j=1}^{K}\alpha_{j}^t \textit{N}(x_{i} \mid \mu_j^t, \sigma_j^t)} $$
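The E-step is a straightforward row-normalization of weighted densities. A sketch (helper name is mine):

```python
import numpy as np

def e_step(x, alphas, mus, sigmas):
    """omega[i, k] = alpha_k N(x_i | mu_k, sigma_k) / sum_j alpha_j N(x_i | mu_j, sigma_j)."""
    x = np.asarray(x, dtype=float)[:, None]                 # shape (N, 1)
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    dens = np.exp(-((x - mus) ** 2) / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
    weighted = dens * np.asarray(alphas)                    # numerator, shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)   # each row sums to 1
```

For two well-separated components, each sample's responsibility concentrates almost entirely on its own component, and each row of $\omega$ sums to 1 by construction.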
3.2 M-Step:
The target likelihood function can be expanded to decouple the terms clearly:

$$ Q(\Theta, \Theta^{t}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{i,k}^{t} \left[ \log\alpha_k - \log\sigma_k - \frac{(x_i - \mu_k)^2}{2\sigma_k^2} - \frac{1}{2}\log(2\pi) - \log\omega_{i,k}^{t} \right] $$
Update $\alpha_k:$
As defined in GMM, $\alpha_k$ is constrained by $\sum_{k}\alpha_k =1$, so estimating $\alpha_k$ is a constrained optimization problem.
The method of Lagrange multipliers is used to find the local maximum of such a constrained optimization problem. We can construct a Lagrangian function:

$$ L(\alpha, \lambda) = Q(\Theta, \Theta^{t}) + \lambda \left( \sum_{k=1}^{K} \alpha_k - 1 \right) $$

The local maximum $\alpha_{k}^{t+1}$ should make the derivative of the Lagrangian function equal to 0. Hence,

$$ \frac{\partial L}{\partial \alpha_k} = \sum_{i=1}^{N} \frac{\omega_{i,k}^{t}}{\alpha_k} + \lambda = 0 $$

By multiplying through by $\alpha_k$ and summing the equation over all $k$ (using $\sum_{k}\omega_{i,k}^{t} = 1$), the value of $\lambda$ can be calculated:

$$ \sum_{k=1}^{K} \sum_{i=1}^{N} \omega_{i,k}^{t} + \lambda \sum_{k=1}^{K} \alpha_k = 0 \;\Rightarrow\; \lambda = -N $$
Therefore, $\alpha_k$ at iteration $t+1$, based on the latent parameters at iteration $t$, is updated by

$$ \alpha_k^{t+1} = \frac{\sum_{i=1}^{N} \omega_{i,k}^{t}}{N} $$
Update $\mu_k:$
$\mu_k$ is unconstrained, and its update can be derived by taking the derivative of the target likelihood function:

$$ \frac{\partial Q(\Theta,\Theta^{t})}{\partial \mu_k} = \sum_{i=1}^{N} \omega_{i,k}^{t}\, \frac{x_i - \mu_k}{\sigma_k^2} $$

Let $\frac{\partial Q(\Theta,\Theta^{t})}{\partial \mu_k}=0$; hence

$$ \sum_{i=1}^{N} \omega_{i,k}^{t} (x_i - \mu_k) = 0 $$

Hence $\mu_k$ at iteration $t+1$ is updated as a weighted mean of $x$:

$$ \mu_k^{t+1} = \frac{\sum_{i=1}^{N} \omega_{i,k}^{t} x_i}{\sum_{i=1}^{N} \omega_{i,k}^{t}} $$
Update $\sigma_k:$
Similarly, the updated $\sigma_k$ is derived by taking the derivative of the target likelihood function with respect to $\sigma_k$. Let

$$ \frac{\partial Q(\Theta,\Theta^{t})}{\partial \sigma_k} = \sum_{i=1}^{N} \omega_{i,k}^{t} \left( -\frac{1}{\sigma_k} + \frac{(x_i - \mu_k)^2}{\sigma_k^3} \right) = 0 $$

We get

$$ (\sigma_k^2)^{t+1} = \frac{\sum_{i=1}^{N} \omega_{i,k}^{t} (x_i - \mu_k^{t+1})^2}{\sum_{i=1}^{N} \omega_{i,k}^{t}} $$
For $\sigma_k$, we can update $\sigma_k^2$ directly, which is enough for Gaussian model calculations. The new $\sigma_k^2$ depends on $\mu_k$, so normally $\mu_k^{t+1}$ is calculated first and then applied in the update equation for $\sigma_k^2$.
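The three update equations of the univariate M-step can be written together. A sketch; `omega` is the N-by-K responsibility matrix from the E-step, and the function name is mine:

```python
import numpy as np

def m_step(x, omega):
    """Update (alpha_k, mu_k, sigma_k^2) from responsibilities omega of shape (N, K)."""
    x = np.asarray(x, dtype=float)
    Nk = omega.sum(axis=0)                                       # effective sample count per component
    alphas = Nk / len(x)                                         # alpha_k = sum_i omega_ik / N
    mus = (omega * x[:, None]).sum(axis=0) / Nk                  # weighted mean
    sigmas2 = (omega * (x[:, None] - mus) ** 2).sum(axis=0) / Nk # uses the new mus, as noted above
    return alphas, mus, sigmas2
```

With a single component and all responsibilities equal to 1, this reduces to the ordinary sample mean and (biased) sample variance, as expected.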
4. EM Algorithm for Multivariate GMM
Similarly, the target likelihood function for a multivariate GMM is

$$ Q(\Theta, \Theta^{t}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{i,k}^{t} \log \frac{\alpha_k \textit{N}(x_i \mid \mu_k, \Sigma_k)}{\omega_{i,k}^{t}} $$
Be aware that
- $x_i$ is a d-by-1 vector,
- $\alpha_k$ is a real number in $[0, 1]$,
- $\mu_k$ is a d-by-1 vector,
- $\Sigma_k$ is a d-by-d matrix,
- $\omega$ is an N-by-K matrix.
4.1 E-Step:
The E-step estimating the latent parameters is the same as for the univariate GMM, except that the Gaussian distribution is a multivariate one, which is more complicated:

$$ \omega_{i,k}^t = \frac{\alpha_{k}^t \textit{N}(x_{i} \mid \mu_k^t, \Sigma_k^t)}{\sum_{j=1}^{K}\alpha_{j}^t \textit{N}(x_{i} \mid \mu_j^t, \Sigma_j^t)} $$

The target likelihood function can be expanded:

$$ Q(\Theta, \Theta^{t}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_{i,k}^{t} \left[ \log\alpha_k - \frac{1}{2}\log\det(\Sigma_k) - \frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) - \frac{d}{2}\log(2\pi) - \log\omega_{i,k}^{t} \right] $$
4.2 M-Step:
Update $\alpha_{k}:$
The formula to update $\alpha_k$ for a multivariate GMM is exactly the same as for a univariate GMM, because $\alpha_k$ appears only in the $\log\alpha_k$ term. Hence we get the same update equation:

$$ \alpha_k^{t+1} = \frac{\sum_{i=1}^{N} \omega_{i,k}^{t}}{N} $$
Update $\mu_k:$
Taking the derivative of $Q(\Theta,\Theta^{t})$ with respect to $\mu_k$, we get

$$ \frac{\partial Q(\Theta,\Theta^{t})}{\partial \mu_k} = -\frac{1}{2} \sum_{i=1}^{N} \omega_{i,k}^{t}\, \frac{\partial (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)}{\partial \mu_k} $$

As the covariance matrix $\Sigma_{k}$ is symmetric, its inverse is also symmetric. We can apply $\frac{\partial (x-s)^T W (x-s)}{\partial x} = 2W(x-s)$ (see the first section) to the partial derivative:

$$ \frac{\partial Q(\Theta,\Theta^{t})}{\partial \mu_k} = \sum_{i=1}^{N} \omega_{i,k}^{t}\, \Sigma_k^{-1} (x_i - \mu_k) = 0 $$

Hence $\mu_k$ at iteration $t+1$ is also updated as a weighted mean of $x$; however, in this scenario $\mu_k$ is a d-by-1 vector:

$$ \mu_k^{t+1} = \frac{\sum_{i=1}^{N} \omega_{i,k}^{t} x_i}{\sum_{i=1}^{N} \omega_{i,k}^{t}} $$
Update $\Sigma_k:$
Let $\frac{\partial Q(\Theta,\Theta^{t})}{\partial \Sigma_k^{-1}} = 0$. By employing $\frac{\partial \ln \det(X)}{\partial X^{-1}} = -X^T$ and $\frac{\partial a^T X a}{\partial X} = aa^T$ (see section one) for the symmetric covariance matrix $\Sigma_k$, we get

$$ \frac{\partial Q(\Theta,\Theta^{t})}{\partial \Sigma_k^{-1}} = \frac{1}{2} \sum_{i=1}^{N} \omega_{i,k}^{t} \left( \Sigma_k - (x_i - \mu_k)(x_i - \mu_k)^T \right) = 0 $$

Similarly, we get the update equation for $\Sigma_k$ at iteration $t+1$; it depends on $\mu_k$, so again $\mu_k^{t+1}$ is calculated first and then applied in the update equation for $\Sigma_k$:

$$ \Sigma_k^{t+1} = \frac{\sum_{i=1}^{N} \omega_{i,k}^{t} (x_i - \mu_k^{t+1})(x_i - \mu_k^{t+1})^T}{\sum_{i=1}^{N} \omega_{i,k}^{t}} $$
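The multivariate M-step differs from the univariate one only in that $\mu_k$ becomes a vector and $\sigma_k^2$ becomes an outer-product covariance. A sketch; `X` is the N-by-d data matrix, `omega` the N-by-K responsibilities, and the function name is mine:

```python
import numpy as np

def m_step_multivariate(X, omega):
    """Update (alphas, mus, Sigmas) from X of shape (N, d) and omega of shape (N, K)."""
    N, d = X.shape
    K = omega.shape[1]
    Nk = omega.sum(axis=0)                         # effective count per component, shape (K,)
    alphas = Nk / N
    mus = (omega.T @ X) / Nk[:, None]              # (K, d) weighted means
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]                          # (N, d), uses the new mu_k as noted above
        Sigmas[k] = (omega[:, k, None] * diff).T @ diff / Nk[k]
    return alphas, mus, Sigmas
```

Again, with a single component and unit responsibilities this reduces to the sample mean vector and the (biased) sample covariance matrix.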
5. Summary
|  | Univariate GMM | Multivariate GMM |
| --- | --- | --- |
| Init | $$\alpha_{k}^0, \mu_k^0, \sigma_k^0$$ | $$\alpha_{k}^0, \mu_k^0, \Sigma_k^0$$ |
| E-Step | $$\omega_{i,k}^t = \frac{\alpha_{k}^t\textit{N}(x_{i} \mid \mu_k^t, \sigma_k^t)}{\sum_{k}\alpha_{k}^t\textit{N}(x_{i} \mid \mu_k^t, \sigma_k^t)}$$ | $$\omega_{i,k}^t = \frac{\alpha_{k}^t\textit{N}(x_{i} \mid \mu_k^t, \Sigma_k^t)}{\sum_{k}\alpha_{k}^t\textit{N}(x_{i} \mid \mu_k^t, \Sigma_k^t)}$$ |
| M-Step | $$\begin{aligned} \alpha_k^{t+1} &= \frac{\sum_{i}\omega_{i,k}^t}{N}\\ \mu_k^{t+1} &= \frac{\sum_{i}\omega_{i,k}^t x_i}{\sum_{i}\omega_{i,k}^t}\\ (\sigma_k^2)^{t+1} &= \frac{\sum_{i}\omega_{i,k}^t(x_i-\mu_k^{t+1})^2 }{\sum_{i}\omega_{i,k}^t} \end{aligned}$$ | $$\begin{aligned} \alpha_k^{t+1} &= \frac{\sum_{i}\omega_{i,k}^t}{N}\\ \mu_k^{t+1} &= \frac{\sum_{i}\omega_{i,k}^t x_i}{\sum_{i}\omega_{i,k}^t}\\ \Sigma_k^{t+1} &= \frac{\sum_{i}\omega_{i,k}^t (x_i-\mu_k^{t+1})(x_i-\mu_k^{t+1})^T }{\sum_{i}\omega_{i,k}^t} \end{aligned}$$ |
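Putting the table together, here is a minimal end-to-end EM loop for a univariate GMM. This is a sketch: the initialization strategy (quantile-spread means, global standard deviation) and the fixed iteration count are my own choices, not from the text:

```python
import numpy as np

def em_univariate_gmm(x, K, n_iter=100):
    """Fit a K-component univariate GMM using the E/M updates in the summary table."""
    x = np.asarray(x, dtype=float)
    alphas = np.full(K, 1.0 / K)                       # uniform initial weights
    mus = np.quantile(x, np.linspace(0.1, 0.9, K))     # spread initial means over the data range
    sigmas = np.full(K, x.std())                       # initial stds: global std
    for _ in range(n_iter):
        # E-step: responsibilities omega, shape (N, K)
        dens = np.exp(-((x[:, None] - mus) ** 2) / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
        omega = dens * alphas
        omega /= omega.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the table
        Nk = omega.sum(axis=0)
        alphas = Nk / len(x)
        mus = (omega * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((omega * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
    return alphas, mus, sigmas

# Usage: recover two well-separated components
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-5, 1, 1000), rng.normal(5, 1, 1000)])
alphas, mus, sigmas = em_univariate_gmm(data, K=2)
```

For this well-separated synthetic data the recovered means land near $-5$ and $5$ with weights near $0.5$ each; in general, EM only finds a local maximum of the likelihood, so results depend on initialization.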