Clustering: Gaussian Mixture Models
Gaussian mixture models assume that the data is generated from a specified number of Gaussian distributions, known as mixtures. The form of data generating distribution for each function is known (they are Gaussian distributions), but their parameters and prior probabilities are unknown. Also unknown are which mixture each individual data vector was generated by.
The aim is to be able to obtain:

Posteriori probabilities regarding which mixture generated each datum (mixture membership).

The prior probabilities of a datum being generated by a particular mixture.

The parameters of the data generating Gaussian distribution associated with each mixture.
Given knowledge of the prior probabilities and parameters of the mixtures, we can immediately calculate the posterior probability regarding which mixture generated each datum. Calculating this is done by Bayes theorem. Where \(\mathcal{K}\) is a random variable over the mixtures, this is:
\[P(\mathcal{K}=kX=X_i)=\frac{P(X=X_i\mathcal{K}=k)P(\mathcal{K}=k)}{\sum_{j=1}^K P(X=X_i\mathcal{K}=j)P(\mathcal{K}=j)}\]Equally, given the posterior probabilities regarding mixture membership, we are able to estimate both the parameters of the Gaussian distributions associated with each mixture and the prior probabilities of the mixtures: We just need to weight data items using their posterior probabilities of belonging to different mixtures when we calculate the means, covariance and prior probability for the mixtures. (See the update equations in the maximization step of the EM algorithm below.)
Note that this is identical to a missing data problem: It is as if we had a column in our data specifying the mixture generating the known values for the datum, but all values in this column are missing. Such a variable is known as a latent variable. In this case working out the distributions over this latent variable forms part of the goal, but sometimes latent variables are introduced as intermediaries where working out the distribution over their values for each datum is not part of the end goal.
The result is a soft clustering, where each datum is given a probability of being generated by each mixture.
The Expectation Maximization Algorithm
The most common method to acquire the desired values is the ExpectationMaximization (EM) algorithm. It exploits the relationships explained above to in an iterative process:

Current estimates of the parameters of the Gaussian distributions and prior distribution over the mixtures are used to calculate the posterior probabilities for each datum being generated by each mixture. This is the expectation step.

Current estimates of the posterior probabilities for each datum being generated by each mixture are used to calculate the maximum likelihood estimates of the parameters of the Gaussian distributions and prior distribution over the mixtures. This is the maximization step.
This process is typically initiated by a random assignment of the posterior probabilities for each datum being generated by each mixture leading to the first step being a maximization step. It is continued until convergence.
To give the equations for the updates required in each step, we introduce some symbolism:
\(K\)  the number of clusters/mixtures. 
\(N\)  the number of data vectors. 
\(\phi_k^t\)  the prior probability of cluster \(k\) at step \(t\) in the EM process. This is sometimes called the mixture weights. I.e. \(P(\mathcal{K}=k)\) 
\(\mu_k^t\)  the mean parameter of cluster k at step \(t\) in the EM process. 
\(\Sigma_k^t\)  the covariance parameter of cluster k at step \(t\) in the EM process. 
\(\rho_{i,k}^{t}\)  the posterior probability that \(X_i\) belongs to cluster \(k\). I.e. \(P(\mathcal{K}=k\)\(X=X_i)\) 
The update equations for the maximization step are:
\[\phi_k^{t+1} =\frac{1}{N}\sum_{i=1}^N \rho_{i,k}^{t}\] \[\mu_k^{t+1} = \frac{\sum_{i=1}^N \rho_{i,k}^{t} X_i}{\sum_{i=1}^N \rho_{i,k}^{t}}\] \[\Sigma_k^{t+1} = \frac{\sum_{i=1}^N \rho_{i,k}^{t} [X_i  \mu_k^{t+1}][X_i  \mu_k^{t+1}]^T}{\sum_{i=1}^N \rho_{i,k}^{t}}\]The update equations for the expectation step are:
\[\rho_{i,k}^{t+1}=\frac{w_k^t p_k(X_i ; \mu_k^t , \Sigma_k^t)}{\sum_{j=0}^K w_j^t p_j(X_i ; \mu_j^t , \Sigma_j^t)}\]See the aside below about the ‘;’ symbol in probability notation if you are unsure what it means.
The EM algorithm can coverge to local optima, and the common response to this is to run the algorithm several times from different initial asignments of posterior probabilities for each datum being generated by each mixture. This can be time consuming, and alternative approaches exist, including annealing techniques.
Aside: If you are uncertain what the ‘;’ symbol means when used in probability notation, a quick explanation is provided in the Semicolon in probability notation document available in the downloads section at the end of this article.
© Dr Michael Ashcroft