Gaussian mixture models: k-means on steroids

Posted on Sat 22 December 2018 in machine learning • 5 min read

This post is part 3 of the "clustering" series:

  1. Generate datasets to understand some clustering algorithms behavior
  2. K-means is not all about sunshines and rainbows
  3. Gaussian mixture models: k-means on steroids
  4. How many red Christmas baubles on the tree?
  5. Chaining effect in clustering
  6. Animate intermediate results of your algorithm
  7. Accuracy: from classification to clustering evaluation

The k-means algorithm assumes the data is generated by a mixture of Gaussians, each having the same proportion and variance, and no covariance. These assumptions can be alleviated with a more generic algorithm: the CEM algorithm applied on a mixture of Gaussians.

To illustrate this, we will first apply a more generic clustering algorithm than k-means on nine synthetic datasets previously used in this series. The internal criterion and a variation of it will be discussed along with the algorithms to optimize them. The complexity of the algorithms will be exposed: it explains why using the k-means algorithm is often justified instead of a more generic algorithm.

Gaussian mixture models behavior on generated datasets

Let’s first apply a Gaussian mixture model on some datasets. For this purpose, Rmixmod library (I recommend the article on Rxmimod in Journal of Statistical Software for further reading) is used with its default parameters (only the number of clusters is specified).

library("Rmixmod")

processMixmod <- function(data) {
    d = data %>% select(x, y)
    x = mixmodCluster(d, 2)
    x["bestResult"]["partition"]
}

dataMixmod = data %>%
             group_by(dataset) %>%
             do(data.frame(., cluster = factor(processMixmod(.))))

scatterPlotWithLabels(dataMixmod)
CEM algorithm applied on the 9 datasets

One can view from the plots that the models fit better the datasets than k-means which only fits well the first dataset. In particular, the first five datasets are generated by a mixture of Gaussians, hence the Gaussian mixture models works well. The other datasets, except than the two classes of uniform data are not fitted correctly since they are generated differently (by the way, the last dataset exhibits no classes).

The internal criterion optimized by the algorithm used, namely EM, explains why it works for the mixture of Gaussians.

EM and CEM algorithms

The EM algorithm is an iterative algorithm which iterates over two steps: an "expectation" step and a "maximisation" step, hence his name. The CEM algorithm includes an other step of "classification" between the two steps of the EM algorithm. The two algorithms differ also slightly by their criterion.

EM algorithm

With the EM algorithm, we seek to maximize the log likelihood:

\begin{equation*} L_M(\theta) = log(\prod_{i}f(x_i, \theta))= \sum_i \log (\sum_{k=1}^K \pi_k \varphi_k (x_i, \alpha_k)) \end{equation*}

where:

  • \(K\) is the number of clusters;
  • \(\theta = (\pi, \alpha)\) with \(\pi\) is a vector of proportions and \(\alpha\) is a vector of parameters of a component;
  • \(\varphi_k(x_i, \alpha_k)\) is the density of an observation \(x_i\) from the \(k^{th}\) component.

Maximizing the likelihood means that we want to maximize the plausibility of the model parameter values, given the observed data. The EM algorithm applied to a mixture of Gaussians tries to find the parameters of the mixture (the proportions) and the Gaussians (the means and the covariance matrices) that fits best the data. The cluster assignations are then found a posteriori: the points generated by a Gaussian are to be classified in the same cluster.

CEM algorithm

The CEM algorithm considers the cluster assignations as parameters to be learned. The criterion optimized by CEM is:

\begin{equation*} L_C(\theta) = log(f(x, \theta, z))= \sum_{i, k} z_{ik} \log (\pi_k \varphi_k (x_i, \alpha_k)) \end{equation*}

with \(z_{ik} = 1\) if the point \(x_i\) is assigned to the cluster \(k\) and else \(0\).

The CEM algorithm is often prefered to EM when clustering since it aims to learn the cluster assignation directly, not afterwards. It often converges faster that EM in practice. It is also a generalization of k-means.

CEM algorithm as an extension of k-means

The CEM algorithm applied on a mixture of Gaussians with some assumptions on the Gaussians (no covariance and equal variance) and on the mixture (equal proportions) is actually the same as the k-means algorithm. With these assumptions, the criterions are the same (see the previous post on kmeans in this series on clustering), and it can be shown that the steps are also the same.

Let's see this in practice when fitting CEM with a mixture of Gaussians with diagonal variance matrices and equal proportions, volumes and shapes (see page 25 of the reference manual of Rmixmod):

library("Rmixmod")

processMixmod2 <- function(data) {
    d = data %>% select(x, y)
    x = mixmodCluster(d, 2,
                      models=mixmodGaussianModel(listModels=c("Gaussian_p_L_I")),
                      strategy=new("Strategy", algo="CEM"))
    x["bestResult"]["partition"]
}

dataMixmod = data %>%
             group_by(dataset) %>%
             do(data.frame(., cluster = factor(processMixmod2(.))))

scatterPlotWithLabels(dataMixmod)
CEM algorithm applied on the 9 datasets with k-means assumptions

We obtain the same results than with k-means.

It seems that since Gaussian mixture models are more general, there is no reason to use k-means. However, it should be noted that k-means has less parameters to estimate and is thus faster and requires less memory. Let's explore this.

Number of parameters to estimate

When fitting a mixture of Gaussians, we have to estimate the proportions of the mixture, and the parameters of each Gaussian:

  • Since the proportions sum to one, we only have to estimate \(K-1\) proportions, with \(K\) the number of Gaussians.
  • We also have to estimate a mean for each Gaussian, thus \(K d\) parameters with \(d\) the number of dimensions.
  • The covariance matrix is a \(d \times d\) square matrix which is symmetric (the entries below the main diagonal are the same as the entries upper the main diagonal). We have thus \(\frac{d (d-1)}{2}\) values to estimate for each Gaussian.

Overall, we have, for the general form of the mixture of Gaussians (without any conditions) : \(K-1 + Kd + \frac{Kd(d-1)}{2}\) parameters to estimate, with:

  • \(K\) the number of Gaussians (or classes)
  • \(d\) the number of variables

If we set conditions on the model, we can lower the number of parameters to estimate. For instance, if we assume that the covariance matrices are diagonal with equal variance (a scalar times the identity matrix) and that the proportions of the mixture are the same, we have: \(d + 1\) parameters to estimate. It corresponds to the assumptions of k-means.

To sum up

  • Gaussian mixture models with the CEM algorithm can be seen as a generalisation of k-means: whereas k-means fit data generated by a mixture of Gaussians with same proportions and diagonal covariance matrices, Gaussian mixture models allows to enforce other assumptions on the proportions of the mixture and on the covariance matrices of the Gaussians.
  • Setting conditions on the parameters of the Gaussians and the mixture allows to reduce the number of parameters to estimate and thus reduce the runtime of the algorithm.
  • Several algorithms intend to fit a mixture of Gaussians. For instance, the EM algorithm estimates the parameters of the mixture and the classes are found a posteriori. The CEM algorithm estimates instead the parameters and the clusters assignations simultaneously, allowing a boost on the execution time.