en fr

K-means is not all about sunshines and rainbows

Posted on Sun 09 December 2018 in machine learning • 6 min read

This post is part 2 of the "clustering" series:

  1. Generate datasets to understand some clustering algorithms behavior
  2. K-means is not all about sunshines and rainbows
  3. Gaussian mixture models: k-means on steroids
  4. How many red Christmas baubles on the tree?
  5. Chaining effect in clustering
  6. Animate intermediate results of your algorithm
  7. Accuracy: from classification to clustering evaluation

K-means is the most known and used clustering algorithm. It has however several drawbacks and does not behave nicely on some datasets.

In fact, every clustering algorithm has its own strenghts and drawbacks. Each relies on some assumptions on the dataset and leverages these properties to cluster the data into groups. The No Free Lunch Theorem states that "any two algorithms are equivalent when their performance is averaged across all possible problems". It means that we cannot use one algorithm for every dataset. The role of the user is thus to choose the right algorithm for the dataset to handle. To do so, it is important to know how the algorithms behave on several datasets and understand how they will handle new datasets. The properties of the datasets and the assumptions on which the algorithm rely, along with its criterion, can inform us on this.

This blog post presents the k-means clustering algorithm by:

  • applying it with the R language on the datasets generated in a previous post;
  • looking at the assumptions it makes on the dataset;
  • presenting its criterion and the family of algorithms it belongs to;

It is not an extensive overview on k-means: For instance, it does not present the algorithm nor its computational cost. We will also not look at some important problems we face when using it, such as falling into local optima and finding the appropriate number of clusters.

K-means behavior on generated datasets

Let's first see how it behaves on the some generated datasets. We can generate some interesting datasets with the following R code:

library(mvtnorm)
library(dplyr)
library(movMF)

generateGaussianData <- function(n, center, sigma, label) {
    data = rmvnorm(n, mean = center, sigma = sigma)
    data = data.frame(data)
    names(data) = c("x", "y")
    data = data %>% mutate(class=factor(label))
    data
}

dataset1 <- {
    # cluster 1
    n = 500
    center = c(5, 5)
    sigma = matrix(c(1, 0, 0, 1), nrow = 2)
    data1 = generateGaussianData(n, center, sigma, 1)

    # cluster 2
    n = 500
    center = c(1, 1)
    sigma = matrix(c(1, 0, 0, 1), nrow = 2)
    data2 = generateGaussianData(n, center, sigma, 2)

    # all data
    data = bind_rows(data1, data2)

    data$dataset = "1 - Mixture of Gaussians"

    data
}

dataset2 <- {
    # cluster 1
    n = 2000
    center = c(5, 5)
    sigma = matrix(c(1, 0, 0, 1), nrow = 2)
    data1 = generateGaussianData(n, center, sigma, 1)

    # cluster 2
    n = 50
    center = c(1, 1)
    sigma = matrix(c(1, 0, 0, 1), nrow = 2)
    data2 = generateGaussianData(n, center, sigma, 2)

    # all data
    data = bind_rows(data1, data2)

    data$dataset = "2 - Different sizes"

    data
}

dataset3 <- {
    # cluster 1
    n = 100
    center = c(5, 5)
    sigma = matrix(c(3, 0, 0, 3), nrow = 2)
    data1 = generateGaussianData(n, center, sigma, 1)

    # cluster 2
    n = 100
    center = c(1, 1)
    sigma = matrix(c(0.1, 0, 0, 0.1), nrow = 2)
    data2 = generateGaussianData(n, center, sigma, 2)

    # all data
    data = bind_rows(data1, data2)

    data$dataset = "3 - Different variances"

    data
}

dataset4 <- {
    # cluster 1
    n = 500
    center = c(4, 4)
    sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
    data1 = generateGaussianData(n, center, sigma, 1)

    # cluster 2
    n = 500
    center = c(5, 5)
    sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
    data2 = generateGaussianData(n, center, sigma, 2)

    # all data
    data = bind_rows(data1, data2)

    data$dataset = "4 - Non zero covariance"

    data
}

generateMFData <- function(n, theta, cluster, scale=1) {
    data = rmovMF(n, theta)
    data = data.frame(data[,1], data[,2])
    data = scale * data
    names(data) = c("x", "y")
    data = data %>% mutate(class=factor(cluster))
    data
}

dataset6 <- {
    # cluster 1
    n = 100
    data1 = generateMFData(n, 1 * c(-1, 1), 1, 5)

    # cluster 2
    n = 100
    data2 = generateMFData(n, 1 * c(1, -1), 2, 1)

    # all data
    data = bind_rows(data1, data2)

    data$dataset = "6 - Spherical classes"

    data
}

generateSpiralData <- function(n) {
    maxRadius = 7
    xShift = 2.5
    yShift = 2.5
    angleStart = 2.5 * pi
    noiseVariance = 0.1

    # first spiral
    firstSpiral <- function() {
        d1 = data.frame(0:(n-1))
        colnames(d1) <- c("i")
        d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n),
                      radius = maxRadius * (n + n/5 - i)/ (n + n/5),
                      x = radius * sin(angle),
                      y = radius * cos(angle),
                      class=1)
    }
    d1 = firstSpiral()

    # second spiral
    d2 = d1 %>% mutate(x = -x, y = -y, class=2)

    # combine, add noise, and shift
    generateNoise <- function(n) {
        sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2)
        noise = rmvnorm(n, mean = c(0, 0), sigma = sigma)
        df = data.frame(noise)
        colnames(df) <- c("xNoise", "yNoise")
        df
    }
    d1 %>%
        bind_rows(d2) %>%
        bind_cols(generateNoise(2*n)) %>%
        transmute(x = x + xShift + xNoise,
                  y = y + yShift + yNoise,
                  dataset = "7 - Spirals",
                  class = factor(class))
}

dataset7 = generateSpiralData(500)

dataset8 <- {
    generateUniformData <- function(cluster, minX, maxX) {
        x = runif(500, min = minX, max = maxX)
        y = runif(500, min = -4, max = 9)
        data.frame(x, y) %>% mutate(class=cluster)
    }

    generateUniformData(1, -4, 1) %>% bind_rows(generateUniformData(2, 3, 9)) %>%
        mutate(dataset = "8 - Uniform data",
               class = factor(class))
}

dataset9 <- {
    generateuniformdata <- function(cluster) {
        x = runif(500, min = -4, max = 9)
        y = runif(500, min = -4, max = 9)
        data.frame(x, y) %>% mutate(class=cluster)
    }

    generateuniformdata(1) %>% bind_rows(generateuniformdata(2)) %>%
        mutate(dataset = "9 - unclustered data",
               class = factor(class))
}

data = bind_rows(dataset1, dataset2, dataset3, dataset4, dataset5, dataset6, dataset7, dataset8, dataset9)

One can then apply the k-means algorithm on these datasets with following code:

dataKmeans = data %>%
             group_by(dataset) %>%
             do(data.frame(.,
                           cluster = factor(kmeans(cbind(.$x,.$y),
                           centers=2)$cluster)))

Note that we used the do function of dplyr to apply the algorithm on each dataset. The kmeans algorithm comes built-in with R (other implementations are available through libraries).

Plotting the nine datasets and color the points given the clusters found by the algorithm can be done with:

library(ggplot2)

scatterPlotWithLabels <- function(data) {
    scatterPlot = ggplot(data, aes(x=x, y=y, shape=class, color=cluster)) +
        geom_point() +
        coord_fixed() +
        scale_shape_manual(values=c(2, 3)) +
        facet_wrap(~dataset, nrow=3)
    scatterPlot
}

scatterPlotWithLabels(dataKmeans)

It draws the following plot:

k-means applied on the 9 datasets

It works pretty well for the first dataset, but unfortunately not for the others. This behavior is predictable when we know the criterion the algorithm is optimizing and how the data is generated.

K-means assumptions and criterion

For a given number of cluster K, and a set of N vectors \(x_i , i \in [1, N]\), K-means aims to minimize the following criterion:

\begin{equation*} W(z, \mu) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} ||x_i - \mu_k||^2 \end{equation*}

with:

  • \(z_{ik} \in {0, 1}\) indicates if the vector \(x_i\) belongs to the cluster \(k\);
  • \(\mu_k \in \mathbb{R}^p\), the center of the cluster \(k\).

It minimizes the intra-cluster variance, i.e. the sum of the squared distances between the center of the cluster and the objects associated with it.

To understand why the algorithm works well on the first dataset and not on the others, it is helpfull to see that the criterion is a particular case of a broader one: the Gaussian mixture models try to maximize the following criterion:

\begin{equation*} L_C(\theta) = \sum_{i, k} z_{ik} \log (\pi_k \phi_k(x_i, \alpha_k)) \end{equation*}

where:

  • \(\theta = (\pi, \alpha)\) with \(\pi\) is a vector of proportions and \(\alpha\) is a vector of parameters of a component;
  • \(\phi_k(x_i, \alpha_k)\) is the probability of an observation \(x_i\) to be from the \(k^{th}\) component.

If we consider a mixture of Gaussians, the density function is defined by:

\begin{equation*} \phi_k(x; \mu, \Sigma) = \frac{1}{(2 \pi)^{n/2} |\Sigma|^{1/2}} \exp (- \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x-\mu)) \end{equation*}

with:

  • \(n\) the number of dimensions of the vector \(x\);
  • \(\mu\) the mean vector;
  • \(\Sigma\) the covariance matrix.

If we consider a mixture of Gaussians with an identity covariance matrix (same variance and no covariance) and equal proportions (all \(\pi_k\) are equals), we obtain, up to a constant, the criterion of k-means.

Thus it is predictable that the k-means algorithm works well on the first dataset since it fits perfectly the assumptions on the probability distribution generating the objects to cluster.

Take away

Here are a few things to remember from this blog post:

  • There is no clustering algorithm to rule them all: the user should choose carefully the algorithm given the data and the criterion to optimize
  • K-means assumes the data is generated by a mixture of Gaussians with the same variance, the same proportion (same number of objects per class) and no intra-class covariance.

The second point can be see two ways:

  • If you have a prior on the data, you can choose the algorithm to cluster data according to the criterion it optimizes and the prior you have on the data.
  • If you don’t have any prior on the data, you can still apply k-means but should not overinterpret the results. For instance, the size of the clusters is enforced by the algorithm, thus you cannot infer that the underlying classes have equal proportions.