K-means is not all about sunshines and rainbows
Posted on Sun 09 December 2018 in machine learning • 6 min read
This post is part 2 of the "clustering" series:
- Generate datasets to understand some clustering algorithms behavior
- K-means is not all about sunshines and rainbows
- Gaussian mixture models: k-means on steroids
- How many red Christmas baubles on the tree?
- Chaining effect in clustering
- Animate intermediate results of your algorithm
- Accuracy: from classification to clustering evaluation
K-means is the most known and used clustering algorithm. It has however several drawbacks and does not behave nicely on some datasets.
In fact, every clustering algorithm has its own strenghts and drawbacks. Each relies on some assumptions on the dataset and leverages these properties to cluster the data into groups. The No Free Lunch Theorem states that "any two algorithms are equivalent when their performance is averaged across all possible problems". It means that we cannot use one algorithm for every dataset. The role of the user is thus to choose the right algorithm for the dataset to handle. To do so, it is important to know how the algorithms behave on several datasets and understand how they will handle new datasets. The properties of the datasets and the assumptions on which the algorithm rely, along with its criterion, can inform us on this.
This blog post presents the k-means clustering algorithm by:
- applying it with the R language on the datasets generated in a previous post;
- looking at the assumptions it makes on the dataset;
- presenting its criterion and the family of algorithms it belongs to;
It is not an extensive overview on k-means: For instance, it does not present the algorithm nor its computational cost. We will also not look at some important problems we face when using it, such as falling into local optima and finding the appropriate number of clusters.
K-means behavior on generated datasets
Let's first see how it behaves on the some generated datasets. We can generate some interesting datasets with the following R code:
library(mvtnorm)
library(dplyr)
library(movMF)
generateGaussianData <- function(n, center, sigma, label) {
data = rmvnorm(n, mean = center, sigma = sigma)
data = data.frame(data)
names(data) = c("x", "y")
data = data %>% mutate(class=factor(label))
data
}
dataset1 <- {
# cluster 1
n = 500
center = c(5, 5)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 500
center = c(1, 1)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "1 - Mixture of Gaussians"
data
}
dataset2 <- {
# cluster 1
n = 2000
center = c(5, 5)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 50
center = c(1, 1)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "2 - Different sizes"
data
}
dataset3 <- {
# cluster 1
n = 100
center = c(5, 5)
sigma = matrix(c(3, 0, 0, 3), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 100
center = c(1, 1)
sigma = matrix(c(0.1, 0, 0, 0.1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "3 - Different variances"
data
}
dataset4 <- {
# cluster 1
n = 500
center = c(4, 4)
sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 500
center = c(5, 5)
sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "4 - Non zero covariance"
data
}
generateMFData <- function(n, theta, cluster, scale=1) {
data = rmovMF(n, theta)
data = data.frame(data[,1], data[,2])
data = scale * data
names(data) = c("x", "y")
data = data %>% mutate(class=factor(cluster))
data
}
dataset6 <- {
# cluster 1
n = 100
data1 = generateMFData(n, 1 * c(-1, 1), 1, 5)
# cluster 2
n = 100
data2 = generateMFData(n, 1 * c(1, -1), 2, 1)
# all data
data = bind_rows(data1, data2)
data$dataset = "6 - Spherical classes"
data
}
generateSpiralData <- function(n) {
maxRadius = 7
xShift = 2.5
yShift = 2.5
angleStart = 2.5 * pi
noiseVariance = 0.1
# first spiral
firstSpiral <- function() {
d1 = data.frame(0:(n-1))
colnames(d1) <- c("i")
d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n),
radius = maxRadius * (n + n/5 - i)/ (n + n/5),
x = radius * sin(angle),
y = radius * cos(angle),
class=1)
}
d1 = firstSpiral()
# second spiral
d2 = d1 %>% mutate(x = -x, y = -y, class=2)
# combine, add noise, and shift
generateNoise <- function(n) {
sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2)
noise = rmvnorm(n, mean = c(0, 0), sigma = sigma)
df = data.frame(noise)
colnames(df) <- c("xNoise", "yNoise")
df
}
d1 %>%
bind_rows(d2) %>%
bind_cols(generateNoise(2*n)) %>%
transmute(x = x + xShift + xNoise,
y = y + yShift + yNoise,
dataset = "7 - Spirals",
class = factor(class))
}
dataset7 = generateSpiralData(500)
dataset8 <- {
generateUniformData <- function(cluster, minX, maxX) {
x = runif(500, min = minX, max = maxX)
y = runif(500, min = -4, max = 9)
data.frame(x, y) %>% mutate(class=cluster)
}
generateUniformData(1, -4, 1) %>% bind_rows(generateUniformData(2, 3, 9)) %>%
mutate(dataset = "8 - Uniform data",
class = factor(class))
}
dataset9 <- {
generateuniformdata <- function(cluster) {
x = runif(500, min = -4, max = 9)
y = runif(500, min = -4, max = 9)
data.frame(x, y) %>% mutate(class=cluster)
}
generateuniformdata(1) %>% bind_rows(generateuniformdata(2)) %>%
mutate(dataset = "9 - unclustered data",
class = factor(class))
}
data = bind_rows(dataset1, dataset2, dataset3, dataset4, dataset5, dataset6, dataset7, dataset8, dataset9)
One can then apply the k-means algorithm on these datasets with following code:
dataKmeans = data %>%
group_by(dataset) %>%
do(data.frame(.,
cluster = factor(kmeans(cbind(.$x,.$y),
centers=2)$cluster)))
Note that we used the do function of dplyr to apply the algorithm on each dataset. The kmeans algorithm comes built-in with R (other implementations are available through libraries).
Plotting the nine datasets and color the points given the clusters found by the algorithm can be done with:
library(ggplot2)
scatterPlotWithLabels <- function(data) {
scatterPlot = ggplot(data, aes(x=x, y=y, shape=class, color=cluster)) +
geom_point() +
coord_fixed() +
scale_shape_manual(values=c(2, 3)) +
facet_wrap(~dataset, nrow=3)
scatterPlot
}
scatterPlotWithLabels(dataKmeans)
It draws the following plot:
It works pretty well for the first dataset, but unfortunately not for the others. This behavior is predictable when we know the criterion the algorithm is optimizing and how the data is generated.
K-means assumptions and criterion
For a given number of cluster K, and a set of N vectors \(x_i , i \in [1, N]\), K-means aims to minimize the following criterion:
with:
- \(z_{ik} \in {0, 1}\) indicates if the vector \(x_i\) belongs to the cluster \(k\);
- \(\mu_k \in \mathbb{R}^p\), the center of the cluster \(k\).
It minimizes the intra-cluster variance, i.e. the sum of the squared distances between the center of the cluster and the objects associated with it.
To understand why the algorithm works well on the first dataset and not on the others, it is helpfull to see that the criterion is a particular case of a broader one: the Gaussian mixture models try to maximize the following criterion:
where:
- \(\theta = (\pi, \alpha)\) with \(\pi\) is a vector of proportions and \(\alpha\) is a vector of parameters of a component;
- \(\phi_k(x_i, \alpha_k)\) is the probability of an observation \(x_i\) to be from the \(k^{th}\) component.
If we consider a mixture of Gaussians, the density function is defined by:
with:
- \(n\) the number of dimensions of the vector \(x\);
- \(\mu\) the mean vector;
- \(\Sigma\) the covariance matrix.
If we consider a mixture of Gaussians with an identity covariance matrix (same variance and no covariance) and equal proportions (all \(\pi_k\) are equals), we obtain, up to a constant, the criterion of k-means.
Thus it is predictable that the k-means algorithm works well on the first dataset since it fits perfectly the assumptions on the probability distribution generating the objects to cluster.
Take away
Here are a few things to remember from this blog post:
- There is no clustering algorithm to rule them all: the user should choose carefully the algorithm given the data and the criterion to optimize
- K-means assumes the data is generated by a mixture of Gaussians with the same variance, the same proportion (same number of objects per class) and no intra-class covariance.
The second point can be see two ways:
- If you have a prior on the data, you can choose the algorithm to cluster data according to the criterion it optimizes and the prior you have on the data.
- If you don’t have any prior on the data, you can still apply k-means but should not overinterpret the results. For instance, the size of the clusters is enforced by the algorithm, thus you cannot infer that the underlying classes have equal proportions.