Generate datasets to understand some clustering algorithms behavior
Posted on Sun 11 November 2018 in machine learning • 7 min read
This post is part 1 of the "clustering" series:
- Generate datasets to understand some clustering algorithms behavior
- K-means is not all about sunshines and rainbows
- Gaussian mixture models: k-means on steroids
- How many red Christmas baubles on the tree?
- Chaining effect in clustering
- Animate intermediate results of your algorithm
- Accuracy: from classification to clustering evaluation
In order to understand how a clustering algorithm works, good sample datasets are useful to highlight its behavior under certain circumstances. This post shows how to generate 9 datasets:
- a mixture of two Gaussians with same size, variance and no covariance,
- Gaussians which differ only from their means and sizes,
- Gaussians which differ only from their means and variances,
- Gaussians with a different direction,
- a general mixture of Gaussians,
- spherical data,
- a mixture of uniform data,
- and unclustered data.
Mixture of two Gaussians
Our first dataset consists of a mixture of two Gaussians.
Let's first define a function to generate a Gaussian. This function takes four arguments:
- a number of samples to generate, n,
- a center, corresponding to the mean of the Gaussian,
- a covariance matrix, sigma,
- and a cluster label.
library(mvtnorm)
library(dplyr)
generateGaussianData <- function(n, center, sigma, label) {
data = rmvnorm(n, mean = center, sigma = sigma)
data = data.frame(data)
names(data) = c("x", "y")
data = data %>% mutate(class=factor(label))
data
}
We use the function rmvnorm from the package mvtnorm to generate random numbers following a multivariate normal distribution.
The package dplyr is used to manipulate dataframes, especially the %>% operator allows to pass the variable on the left of the operator as the first argument of the function on the right. It is convenient since we can read the processing pipeline from the left to the right, the transformation steps being the functions between the %>% operators. For instance, the mutate function creates a new column named class.
We can now generate our first dataset:
dataset1 <- {
# cluster 1
n = 500
center = c(5, 5)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 500
center = c(1, 1)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "1 - Mixture of Gaussians"
data
}
We defined two clusters of equal size (500 samples each), with an identity covariance matrix. The function bind_rows is defined by the dplyr package and stacks the two dataframes data1 and data2. The variable data is thus a dataframe with \(500 + 500 = 1000\) rows, each row being an observation.
We can see how the dataframe looks like with:
str(dataset1)
It outputs:
'data.frame': 1000 obs. of 4 variables:
$ x : num 3,79 5,09 5,50 4,50 4,94 ...
$ y : num 5,35 5,95 6,52 4,82 5,52 ...
$ class : chr "1" "1" "1" "1" ...
$ dataset: chr "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" ...
We can also visualize this dataset:
library(ggplot2)
dataset1 %>% ggplot(aes(x=x, y=y, shape=class)) +
geom_point() +
coord_fixed() +
scale_shape_manual(values=c(2, 3))
We used the ggplot2 package to have a scatter plot with geam_point. The function coord_fixed() is here to have the same scale for the two axis, and we define the symbols to use for the points with scale_shape_manual(values=c(2, 3)). We take the columns x and y as coordinates and the shape is given by the class column.
Here is the plot we get:
This example is very specific since the variances of the two classes are equal, there is no covariance between x and y inside a class, and the two classes have the same size (i.e. number of samples per class). It makes some algorithms (k-means for instance) behave well, i.e. able to recover the true classes.
Mixture of two Gaussians with different sizes
For the second dataset, we generate two Gaussians as before, but this time, the two classes have distinct sizes:
dataset2 <- {
# cluster 1
n = 2000
center = c(5, 5)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 50
center = c(1, 1)
sigma = matrix(c(1, 0, 0, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "2 - Different sizes"
data
}
We can plot it as before:
dataset2 %>% ggplot(aes(x=x, y=y, shape=class)) +
geom_point() +
coord_fixed() +
scale_shape_manual(values=c(2, 3))
It produces:
We will see in a following post that this simple relaxation of the assumption on the sizes of the classes can make some algorithms (k-means for instance) behave differently, so that they cannot recover the true classes.
Mixture of two Gaussians with different variances
The third dataset is also generated by two Gaussians but with different variances:
dataset3 <- {
# cluster 1
n = 100
center = c(5, 5)
sigma = matrix(c(3, 0, 0, 3), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 100
center = c(1, 1)
sigma = matrix(c(0.1, 0, 0, 0.1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "3 - Different variances"
data
}
The corresponding plot is:
Again, a simple relaxation, this time on the variances, makes some algorithms unable to recover the true classes.
Mixture of Gaussians with non zero covariance
The fourth dataset consists of two Gaussians, but this time with high covariance:
dataset4 <- {
# cluster 1
n = 500
center = c(4, 4)
sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 500
center = c(5, 5)
sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "4 - Non zero covariance"
data
}
The corresponding plot is:
Mixture of Gaussians with different sizes, variances and covariance
The Gaussian distribution has two parameters: the mean and the covariance matrix. When generating a mixture of Gaussians, a third parameter step in. Let's see how to combine the previous examples: changing all these parameters at the same time:
dataset5 <- {
# cluster 1
n = 1000
center = c(1, 1)
sigma = matrix(c(3, -0.9, -0.9, 3), nrow = 2)
data1 = generateGaussianData(n, center, sigma, 1)
# cluster 2
n = 50
center = c(5, 5)
sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2)
data2 = generateGaussianData(n, center, sigma, 2)
# all data
data = bind_rows(data1, data2)
data$dataset = "5 - Disparate Gaussians"
data
}
The corresponding plot is:
We can now move on to another distribution.
Spherical data
Some real world datasets are inherently spherical, i.e. the points are lying on the surface of a sphere, so generating a spherical dataset is helpful to understand how an algorithm behave on this kind of data, in a controlled environment (we know our dataset better when we generate it). Our spherical dataset consists of a mixture of data following a Von Mises-Fisher distribution.
Let's first define a function to generate one class, i.e. one Von Mises-Fisher distribution:
library(movMF)
generateMFData <- function(n, theta, cluster, scale=1) {
data = rmovMF(n, theta)
data = data.frame(data[,1], data[,2])
data = scale * data
names(data) = c("x", "y")
data = data %>% mutate(class=factor(cluster))
data
}
The function rmovMF from the movMF package generates samples given the two parameters of the Von Mises-Fisher distribution. We can now generate our mixture:
dataset6 <- {
# cluster 1
n = 100
data1 = generateMFData(n, 1 * c(-1, 1), 1, 5)
# cluster 2
n = 100
data2 = generateMFData(n, 1 * c(1, -1), 2, 1)
# all data
data = bind_rows(data1, data2)
data$dataset = "6 - Spherical classes"
data
}
The corresponding plot is:
We can see that all the points are lying on the surface of the hypersphere.
Spirals
We are going to generate the following dataset:
For this, we will create data for one spiral, by creating columns for:
- the angle which spreads evenly between angleStart and angleStart + 2.5 * pi,
- the radius which decrements for each sample,
- the x and y coordinates, based on the radius and the angle.
The second spiral is the same as the first one, except we take the negative values of x and y.
We then combine the two dataframes and add some noise, and shift it so the coordinates stay in the same range as in the other datasets.
Here is the complete code to produce it:
generateSpiralData <- function(n) {
maxRadius = 7
xShift = 2.5
yShift = 2.5
angleStart = 2.5 * pi
noiseVariance = 0.1
# first spiral
firstSpiral <- function() {
d1 = data.frame(0:(n-1))
colnames(d1) <- c("i")
d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n),
radius = maxRadius * (n + n/5 - i)/ (n + n/5),
x = radius * sin(angle),
y = radius * cos(angle),
class=1)
}
d1 = firstSpiral()
# second spiral
d2 = d1 %>% mutate(x = -x, y = -y, class=2)
# combine, add noise, and shift
generateNoise <- function(n) {
sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2)
noise = rmvnorm(n, mean = c(0, 0), sigma = sigma)
df = data.frame(noise)
colnames(df) <- c("xNoise", "yNoise")
df
}
d1 %>%
bind_rows(d2) %>%
bind_cols(generateNoise(2*n)) %>%
transmute(x = x + xShift + xNoise,
y = y + yShift + yNoise,
dataset = "7 - Spirals",
class = factor(class))
}
dataset7 = generateSpiralData(500)
We used the transmute function from dplyr to create new columns and remove the other ones (used only temporarily).
Uniform data
We can generate data from the uniform distribution with the function runif:
dataset8 <- {
generateUniformData <- function(cluster, minX, maxX) {
x = runif(500, min = minX, max = maxX)
y = runif(500, min = -4, max = 9)
data.frame(x, y) %>% mutate(class=cluster)
}
generateUniformData(1, -4, 1) %>% bind_rows(generateUniformData(2, 3, 9)) %>%
mutate(dataset = "8 - Uniform data",
class = factor(class))
}
The corresponding plot is:
Unclustered data
The last dataset is almost the same as the previous one but the ranges for the uniform distributions are the same, so it is impossible to distinguish between the two classes:
dataset9 <- {
generateuniformdata <- function(cluster) {
x = runif(500, min = -4, max = 9)
y = runif(500, min = -4, max = 9)
data.frame(x, y) %>% mutate(class=cluster)
}
generateuniformdata(1) %>% bind_rows(generateuniformdata(2)) %>%
mutate(dataset = "9 - unclustered data",
class = factor(class))
}
The corresponding plot is:
All datasets in one plot
To get a big picture of all the datasets we generated, we can combine them all, in a single dataframe, and plot it in a single plot:
data = bind_rows(dataset1, dataset2, dataset3, dataset4, dataset5, dataset6, dataset7, dataset8, dataset9)
scatterPlot <- function(data) {
scatterPlot = ggplot(data, aes(x=x, y=y, shape=class)) +
geom_point() +
coord_fixed() +
scale_shape_manual(values=c(2, 3)) +
facet_wrap(~dataset, nrow=3)
scatterPlot
}
scatterPlot(data)
The facet_wrap function of the ggplot2 package allows us to plot all the datasets in a single plot:
We will use these datasets in other posts to illustrate the behavior of some clustering algorithms.