Accuracy: from classification to clustering evaluation

Posted on Tue 04 June 2019 in machine learning • 4 min read

This post is part 7 of the "clustering" series:

  1. Generate datasets to understand some clustering algorithms behavior
  2. K-means is not all about sunshines and rainbows
  3. Gaussian mixture models: k-means on steroids
  4. How many red Christmas baubles on the tree?
  5. Chaining effect in clustering
  6. Animate intermediate results of your algorithm
  7. Accuracy: from classification to clustering evaluation

Accuracy is often used to measure the quality of a classification. It is also used for clustering. However, the scikit-learn accuracy_score function only provides a lower bound of accuracy for clustering. This blog post explains how accuracy should be computed for clustering.

Let's first recap what accuracy is for a classification task.

Accuracy for classification

For \(n\) samples in a dataset, let \(y_i\) be the class label for the \(i\)-th sample and \(\hat{y}_i\) the predicted value. Accuracy between the class labels \(y\) and the predicted values \(\hat{y}\) is defined by:

\begin{equation*} accuracy(y, \hat{y}) = \frac{1}{n} \sum_{i=0}^{n-1} 1(\hat{y}_i = y_i) \end{equation*}

where \(x \rightarrow 1(x)\) is the indicator function: \(1(\hat{y}_i = y_i) = 1\) if \(\hat{y}_i = y_i\) and \(0\) else.

It is computed as the sum of the diagonal elements of the confusion matrix, divided by the number of samples to get a value between 0 and 1.

For clustering, there is however no association provided by the clustering algorithm between the class labels and the predicted cluster labels. We therefore need to change the formula a little.

Accuracy for clustering

For clustering, we have to find the best match between the class labels and the cluster labels, so accuracy is defined by:

\begin{equation*} accuracy(y, \hat{y}) = \max_{perm \in P} \frac{1}{n} \sum_{i=0}^{n-1} 1(perm(\hat{y}_i) = y_i) \end{equation*}

where \(P\) is the set of all permutations in \([1; K]\) where \(K\) is the number of clusters.

It is important to note that there are \(K !\) permutations in \(P\) which is quite high and the computation of accuracy is therefore prohibitive if we apply blindly this formula. The Hungarian algorithm can help us compute it in \(O(K^3)\) which is significantly faster.

Let's see in practice how to compute accuracy on clustering results in a polynomial time.

Hands-on

In this part, we are going to first get some textual data to process, then cluster the documents and compute the accaracy between the clustering results and the known classes.

Get some data

Let's get some documents from the 20 Newsroups data set. We retrieve only five classes from it:

from sklearn.datasets import fetch_20newsgroups

categories = [
    'rec.motorcycles',
    'rec.sport.baseball',
    'comp.graphics',
    'sci.space',
    'talk.politics.mideast'
]

ng5 = fetch_20newsgroups(categories=categories, shuffle=True)

labels = ng5.target
documents = ng5.data

Latent Semantic Indexing is a well known method to transform documents into low dimensional vectors (low compared to the size of the vocabulary, i.e. the number of unique words in the dataset):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD()),
    ('normalizer', Normalizer())
])
pipeline.set_params(svd__n_components=300)
A = pipeline.fit_transform(documents)

Each line of the matrix A is a document represented by a vector. We can therefore apply a clustering algorithm on this matrix.

Clustering

Spherical k-means is a good algorithm to cluster textual data. One implementation is given by the Coclust Python library:

from coclust.clustering import SphericalKmeans

skm = SphericalKmeans(n_clusters=5)
skm.fit(A)
predicted_labels = skm.labels_

We are now ready to compute the accuracy between labels and predicted_labels. As described before, we can do this by first computing the confusion matrix.

Confusion matrix

Scikit-learn library provides a function called confusion_matrix to create a Numpy array containing the values of the confusion matrix:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(labels, predicted_labels)

Let's visualize it with Seaborn visualization library:

import seaborn as sns; sns.set()

ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
Confusion matrix

To compute the accuracy, one need to sum the values on the diagonal, but for clustering the rows (or columns) of the confusion matrix should be permuted to obtain the maximum value for the sum. This is not the case here. Let's see how to find the best permutation.

Permutation maximizing the sum of the diagonal elements

Scikit-learn provides a function linear_assignment to apply the Hungarian algorithm. One can use it to create our reordered confusion matrix cm2:

from sklearn.utils.linear_assignment_ import linear_assignment
import numpy as np

def _make_cost_m(cm):
    s = np.max(cm)
    return (- cm + s)

indexes = linear_assignment(_make_cost_m(cm))
js = [e[1] for e in sorted(indexes, key=lambda x: x[0])]
cm2 = cm[:, js]

Since linear_assigment minimizes a cost which should be positive, we defined the function _make_cost_m to transform the confusion matrix into a cost matrix. The reordered confusion matrix cm2 can be visualized:

Reordered confusion matrix

We are almost done. Let's finally compute the accuracy.

Compute accuracy

The accuracy is the sum of the diagonal elements divided by the number of samples:

np.trace(cm2) / np.sum(cm2)

Instead of implementing all this stuff ourselves, we could just use accuracy function provided by Coclust:

from coclust.evaluation.external import accuracy

accuracy(labels, predicted_labels)

It should be noted that scikit-learn accuracy_score function does not give the accuracy score for the clustering since it does not permute the rows of the confusion matrix:

from sklearn.metrics.classification import accuracy_score

accuracy_score(labels, predicted_labels)

gives the same value as:

np.trace(cm) / np.sum(cm)

Take away

Here is a summary of the key elements of this post:

  • accuracy_score provided by scikit-learn is meant to deal with classification results, not clustering.
  • Computing accuracy for clustering can be done by reordering the rows (or columns) of the confusion matrix so that the sum of the diagonal values is maximal.
  • The linear assignment problem can be solved in \(O(n^3)\) instead of \(O(n!)\).
  • Coclust library provides an implementation of the accuracy for clustering results.