A framework for benchmarking clustering algorithms

The evaluation of clustering algorithms can involve running them on a variety of benchmark problems and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive dataset explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided on the project's homepage.


Introduction
Cluster analysis [1][2][3] is a data mining task where we discover semantically useful dataset partitions in a purely unsupervised manner. We know that there is no single "best" all-purpose algorithm [4], but some methods are better than others for certain problem types. However, much remains to be done [3,5,6] with regard to separating the promising approaches from the systematically disappointing ones.
One approach to clustering validation relies on using the so-called internal measures, which are supposed to summarise the quality of partitions into a single number [7][8][9]. In practice, they can only focus on a single property of a given split (e.g., set separability or compactness) and the partitions they promote might be far from sound [10].
Another approach is to use the external validity measures [11][12][13][14] that quantify the similarity between the generated clusterings and the reference (ground-truth) partitions provided by experts.
Unfortunately, it is not rare for research papers and graduate theses to consider only a small number of benchmark datasets. We regularly come across the same 5-10 test problems from the UCI [15] database. This is obviously too few to make any evaluation rigorous enough and thus may lead to overfitting [16,17]. Some authors propose their own datasets, but do not test their methods against other benchmark batteries. This might give rise to biased conclusions, as there is a risk that only the problems "easy" for a method of interest were included. On the other hand, the researchers who generously share their data (e.g., [15,18,19,20,21]), unfortunately, might not make the interaction with their batteries particularly smooth, as each of them uses a different file format. Furthermore, the existing repositories do not reflect the idea that there might be many equally valid/plausible/useful partitions of the same dataset; see [2,22] for discussion.
At the same time, well-agreed-upon benchmark problems have long existed in other machine learning domains (classification and regression datasets from the aforementioned UCI repository [15], but also test functions for global optimisation solvers, e.g., [23,24]).
In order to overcome these gaps, the current project proposes a consistent framework for benchmarking clustering algorithms. Its description is given in the next section. Then, in Section 3, we describe a Python API (the clustering-benchmarks package available at PyPI; see https://pypi.org/project/clustering-benchmarks/) that makes the interaction therewith relatively easy. Section 4 concludes the paper and proposes a few ideas for the future evolution of this framework.
Note that the datasets and the described software are independent of each other. Thanks to this, new datasets can easily be added in the future. Also, the users are free to use their own collections or access the data from within other programming environments. The current framework defines a suggested unified file format, which is detailed on the project's homepage.
Reference partitions. When referring to a particular benchmark problem, we use the convention "battery/dataset", e.g., "wut/x2". Let X be one such dataset, consisting of n points in ℝ^d. Each dataset is equipped with a reference partition assigned by experts. Such a grouping of the points into k ≥ 2 clusters is encoded using a label vector y, where y_i ∈ {1, …, k} gives the cluster ID of the i-th object. For instance, the left subfigure of Figure 1 depicts the ground-truth 3-clustering of wut/x2 (which is based on the information about how this dataset has been generated from a mixture of three Gaussian distributions).

[Figure 2 (caption): Partitions of wut/x2 discovered by Genie (g = 0.3) [33,34], k-means, and ITM [35] (k = 3 and k = 4). Confusion matrices and normalised clustering accuracies (NCA; Eq. (1); comparisons against the reference partitions depicted in Figure 1) are also reported. Note that the second ground-truth partition features some noise points: hence, in the k = 4 case, the first row of the confusion matrix is not taken into account.]
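In code, the label-vector encoding described above is simply a vector of small integers; here is a minimal sketch with made-up values for a hypothetical dataset of n = 10 points and k = 3 clusters:

    import numpy as np

    # hypothetical reference label vector: y[i] is the cluster ID (1, ..., k)
    # of the i-th point; the values are made up for illustration only
    y = np.array([1, 1, 2, 3, 2, 2, 1, 3, 3, 2])

    k = int(y.max())            # number of clusters, here: 3
    sizes = np.bincount(y)[1:]  # cluster sizes, here: [3 4 3]
    print(k, sizes)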
Running the algorithm in question. Let us consider a clustering algorithm whose quality we would like to assess.
When we apply it on X to discover a new k-partition (in an unsupervised manner, i.e., without revealing the true y), we obtain a vector of predicted labels ŷ encoding a new grouping. For example, the first row of scatterplots in Figure 2 depicts the 3-partitions of wut/x2 discovered by three different methods.
Assessing partition similarity. Ideally, we would like to work with algorithms that yield partitions closely matching the reference ones. This should be true on as wide a set of problems as possible. Hence, we need to relate the predicted labels to the reference ones. We can determine the confusion matrix C, where c_{i,j} denotes the number of points in the i-th reference cluster that the algorithm assigned to the j-th cluster. Even though such a matrix summarises all the information required to judge the similarity between the two partitions, if we wish to compare the quality of different algorithms, we would rather have it aggregated in the form of a single number. As one of the many external cluster validity indices (see, e.g., [12][13][14]), we can use the normalised clustering accuracy [11] given by:

    NCA(C) = max_σ (1/k) Σ_{j=1,…,k} (c_{σ(j),j} / c_{σ(j),·} − 1/k) / (1 − 1/k),        (1)

where c_{i,·} = c_{i,1} + ⋯ + c_{i,k} denotes the size of the i-th reference cluster. This index is the averaged percentage of correctly classified points in each cluster above the perfectly uniform label distribution. As the actual cluster IDs do not matter (a partition is a set of clusters and sets are, by definition, unordered), the optimal matching between the cluster labels is performed automatically by finding the best permutation σ of the set {1, …, k}.
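To make Eq. (1) concrete, below is a minimal, self-contained sketch that computes the confusion matrix and the index for a pair of made-up label vectors, using a brute-force search over the permutations (library implementations use a far more efficient matching):

    import itertools
    import numpy as np

    # made-up reference (y_true) and predicted (y_pred) label vectors, k = 3
    y_true = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3])
    y_pred = np.array([2, 2, 2, 3, 3, 3, 3, 1, 1])
    k = 3

    # confusion matrix: rows = reference clusters, columns = predicted ones
    C = np.zeros((k, k))
    for i, j in zip(y_true, y_pred):
        C[i - 1, j - 1] += 1

    # Eq. (1): average the per-cluster recovery rates over the best
    # permutation sigma, then normalise against the uniform baseline 1/k
    best = max(
        np.mean([C[sigma[j], j] / C[sigma[j], :].sum() for j in range(k)])
        for sigma in itertools.permutations(range(k))
    )
    nca = (best - 1 / k) / (1 - 1 / k)
    print(C, round(nca, 2))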
There can be many valid partitions. What is more, it is in the very spirit of unsupervised learning that, in many cases, there might be many equally valid ways to split a given dataset. An algorithm should be rewarded for finding a partition that closely matches any of the reference ones. This might require running the method multiple times (unless it is a hierarchical one) to find the clusterings of different cardinalities. Then, the generated outputs are evaluated against all the available reference labellings and the maximal similarity score is reported.
Noise points. Also, to make the clustering problem more difficult, some datasets feature noise points (e.g., outliers or irrelevant points in between the actual clusters). They are specially marked in the ground-truth vectors: we assign them cluster IDs of 0; compare the right subfigure of Figure 1, where they are coloured grey. A clustering algorithm must never be informed about the location of such "problematic" points. Once the partition of the dataset is determined, they are excluded from the computation of the external cluster validity measures. In other words, it does not matter to which clusters the noise points are allocated.
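A minimal sketch of this convention, with made-up label vectors (0 marks the noise points in the reference vector only):

    import numpy as np

    y_true = np.array([0, 1, 1, 2, 0, 2, 2, 1])  # reference; 0 = noise point
    y_pred = np.array([1, 1, 1, 2, 2, 2, 2, 3])  # some algorithm's output (made up)

    keep = y_true > 0                             # drop the noise points
    y_true_eval, y_pred_eval = y_true[keep], y_pred[keep]
    # any external validity measure is then applied to the filtered vectors only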

The Python API
To facilitate the employment of the aforementioned framework, we have implemented an open-source package for Python named clustering-benchmarks. It can be installed from PyPI (https://pypi.org/project/clustering-benchmarks/), e.g., via a call to pip3 install clustering-benchmarks. Then, it can be imported by calling:

    import clustbench  # clustering-benchmarks
    import os.path, genieclust, sklearn.cluster  # we will need these later

Fetching benchmark data. The example datasets repository [25] (or any custom repository provided by the user) can be queried easily. Let us assume that we store it in the following directory:

    data_path = os.path.join("~", "Projects", "clustering-data-v1")  # example

A particular dataset (here, for example: wut/x2) can be accessed by calling:

    battery, dataset = "wut", "x2"
    b = clustbench.load_dataset(battery, dataset, path=data_path)

The above call returns a named tuple, whose data field gives the data matrix, labels gives the list of all ground-truth partitions (encoded as label vectors), and n_clusters gives the corresponding numbers of subsets. For instance, Figure 1 has been generated based on the contents of these fields.
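A minimal sketch of how such a figure can be drawn with plain matplotlib (the styling below is an illustration only, not the exact code behind Figure 1; noise points, if present, carry label 0 and are drawn in grey):

    import matplotlib.pyplot as plt

    # one panel per available reference partition of the loaded dataset
    for i, labels in enumerate(b.labels):
        ax = plt.subplot(1, len(b.labels), i + 1)
        noise = (labels == 0)
        ax.scatter(b.data[noise, 0], b.data[noise, 1], c="lightgray", s=10)
        ax.scatter(b.data[~noise, 0], b.data[~noise, 1], c=labels[~noise], s=10)
        ax.set_title(f"{battery}/{dataset} (k={b.n_clusters[i]})")
        ax.set_aspect("equal")
    plt.show()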

Computing external cluster validity measures.
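The snippet below refers to res, a dictionary mapping method names to the corresponding predicted partitions. One plausible way to populate it, continuing the running example (the gini_threshold value, the dictionary key, and the exact argument order of clustbench.fit_predict_many are assumptions made for illustration), is:

    # apply the Genie algorithm (from the genieclust package, imported earlier)
    # once for every cluster count requested by the reference partitions
    res = dict()
    res["Genie_G0.3"] = clustbench.fit_predict_many(
        genieclust.Genie(gini_threshold=0.3),  # assumed parametrisation
        b.data, b.n_clusters
    )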
Here is a way to compute the external cluster validity measures:

    round(clustbench.get_score(b.labels, res["Genie_G0.3"]), 2)
    ## 0.87

By default, the aforementioned normalised clustering accuracy (Eq. (1)) is applied, but this might be changed to any other score by setting the metric argument explicitly. As explained above, we compare the predicted clusterings against all the reference partitions (ignoring the noise points), and report the maximal score.
Applying clustering methods manually. We can use clustbench.fit_predict_many to generate all the partitions required to compare ourselves against the reference labels. Let us test the k-means algorithm as implemented in the scikit-learn package [36]; see the sketch below. We see that k-means (which specialises in detecting symmetric Gaussian-like blobs) performs better than Genie on this particular dataset; see Figure 2 for an illustration (also featuring the results generated by the ITM method [35]). The project's homepage and documentation discuss many more functions.
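A sketch of such a call, continuing the example above (the n_init setting is merely an assumed parametrisation, not necessarily the one used in the paper):

    # cluster b.data into each requested number of clusters using k-means
    res["KMeans"] = clustbench.fit_predict_many(
        sklearn.cluster.KMeans(n_init=10),  # assumed parametrisation
        b.data, b.n_clusters
    )
    round(clustbench.get_score(b.labels, res["KMeans"]), 2)  # NCA, by default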

Conclusion
The current project is designed to be extensible so that it can accommodate new datasets and/or label vectors in the future, making the evaluation of clustering algorithms much more rigorous. Any contributions are warmly welcome; see https://github.com/gagolews/clustering-benchmarks/issues for the feature-request and bug tracker. Additionally, we have implemented an interactive standalone application (Colouriser) that can be used for preparing one's own two-dimensional datasets.
Future versions of the benchmark suite will include methods for generating random samples of arbitrary sizes and cluster-size distributions similar to a given dataset (e.g., with more noise points). Thanks to this, in the case of algorithms that feature many tunable parameters, it will be possible to implement some means to separate validation datasets (where we are allowed to learn the "best" settings; see, e.g., [17] and the references therein) from the testing ones (used in the final comparisons), which is quite a standard approach in other machine learning domains.
Moreover, the framework can be extended to cover overlapping clusterings as well as semi-supervised learning tasks, where an algorithm knows about the right assignment of some of the input points in advance.