It is common for researchers and graduate students working on clustering papers to evaluate on only a handful of datasets, which is too few for a rigorous evaluation. Another problem is that many of these datasets are trivial, with little of significance in them. A third group of authors propose their own datasets but forget to test their methods against established benchmarks, so their evaluations risk being biased.
This article is intended for anyone developing a new clustering solution, or improving an existing one, who is looking for challenging datasets to evaluate it on. Finding datasets that come with ground truth is surprisingly hard. The collections listed below cover a range of dimensionalities, sizes, and cluster types, and they include ground-truth labels so you can validate your clustering results.
- Basic clustering benchmark datasets from Machine Learning Lab, University of Eastern Finland, School of Computing
- Clueminer project dataset: contains several artificially generated datasets and real-world datasets
- Marek Gagolewski’s benchmark suite for clustering algorithms — Version 1
- UC Irvine Machine Learning Repository
- IFCS Cluster Benchmark Data Repository
- 102 clustering datasets available on data.world
- CLUTO – clustering high-dimensional datasets
- Stanford large network dataset collection
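Once you have a labeled dataset from one of these collections, validation typically means comparing the predicted partition against the ground-truth labels with an external index such as the adjusted Rand index (ARI) or normalized mutual information (NMI). Here is a minimal sketch using scikit-learn; since each repository above ships labels in its own format, a synthetic dataset stands in for a downloaded benchmark:

```python
# Sketch: validating a clustering result against ground-truth labels.
# A toy dataset from make_blobs stands in for a downloaded benchmark file.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# X holds the points, y_true the ground-truth cluster labels.
X, y_true = make_blobs(n_samples=500, centers=3, random_state=42)

# Cluster the data, then score the predicted partition against the truth.
y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```

Both scores are label-permutation invariant, which matters because a clustering algorithm has no way to know which arbitrary integer the ground truth assigns to each cluster.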
Selecting a good dataset alone will not solve the problem. You also need to understand the limitations of the method or model you have developed or are using: no single approach fits every scenario. So be smart and choose your method wisely.