Investigation of internal validity measures for K-means clustering

Jonathan Baarsch; M. Emre Celebi

Conference Proceedings

Lecture Notes in Engineering and Computer Science (2012) 2195 471-476

ISSN: 20780958

16 Citations

121 Readers

Abstract

Clustering is a fundamental task in data mining and knowledge discovery. The most widely used clustering technique is the k-means algorithm, which depends on the choice of the number of clusters, k. In unsupervised settings, choosing an appropriate value for k is difficult. To overcome this challenge, validity measures attempt to determine how accurately the clusters reflect the data. However, validity measures have proliferated, and different measures often produce disparate results. This paper reports an experiment to evaluate commonly used cluster validity measures, including Dunn, Davies-Bouldin, Calinski-Harabasz, Silhouette, Point Bi-serial, PBM, and Sum-of-Squares. These measures were applied to k-means clusterings of 125 artificially generated data sets. The Sum-of-Squares method was found to be the most effective for predicting an appropriate value for k. Silhouette was found to be a good alternative, while Calinski-Harabasz and Davies-Bouldin both made only moderate showings compared to those two. Dunn, Point Bi-serial, and PBM performed quite poorly. The results also suggest that validity measures could be used as explanatory tools in their own right.
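To make the idea of scoring k-means clusterings over a range of k concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' experimental code: the synthetic data (three Gaussian blobs), the farthest-first initialisation, the candidate range k = 2..6, and all function names are assumptions made for illustration. It computes two of the measures named in the abstract, the within-cluster sum of squares and the mean Silhouette width, and picks the k with the highest Silhouette.

```python
import math
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm. Farthest-first initialisation is an
    assumption of this sketch, chosen only to keep it deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def sum_of_squares(points, labels, centers):
    """Within-cluster sum of squared distances to the assigned center."""
    return sum(dist2(p, centers[l]) for p, l in zip(points, labels))

def mean_silhouette(points, labels, k):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b the smallest mean distance
    to any other cluster. Singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        sums, counts = [0.0] * k, [0] * k
        for j, q in enumerate(points):
            if i != j:
                sums[labels[j]] += math.sqrt(dist2(p, q))
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in range(k) if c != own and counts[c] > 0)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical test data: three well-separated 2-D Gaussian blobs.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in true_centers for _ in range(30)]

scores = {}
for k in range(2, 7):
    labels, centers = kmeans(points, k)
    scores[k] = (sum_of_squares(points, labels, centers),
                 mean_silhouette(points, labels, k))
    print(f"k={k}  sum-of-squares={scores[k][0]:.2f}  silhouette={scores[k][1]:.3f}")

best_k = max(scores, key=lambda k: scores[k][1])
print("k with highest mean silhouette:", best_k)
```

The two measures are read differently. The sum of squares generally keeps shrinking as k grows, so it is usually judged by where the decrease levels off rather than by its minimum, whereas the Silhouette can be maximised directly. How the paper turns each measure into a predicted k is not stated in the abstract, so the selection rule above is only one plausible choice.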

Citation (APA)

Baarsch, J., & Celebi, M. E. (2012). Investigation of internal validity measures for K-means clustering. In Lecture Notes in Engineering and Computer Science (Vol. 2195, pp. 471–476). Newswood Limited.
