Investigation of internal validity measures for K-means clustering

ISSN: 20780958

Abstract

Clustering is a fundamental task in data mining and knowledge discovery. The most widely used clustering technique is the k-means algorithm, which depends on the choice of the number of clusters, k. In unsupervised situations, choosing an appropriate value for k is difficult. To overcome this challenge, validity measures attempt to determine how accurately the clusters reflect the data. However, validity measures have proliferated, and different measures often produce disparate results. This paper reports an experiment to evaluate commonly used cluster validity measures, including Dunn, Davies-Bouldin, Calinski-Harabasz, Silhouette, Point Bi-serial, PBM, and Sum-of-Squares. These measures were applied to k-means clusterings of 125 artificially generated data sets. The Sum-of-Squares method was found to be the most effective for predicting an appropriate value for k. Silhouette was found to be a good alternative, while Calinski-Harabasz and Davies-Bouldin performed only moderately by comparison. Dunn, Point Bi-serial, and PBM performed quite poorly. The results also suggest that validity measures could be used as explanatory tools in their own right.
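As a rough illustration of how internal validity measures are used to choose k, the sketch below runs k-means for several values of k on a small synthetic data set and reports two of the measures named in the abstract: the within-cluster Sum-of-Squares and the mean Silhouette. It is a minimal sketch, not the authors' experimental setup. The data (three well-separated Gaussian blobs), the farthest-first initialization, and all function names are assumptions made for the example, and the abstract does not say how k is selected from the Sum-of-Squares curve, so only the raw values are shown for that measure.

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd's k-means with farthest-first initialization.

    Farthest-first seeding is an assumption made here for determinism;
    the abstract does not state which initialization was used.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def sum_of_squares(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def mean_silhouette(points, labels, k):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) < 2:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in c) / len(c)
            for j, c in enumerate(clusters)
            if j != lab and c
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical data: three well-separated 2-D Gaussian blobs, 20 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(20)
]

scores = {}
for k in range(2, 7):
    labels, centroids = kmeans(points, k)
    scores[k] = {
        "sse": sum_of_squares(points, labels, centroids),
        "silhouette": mean_silhouette(points, labels, k),
    }
    print(k, round(scores[k]["sse"], 2), round(scores[k]["silhouette"], 3))

# Silhouette is maximized at the preferred k.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print("k chosen by mean silhouette:", best_k)
```

On data this cleanly separated, the Silhouette peaks at the true number of blobs and the Sum-of-Squares drops sharply up to that point; the paper's artificial data sets are presumably harder, which is where the measures begin to disagree.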

Baarsch, J., & Celebi, M. E. (2012). Investigation of internal validity measures for K-means clustering. In Lecture Notes in Engineering and Computer Science (Vol. 2195, pp. 471–476). Newswood Limited.
