Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one: the k-means clustering technique, through three different algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm, and the Hartigan & Wong algorithm. We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique.

Data clustering techniques are descriptive data analysis techniques that can be applied to multivariate data sets to uncover the structure present in the data. They are particularly useful when classical second-order statistics (the sample mean and covariance) cannot be used. Namely, in exploratory data analysis, one of the assumptions made is that no prior knowledge about the dataset, and therefore about its distribution, is available. In such a situation, data clustering can be a valuable tool. Data clustering is a form of unsupervised classification: the clusters are formed by evaluating similarities and dissimilarities of intrinsic characteristics between different cases, and the grouping of cases is based on those emergent similarities rather than on an external criterion. These techniques are also useful for datasets of more than three dimensions, as it is very difficult for humans to reliably compare items of such complexity without a support to aid the comparison.

The technique presented in this tutorial, k-means clustering, belongs to the family of partitioning-based techniques, which are based on the iterative relocation of data points between clusters. It is used to divide either the cases or the variables of a dataset into non-overlapping groups, or clusters, based on the characteristics uncovered. Whether the algorithm is applied to the cases or to the variables depends on which dimension of the dataset we want to reduce. The goal is to produce groups of cases/variables with a high degree of similarity within each group and a low degree of similarity between groups (Hastie, Tibshirani & Friedman, 2001). The k-means clustering technique can also be described as a centroid model, as a single vector representing the mean is used to describe each cluster.

MacQueen (1967), the creator of one of the k-means algorithms presented in this paper, considered the main use of k-means clustering to be more of a way for researchers to gain qualitative and quantitative insight into large multivariate data sets than a way to find a unique and definitive grouping for the data. K-means clustering is very useful in exploratory data analysis and data mining in any field of research, especially as the growth in computing power has been accompanied by a growth in the occurrence of large data sets. Its ease of implementation, computational efficiency, and low memory consumption have kept k-means clustering very popular, even compared to other clustering techniques. Such other clustering techniques include connectivity models like hierarchical clustering methods (Hastie, Tibshirani & Friedman, 2001). These have the advantage of allowing an unknown number of clusters to be searched for in the data, but are very costly computationally because they are based on the dissimilarity matrix. Cluster analysis methods also include distribution models, like expectation-maximisation algorithms, and density models (Ankerst, Breunig, Kriegel & Sander, 1999).
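To make the iterative-relocation idea concrete before the full treatment later in the tutorial, the sketch below implements the Forgy/Lloyd variant in Mathematica. It is a minimal illustration under our own assumptions, not the implementation presented in this paper; the function name lloydKMeans and its arguments are hypothetical.

    (* A minimal sketch of the Forgy/Lloyd variant; the function name and
       arguments are hypothetical, not the tutorial's implementation. *)
    lloydKMeans[data_, k_, maxIter_ : 100] :=
     Module[{centroids, assign, old},
      centroids = RandomSample[data, k]; (* Forgy initialisation: k random cases *)
      Do[
       (* Assignment step: index of the nearest centroid for each case. *)
       assign = Table[First@Ordering[EuclideanDistance[x, #] & /@ centroids], {x, data}];
       old = centroids;
       (* Update step: move each centroid to the mean of its assigned cases;
          keep the old centroid if a cluster becomes empty. *)
       centroids = Table[
         With[{members = Pick[data, assign, j]},
          If[members === {}, old[[j]], Mean[members]]], {j, k}];
       If[centroids === old, Break[]], (* stop once no centroid moves *)
       {maxIter}];
      {centroids, assign}]

A call such as {centers, labels} = lloydKMeans[RandomReal[{0, 10}, {100, 2}], 3] returns the three cluster means and the cluster index of each case. Note how the two alternating steps, reassignment and centroid recomputation, are exactly the iterative relocation described above.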
A secondary goal of k-means clustering is the reduction of the complexity of the data. A good example is letter grades (Faber, 1994): the numerical grades are clustered into letter groups, and each grade is represented by the average of its cluster.
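As a toy illustration of this complexity reduction (the numbers below are invented for the example, not Faber's data), Mathematica's built-in FindClusters can group one-dimensional numerical grades into five clusters, with the mean of each cluster standing in for all the grades it contains, much as a letter grade does:

    grades = {93, 88, 77, 71, 65, 59, 84, 90, 68, 75, 81, 96, 62, 73, 58};
    clusters = FindClusters[grades, 5]; (* five groups, one per letter grade *)
    means = N[Mean /@ clusters]         (* each group summarised by its average *)

Replacing every grade by its cluster mean reduces the data from fifteen distinct values to five representatives, which is precisely the reduction of complexity described above.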