Information-theoretic clustering exploits information-theoretic measures as the clustering criteria. A common approach is Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While research on Info-Kmeans has shown promising results, a remaining challenge is handling high-dimensional sparse data such as text corpora. For high-dimensional text vectors, the centroids often contain many zero-valued features, which lead to infinite KL-divergence values and create a dilemma in assigning objects to centroids during the iterations of Info-Kmeans. To meet this challenge, we propose a Summation-based Incremental Learning (SAIL) algorithm for Info-Kmeans clustering in this chapter. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of Shannon entropy, which avoids the zero-value dilemma. To further improve clustering quality, we introduce the Variable Neighborhood Search (VNS) meta-heuristic and propose the V-SAIL algorithm, which is then accelerated by a multithreading scheme in PV-SAIL. Experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved, and that V-SAIL and PV-SAIL further improve clustering quality at a low computational cost.
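The zero-value dilemma described above can be illustrated with a minimal sketch (not the chapter's implementation; vector values and function names are illustrative): when a centroid has a zero on a feature where a document does not, the KL-divergence to that centroid is infinite, whereas a Shannon entropy computed over a cluster's summed feature counts, as in the SAIL objective, always stays finite.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) over normalized vectors; infinite when q has a
    zero on a feature where p is positive."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def cluster_entropy(count_sum):
    """Shannon entropy of a cluster's summed feature counts;
    zero features simply drop out, so the value is always finite."""
    p = count_sum / count_sum.sum()
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

# A normalized document vector and a centroid that is zero on feature 3
doc = np.array([0.5, 0.3, 0.2])
centroid = np.array([0.6, 0.4, 0.0])

print(kl_divergence(doc, centroid))          # inf: the assignment dilemma
print(cluster_entropy(doc + centroid))       # finite, regardless of zeros
```

Because the entropy of summed counts can be updated incrementally as objects enter or leave a cluster, an entropy-based objective sidesteps the per-centroid KL computation entirely.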
Wu, J. (2012). Information-Theoretic K-means for Text Clustering (pp. 69–98). https://doi.org/10.1007/978-3-642-29807-3_4