Information-Theoretic K-means for Text Clustering

  • Wu J
Abstract

Information-theoretic clustering aims to exploit information-theoretic measures as the clustering criteria. A common approach is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While research efforts devoted to Info-Kmeans have shown promising results, a remaining challenge is handling high-dimensional sparse data such as text corpora. For high-dimensional text vectors, the centroids may contain many zero-valued features, which lead to infinite KL-divergence values and create a dilemma in assigning objects to centroids during the Info-Kmeans iteration process. To meet this challenge, we propose a Summation-based Incremental Learning (SAIL) algorithm for Info-Kmeans clustering in this chapter. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of Shannon entropy, which avoids the zero-value dilemma. To improve clustering quality, we further introduce the Variable Neighborhood Search (VNS) meta-heuristic and propose the V-SAIL algorithm, which is then accelerated by a multithreading scheme in PV-SAIL. Experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. Moreover, V-SAIL and PV-SAIL further improve clustering quality at low computational cost.
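The zero-value dilemma and the entropy-based reformulation can be illustrated with a small sketch (the vectors below are hypothetical unit-sum term distributions, not data from the chapter). It relies on the standard identity that, when a centroid is the arithmetic mean of its cluster members, the within-cluster KL cost equals n·H(centroid) minus the (constant) sum of member entropies, so minimizing the entropy side is equivalent and never produces infinities:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; 0*log(0) is treated as 0, so zeros are safe."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p || q); returns inf when q_i == 0 while p_i > 0 (the zero-value dilemma)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0/q) = 0 by convention
        if qi == 0:
            return math.inf  # zero-valued centroid feature
        total += pi * math.log(pi / qi)
    return total

# Toy cluster of two documents and its mean centroid.
docs = [[0.5, 0.5], [0.3, 0.7]]
centroid = [sum(col) / len(docs) for col in zip(*docs)]  # [0.4, 0.6]

# KL-based within-cluster cost vs. the entropy-based equivalent:
#   sum_x KL(x || c) == n * H(c) - sum_x H(x)   when c is the cluster mean.
kl_cost = sum(kl(x, centroid) for x in docs)
entropy_cost = len(docs) * entropy(centroid) - sum(entropy(x) for x in docs)
print(math.isclose(kl_cost, entropy_cost))  # True

# With a zero-valued centroid feature, KL blows up but entropy stays finite.
sparse_centroid = [1.0, 0.0]
print(kl(docs[0], sparse_centroid))  # inf
print(entropy(sparse_centroid))      # 0.0
```

Since the sum of member entropies is constant over cluster assignments, an incremental scheme in the spirit of SAIL only needs to track how each cluster's summed feature counts (and hence its centroid entropy) changes when an object moves between clusters.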

Citation (APA)

Wu, J. (2012). Information-Theoretic K-means for Text Clustering (pp. 69–98). https://doi.org/10.1007/978-3-642-29807-3_4
