Clustering natural text collections is generally difficult because such collections are high-dimensional, heterogeneous, and large. These characteristics compound the problem of determining an appropriate similarity space for clustering algorithms. In this paper, we propose using spectral analysis of the similarity space of a text collection to predict clustering behavior before any clustering is performed. Spectral analysis has been adopted across different domains to analyze the key information encoded in a system. Used for prediction, it helps to assess the quality of the similarity space first and to uncover problems that the selected feature set may present. Our experiments show that such insights can be obtained by analyzing the spectrum of the similarity matrix of a text collection, and that the spectrum can be used to estimate the number of clusters in advance.
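To make the idea of reading a cluster count off the spectrum concrete, the following is a minimal sketch rather than the procedure used in the paper, which the abstract does not spell out. It assumes cosine similarity over a document-term matrix, a symmetrically normalized similarity matrix, and the common eigengap heuristic (pick the k with the largest drop between consecutive eigenvalues). The function names, the `max_k` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows (documents) of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero documents as zero rows
    Xn = X / norms
    return Xn @ Xn.T


def estimate_num_clusters(S, max_k=10):
    """Estimate the number of clusters from the spectrum of S.

    Assumption: eigengap heuristic on D^{-1/2} S D^{-1/2}; this is a
    standard stand-in, not necessarily the authors' criterion.
    """
    d = S.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    M = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(M)[::-1]  # sorted in descending order
    top = eigvals[: max_k + 1]
    gaps = top[:-1] - top[1:]              # gaps[i] = lambda_i - lambda_{i+1}
    k = int(np.argmax(gaps)) + 1           # largest gap follows the k-th eigenvalue
    return k, eigvals


# Toy collection (hypothetical): 3 topics, 4 documents each, with
# disjoint 3-term vocabularies, so the similarity matrix is block diagonal.
pattern = np.array([[2, 1, 0], [1, 2, 0], [1, 1, 1], [2, 1, 1]], dtype=float)
X = np.zeros((12, 9))
for g in range(3):
    X[4 * g : 4 * g + 4, 3 * g : 3 * g + 3] = pattern

S = cosine_similarity_matrix(X)
k, eigvals = estimate_num_clusters(S)
print("estimated number of clusters:", k)
print("leading eigenvalues:", np.round(eigvals[:5], 3))
```

On this idealized toy collection the three topic blocks each contribute an eigenvalue of 1, so the largest gap falls after the third eigenvalue. Real text collections rarely separate this cleanly, and a flat spectrum with no pronounced gap is itself a hint that the chosen feature set or similarity measure may not support a clear clustering.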
CITATION STYLE
Li, W., Ng, W. K., & Lim, E. P. (2004). Spectral analysis of text collection for similarity-based clustering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3056, pp. 389–393). Springer Verlag. https://doi.org/10.1007/978-3-540-24775-3_47