An efficient clustering approach for large document collections

Bo Han; Lishan Kang; Huazhu Song

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2005) 3584 LNAI 240-247

DOI: 10.1007/11527503_29

Abstract

Vast amounts of unstructured text data, such as scientific publications, commercial reports and webpages, must be quickly categorized into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods struggle with the large number of documents and their high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections, which performs clustering in three stages: 1) identifying informative topic words with a permutation test, so as to reduce the feature dimension; 2) selecting a small number of the most typical documents and performing an initial clustering on them; 3) refining the clustering on all documents. The algorithm was tested on the 20 Newsgroups data set. Experimental results showed that, compared with methods that cluster the corpus directly on all document samples and the full feature set, this approach reduced the time cost by an order of magnitude while slightly improving clustering quality. © Springer-Verlag Berlin Heidelberg 2005.
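
The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the test statistic, the criterion for "typical" documents, or the clustering routine, so everything below is an assumption made for illustration and not the authors' method: the permutation test shuffles token-to-document assignments and uses the variance of a word's counts across documents as the statistic, "typical" documents are those with the most tokens from the selected words, and the clustering is a plain cosine k-means. All function and parameter names are hypothetical.

```python
import numpy as np

def permutation_select(X, n_perm=200, alpha=0.05, seed=0):
    """Stage 1 (assumed form): keep words whose counts are concentrated in
    few documents more than chance allows. X is an integer (docs x words)
    count matrix. Null: tokens are randomly reassigned to documents."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    docs, words = np.nonzero(X)
    counts = X[docs, words]
    doc_ids = np.repeat(docs, counts)    # one entry per token
    word_ids = np.repeat(words, counts)
    observed = X.var(axis=0)
    exceed = np.zeros(n_words)
    for _ in range(n_perm):
        shuffled = rng.permutation(doc_ids)  # permute document labels
        flat = np.bincount(shuffled * n_words + word_ids,
                           minlength=n_docs * n_words)
        exceed += flat.reshape(n_docs, n_words).var(axis=0) >= observed
    p_values = (exceed + 1) / (n_perm + 1)
    return np.flatnonzero(p_values <= alpha)

def normalize(X):
    """Scale rows to unit length so dot products are cosine similarities."""
    X = X.astype(float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def farthest_first(X, k):
    """Deterministic seeding: start at row 0, then repeatedly add the row
    least similar to the rows chosen so far."""
    chosen = [0]
    for _ in range(k - 1):
        sim = (X @ X[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(sim)))
    return X[chosen].copy()

def kmeans(X, centroids, n_iter=10):
    """Cosine k-means on unit-length rows, starting from given centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
        centroids = normalize(centroids)
    return np.argmax(X @ centroids.T, axis=1), centroids

def three_stage_cluster(X, k, n_typical, seed=0):
    keep = permutation_select(X, seed=seed)          # stage 1: fewer features
    Z = normalize(X[:, keep])
    # Stage 2 (assumed criterion): documents richest in selected words.
    typical = np.argsort(-X[:, keep].sum(axis=1))[:n_typical]
    _, centroids = kmeans(Z[typical], farthest_first(Z[typical], k))
    labels, _ = kmeans(Z, centroids, n_iter=3)       # stage 3: all documents
    return labels, keep
```

The saving in such a scheme comes from running most k-means iterations on a small subset with a reduced vocabulary, then spending only a few iterations on the full collection. The sketch assumes the permutation test keeps at least one word and that the typical subset contains at least `k` documents.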

Cite

APA

Han, B., Kang, L., & Song, H. (2005). An efficient clustering approach for large document collections. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3584 LNAI, pp. 240–247). Springer Verlag. https://doi.org/10.1007/11527503_29
