Efficient Clustering of Very Large Document Collections

Inderjit S. Dhillon; James Fan; Yuqiang Guan

Book Chapter

DOI: 10.1007/978-1-4615-1733-7_20

Abstract

An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
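To make the vector space idea concrete, the following is a minimal sketch, not the chapter's implementation. The abstract does not name the clustering algorithm, so the sketch assumes a cosine-similarity variant of k-means (often called spherical k-means) over sparse term-frequency vectors stored as dictionaries. The function names, the whitespace tokenizer, and the naive seeding are all hypothetical choices made for illustration, and the multi-threaded preprocessing is not modeled.

```python
import math
from collections import Counter


def to_sparse_unit_vector(text):
    """Map a document to a sparse term-frequency vector with unit L2 norm."""
    counts = Counter(text.lower().split())  # assumption: whitespace tokenizer
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def sparse_dot(doc_vec, centroid):
    """Dot product that visits only the nonzero entries of the document."""
    return sum(w * centroid.get(term, 0.0) for term, w in doc_vec.items())


def spherical_kmeans(docs, k, iterations=10):
    """Cluster sparse unit vectors by cosine similarity (illustrative only)."""
    centroids = [dict(d) for d in docs[:k]]  # assumption: seed with first k docs
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(k), key=lambda j: sparse_dot(d, centroids[j]))
            for d in docs
        ]
        # Update step: sum the member vectors, then rescale to unit length.
        sums = [dict() for _ in range(k)]
        for d, j in zip(docs, labels):
            for term, w in d.items():
                sums[j][term] = sums[j].get(term, 0.0) + w
        for j, s in enumerate(sums):
            norm = math.sqrt(sum(w * w for w in s.values()))
            if norm:  # keep the previous centroid if a cluster is empty
                centroids[j] = {t: w / norm for t, w in s.items()}
    return labels


texts = [
    "sparse matrix eigenvalue solver",
    "protein folding simulation",
    "sparse matrix factorization solver",
    "protein structure simulation",
]
vectors = [to_sparse_unit_vector(t) for t in texts]
labels = spherical_kmeans(vectors, k=2)
```

On this toy input the two "sparse matrix" documents share one label and the two "protein" documents share the other. Each assignment pass touches only the nonzero entries of every document, so for a fixed number of clusters and iterations the work grows with the total number of nonzeros; this is one plausible reading of how sparsity can keep the cost linear in the collection size.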

Citation (APA)

Dhillon, I. S., Fan, J., & Guan, Y. (2001). Efficient Clustering of Very Large Document Collections (pp. 357–381). https://doi.org/10.1007/978-1-4615-1733-7_20
