Efficient Clustering of Very Large Document Collections

  • Dhillon I
  • Fan J
  • Guan Y

Abstract

An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
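
As a rough illustration of the vector space idea mentioned in the abstract, the sketch below represents each document as a sparse, unit-length term-frequency vector and assigns documents to the most similar centroid using a dot product that only visits nonzero entries. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`to_sparse_vector`, `sparse_dot`, `assign`), the whitespace tokenizer, the plain term-frequency weighting, the cosine-similarity assignment rule, and the toy documents are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def to_sparse_vector(text):
    """Map a document to a unit-length sparse term-frequency vector (a dict).

    Assumption: whitespace tokenization and raw term counts; the paper's
    actual preprocessing and weighting are not described in the abstract.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}

def sparse_dot(u, v):
    """Dot product that iterates only over the nonzero entries of the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def assign(vectors, centroids):
    """Assign each document to the centroid with the highest cosine similarity."""
    return [
        max(range(len(centroids)), key=lambda j: sparse_dot(x, centroids[j]))
        for x in vectors
    ]

# Toy data, invented for this example.
docs = [
    "sparse matrix factorization",
    "sparse matrix algorithms",
    "protein folding simulation",
]
vectors = [to_sparse_vector(d) for d in docs]

# Hypothetical initial centroids: the first and last documents.
centroids = [vectors[0], vectors[2]]
labels = assign(vectors, centroids)
```

Because each vector stores only the terms that actually occur in a document, the cost of one similarity computation depends on the number of nonzero entries rather than on the vocabulary size, which is the kind of sparsity the abstract refers to.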

Cite

Dhillon, I. S., Fan, J., & Guan, Y. (2001). Efficient Clustering of Very Large Document Collections (pp. 357–381). https://doi.org/10.1007/978-1-4615-1733-7_20
