Representative based document clustering

Arko Banerjee; Arun K. Pujari

Conference Proceedings

Smart Innovation, Systems and Technologies (2014) 27(VOL 1) 403-411

DOI: 10.1007/978-3-319-07353-8_47

Abstract

In this paper we propose a novel approach to document clustering by introducing a representative-based document similarity model. The model treats a document as an ordered sequence of words and partitions it into chunks in order to capture proximity information between words. Chunks are subsequences in a document that have low internal entropy and high boundary entropy. A chunk can be a phrase, a word, or a part of a word. We implement a linear-time unsupervised algorithm that segments a sequence of words into chunks. Chunks that occur frequently are considered representatives of the document set. The representative-based document similarity model, which contains a term-document matrix with respect to the representatives, is a compact representation of the vector space model that improves the quality of document clustering over traditional methods. © Springer International Publishing Switzerland 2014.
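The last two steps described in the abstract (selecting frequent chunks as representatives and building a matrix over them) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the documents have already been segmented into chunks, and it treats "occurs frequently" as a document-frequency threshold, which is an assumption. The names `select_representatives`, `representative_matrix`, `cosine`, and `min_doc_freq` are hypothetical.

```python
from collections import Counter
import math


def select_representatives(chunked_docs, min_doc_freq=2):
    """Keep chunks found in at least `min_doc_freq` documents (assumed criterion)."""
    doc_freq = Counter()
    for chunks in chunked_docs:
        doc_freq.update(set(chunks))  # count each chunk once per document
    return sorted(c for c, n in doc_freq.items() if n >= min_doc_freq)


def representative_matrix(chunked_docs, representatives):
    """Build one row of chunk counts per document, one column per representative."""
    index = {rep: i for i, rep in enumerate(representatives)}
    matrix = []
    for chunks in chunked_docs:
        row = [0] * len(representatives)
        for chunk in chunks:
            if chunk in index:
                row[index[chunk]] += 1
        matrix.append(row)
    return matrix


def cosine(u, v):
    """Cosine similarity between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy input: each document is a list of chunks from some upstream segmenter.
chunked_docs = [
    ["document clustering", "vector space model", "entropy"],
    ["document clustering", "entropy", "entropy"],
    ["neural network", "vector space model"],
]
reps = select_representatives(chunked_docs)
matrix = representative_matrix(chunked_docs, reps)
```

The resulting rows can be passed to any standard clustering routine. Because only frequent chunks become columns, the matrix is narrower than a full vocabulary-based one, which is the compactness the abstract refers to.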

Cite

Banerjee, A., & Pujari, A. K. (2014). Representative based document clustering. In Smart Innovation, Systems and Technologies (Vol. 27, pp. 403–411). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-07353-8_47
