Representative based document clustering

Abstract

In this paper we propose a novel approach to document clustering by introducing a representative-based document similarity model that treats a document as an ordered sequence of words and partitions it into chunks to capture proximity information between words. Chunks are subsequences of a document that have low internal entropy and high boundary entropy. A chunk can be a phrase, a word, or part of a word. We implement a linear-time unsupervised algorithm that segments a sequence of words into chunks. Chunks that occur frequently are taken as representatives of the document set. The representative-based document similarity model, which contains a term-document matrix defined with respect to the representatives, is a compact representation of the vector space model that improves the quality of document clustering over traditional methods. © Springer International Publishing Switzerland 2014.
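
The pipeline outlined in the abstract (entropy-based chunking, frequent chunks as representatives, a matrix of representative counts per document) can be illustrated with a rough sketch. This is not the authors' algorithm: it approximates boundary entropy by the entropy of the next-word distribution, cuts wherever that entropy exceeds a fixed threshold, and works only at the word level. The function names and the `threshold` and `min_count` parameters are hypothetical, chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(docs):
    """Entropy (in bits) of the next-word distribution after each word."""
    followers = defaultdict(Counter)
    for words in docs:
        for cur, nxt in zip(words, words[1:]):
            followers[cur][nxt] += 1
    entropy = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        entropy[word] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def segment(words, entropy, threshold):
    """Assumed rule: close a chunk after any word whose successor
    entropy exceeds the threshold (a high-entropy boundary)."""
    chunks, current = [], []
    for w in words:
        current.append(w)
        if entropy.get(w, 0.0) > threshold:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks


def representative_matrix(docs, threshold=0.5, min_count=2):
    """Return (representatives, matrix), where matrix[i][j] counts
    how often representative j occurs as a chunk in document i."""
    entropy = successor_entropy(docs)
    chunked = [segment(d, entropy, threshold) for d in docs]
    totals = Counter(c for chunks in chunked for c in chunks)
    # Representatives: chunks seen at least min_count times overall.
    reps = sorted(c for c, n in totals.items() if n >= min_count)
    matrix = []
    for chunks in chunked:
        counts = Counter(chunks)
        matrix.append([counts[r] for r in reps])
    return reps, matrix


docs = [
    "new york is big".split(),
    "new york is old".split(),
    "new york was big".split(),
]
reps, matrix = representative_matrix(docs)
print(reps)    # [('is',), ('new', 'york')]
print(matrix)  # [[1, 1], [1, 1], [0, 1]]
```

In this toy corpus "new" is always followed by "york", so the pair stays together as one chunk, while the varied continuations after "york" and "is" create boundaries. The resulting matrix has one column per representative rather than one per vocabulary word, which is the sense in which such a representation is more compact than a plain term-document matrix.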

Cite

Banerjee, A., & Pujari, A. K. (2014). Representative based document clustering. In Smart Innovation, Systems and Technologies (Vol. 27, pp. 403–411). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-07353-8_47