In this paper we propose a novel approach to document clustering: a representative-based document similarity model that treats a document as an ordered sequence of words and partitions it into chunks to capture proximity information between words. Chunks are subsequences of a document that have low internal entropy and high boundary entropy. A chunk can be a phrase, a word, or part of a word. We implement a linear-time unsupervised algorithm that segments a sequence of words into chunks. Chunks that occur frequently are taken as representatives of the document set. The representative-based document similarity model, which consists of a term-document matrix defined with respect to the representatives, is a compact representation of the vector space model that improves the quality of document clustering over traditional methods. © Springer International Publishing Switzerland 2014.
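The pipeline described in the abstract (segment by entropy, keep frequent chunks, build a matrix over them) can be illustrated with a minimal Python sketch. This is not the authors' algorithm. It is a simplification that assumes a boundary is placed after any word whose next-word distribution has high entropy, and it ignores internal entropy entirely. The function names, the `threshold` and `min_count` parameters, and their default values are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(docs):
    """Entropy (in bits) of the next-word distribution after each word.

    Assumption: this single-word context stands in for the paper's
    boundary entropy; the actual method is not specified in the abstract.
    """
    followers = defaultdict(Counter)
    for words in docs:
        for cur, nxt in zip(words, words[1:]):
            followers[cur][nxt] += 1
    entropy = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        entropy[word] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def segment(words, entropy, threshold):
    """Single pass over the words: close a chunk after a high-entropy word."""
    chunks, current = [], []
    for w in words:
        current.append(w)
        if entropy.get(w, 0.0) > threshold:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks


def representative_matrix(docs, threshold=0.5, min_count=2):
    """Return (representatives, chunk-by-document count matrix).

    Representatives are chunks occurring at least `min_count` times
    across the whole document set (hypothetical cutoff).
    """
    entropy = successor_entropy(docs)
    chunked = [segment(d, entropy, threshold) for d in docs]
    totals = Counter(c for chunks in chunked for c in chunks)
    reps = sorted(c for c, n in totals.items() if n >= min_count)
    matrix = []
    for chunks in chunked:
        counts = Counter(chunks)
        matrix.append([counts[r] for r in reps])
    return reps, matrix
```

Each row of the returned matrix describes one document in terms of the representatives only, so it has far fewer columns than a full word-level term-document matrix. The rows can then be passed to any standard clustering method.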
CITATION STYLE
Banerjee, A., & Pujari, A. K. (2014). Representative based document clustering. In Smart Innovation, Systems and Technologies (Vol. 27, pp. 403–411). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-07353-8_47