A LDA-based algorithm for length-aware text clustering

Xinhuan Chen; Yong Zhang; Yanshen Yin; Chao Li; Chunxiao Xing

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014) 8709 LNCS 503-510

DOI: 10.1007/978-3-319-11116-2_45

Abstract

The proliferation of text on the Web presents great challenges for knowledge discovery in text collections. Clustering is a powerful tool for organizing information and recognizing its structure. Most text clustering techniques are designed to handle either long or short texts. However, many real-life collections contain both long and short texts, that is, mixed-length texts. Current text clustering techniques are unsatisfactory because they do not distinguish the sparseness and high dimensionality of mixed-length texts. In this paper, we propose a novel approach, Length-Aware Dual Latent Dirichlet Allocation (ADLDA), which clusters mixed-length texts by obtaining auxiliary knowledge from long (short) texts for short (long) texts in the collection. The degree of mutual assistance is based on the ratio of long texts to short texts in the corpus. Experimental results on real datasets show that our approach achieves superior performance over other state-of-the-art text clustering approaches for mixed-length texts. © 2014 Springer International Publishing Switzerland.
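
The abstract states that the amount of knowledge shared between long and short texts depends on the ratio of long to short texts in the corpus. The sketch below illustrates only that preliminary step, under stated assumptions: the abstract gives no length threshold or ratio formula, so the whitespace token count, the `threshold` default, and the function names `split_by_length` and `long_short_ratio` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def split_by_length(docs: List[str], threshold: int = 50) -> Tuple[List[str], List[str]]:
    """Partition documents into (long, short) by whitespace token count.

    The threshold is an assumed value for illustration only.
    """
    long_docs = [d for d in docs if len(d.split()) >= threshold]
    short_docs = [d for d in docs if len(d.split()) < threshold]
    return long_docs, short_docs


def long_short_ratio(long_docs: List[str], short_docs: List[str]) -> float:
    """Return the number of long documents divided by the number of short ones."""
    if not short_docs:
        # No short texts at all: the ratio is unbounded.
        return float("inf")
    return len(long_docs) / len(short_docs)


if __name__ == "__main__":
    corpus = ["short tweet text", "word " * 60, "another brief note"]
    long_docs, short_docs = split_by_length(corpus)
    print(len(long_docs), len(short_docs), long_short_ratio(long_docs, short_docs))
```

How such a ratio would then weight the auxiliary knowledge inside the dual LDA model is described in the full paper, not in this abstract, so it is left out here.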

Cite

Chen, X., Zhang, Y., Yin, Y., Li, C., & Xing, C. (2014). A LDA-based algorithm for length-aware text clustering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8709 LNCS, pp. 503–510). Springer Verlag. https://doi.org/10.1007/978-3-319-11116-2_45