Enabling hierarchical Dirichlet processes to work better for short texts at large scale

Abstract

Analyzing texts from social media poses many challenges, including their shortness, dynamic nature, and huge size. Short texts do not provide enough information, so statistical models often fail to work well on them. In this paper, we present a very simple approach (namely, bag-of-biterms) that helps statistical models such as Hierarchical Dirichlet Processes (HDP) work well with short texts. By using both terms (words) and biterms to represent documents, bag-of-biterms (BoB) provides significant benefits: (1) it naturally lengthens the representation and thus reduces the bad effects of shortness; (2) it makes posterior inference in a large class of probabilistic models, including HDP, more tractable; (3) no modification of existing models/methods is necessary, so BoB can be easily employed in a wide range of statistical models. To evaluate these benefits of BoB, we use Online HDP, which can deal with dynamic and massive text collections, and run experiments on three large corpora of short texts crawled from Twitter, Yahoo Q&A, and the New York Times. Extensive experiments show that BoB helps HDP work significantly better in both predictiveness and quality.
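As a rough illustration of the bag-of-biterms idea (a minimal sketch, not the authors' implementation), the Python snippet below represents a short document by its words plus all unordered pairs of distinct word types that co-occur in it; forming biterms from distinct word types, rather than from every pair of token occurrences, is an assumption made here for brevity.

    from itertools import combinations
    from collections import Counter

    def bag_of_biterms(tokens):
        # Start from the ordinary bag-of-words counts of the document.
        counts = Counter(tokens)
        # Add every unordered pair of distinct word types that co-occur
        # in the document as an extra pseudo-token (a "biterm").
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            counts[(w1, w2)] += 1
        return counts

    # A five-word tweet-like document: 5 terms plus C(5, 2) = 10 biterms,
    # i.e. a representation three times longer than the original text.
    doc = ["hdp", "topic", "model", "short", "text"]
    print(bag_of_biterms(doc))

Because the output is still just a bag of (pseudo-)tokens, an existing model such as Online HDP can consume it without any modification, which is the third benefit claimed in the abstract.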

CITATION STYLE

APA

Mai, K., Mai, S., Nguyen, A., Van Linh, N., & Than, K. (2016). Enabling hierarchical Dirichlet processes to work better for short texts at large scale. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9652 LNAI, pp. 431–442). Springer Verlag. https://doi.org/10.1007/978-3-319-31750-2_34
