Enabling hierarchical Dirichlet processes to work better for short texts at large scale

Abstract

Analyzing texts from social media poses many challenges, including their shortness, dynamic nature, and huge size. Short texts do not provide enough information, so statistical models often fail to work well on them. In this paper, we present a very simple approach (namely, bag-of-biterms) that helps statistical models such as Hierarchical Dirichlet Processes (HDP) work well with short texts. By using both terms (words) and biterms to represent documents, bag-of-biterms (BoB) provides significant benefits: (1) it naturally lengthens the representation and thus reduces the bad effects of shortness; (2) it makes posterior inference in a large class of probabilistic models, including HDP, more tractable; (3) no modification of existing models/methods is necessary, so BoB can be easily employed in a wide range of statistical models. To evaluate these benefits of BoB, we use Online HDP, which can deal with dynamic and massive text collections, and run experiments on three large corpora of short texts crawled from Twitter, Yahoo Q&A, and the New York Times. Extensive experiments show that BoB helps HDP work significantly better in both predictiveness and quality.
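As a rough illustration of the bag-of-biterms idea (a minimal sketch, not the authors' implementation), the Python snippet below represents a short document by its words plus all unordered pairs of distinct word types that co-occur in it; forming biterms from distinct word types, rather than from every pair of token occurrences, is an assumption made here for brevity.

    from itertools import combinations
    from collections import Counter

    def bag_of_biterms(tokens):
        # Start from the ordinary bag-of-words counts of the document.
        counts = Counter(tokens)
        # Add every unordered pair of distinct word types that co-occur
        # in the document as an extra pseudo-token (a "biterm").
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            counts[(w1, w2)] += 1
        return counts

    # A five-word tweet-like document: 5 terms plus C(5, 2) = 10 biterms,
    # i.e. a representation three times longer than the original text.
    doc = ["hdp", "topic", "model", "short", "text"]
    print(bag_of_biterms(doc))

Because the output is still just a bag of (pseudo-)tokens, an existing model such as Online HDP can consume it without any modification, which is the third benefit claimed in the abstract.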

CITATION STYLE

APA

Mai, K., Mai, S., Nguyen, A., Van Linh, N., & Than, K. (2016). Enabling hierarchical Dirichlet processes to work better for short texts at large scale. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9652 LNAI, pp. 431–442). Springer Verlag. https://doi.org/10.1007/978-3-319-31750-2_34
