Uncovering the thematic structure of social networking service (SNS) and blog posts is a crucial yet challenging task because of the severe data sparsity induced by short texts and diverse vocabulary use. This sparsity hinders topic inference with traditional LDA, which infers topics from document-level co-occurrence of words. To infer topics robustly in such contexts, we propose a latent concept topic model (LCTM). Unlike LDA, LCTM reveals topics via co-occurrence of latent concepts, which we introduce as latent variables to capture the conceptual similarity of words. More specifically, LCTM models each topic as a distribution over the latent concepts, where each latent concept is a localized Gaussian distribution over the word embedding space. Since the number of unique concepts in a corpus is often much smaller than the number of unique words, LCTM is less susceptible to data sparsity. Experiments on the 20 Newsgroups dataset show the effectiveness of LCTM in dealing with short texts, as well as its capability to handle held-out documents with a high proportion of out-of-vocabulary (OOV) words.
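The generative structure described in the abstract (a topic is a distribution over latent concepts, and a concept is a Gaussian over the embedding space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_document`, the toy parameter values, and the choice of a shared isotropic standard deviation `sigma` are assumptions made for illustration only, and the sketch covers sampling, not inference.

```python
import random


def sample_document(theta, phi, concept_means, sigma, n_words, rng):
    """Sample word embedding vectors for one document (illustrative only).

    theta:          topic proportions of the document, sums to 1
    phi:            phi[k][c] is the probability of concept c under topic k
    concept_means:  concept_means[c] is the mean vector of concept c
    sigma:          shared isotropic standard deviation (an assumption here)
    """
    draws = []
    for _ in range(n_words):
        # 1. Pick a topic from the document's topic proportions.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 2. Pick a latent concept from that topic's concept distribution.
        c = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        # 3. Draw an embedding vector from the concept's localized Gaussian.
        v = [rng.gauss(m, sigma) for m in concept_means[c]]
        draws.append((z, c, v))
    return draws


# Toy setup: 2 topics, 3 concepts, 2-dimensional embeddings (all hypothetical).
rng = random.Random(0)
theta = [0.7, 0.3]
phi = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
concept_means = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
doc = sample_document(theta, phi, concept_means, sigma=0.1, n_words=5, rng=rng)
```

Because a word is generated through its embedding rather than its vocabulary index, a word unseen during training can still be assigned to a nearby concept, which is the intuition behind the robustness to OOV words claimed above.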
CITATION STYLE
Hu, W., & Tsujii, J. (2016). A latent concept topic model for robust topic inference using word embeddings. In 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Short Papers (pp. 380–386). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p16-2062