Dirichlet process mixture models based topic identification for short text streams

Abstract

Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work is mostly based on regular news text, and the problem of scaling to short stories remains largely open. In addition, prior work performs topic identification on already separated stories, assuming story segmentation as a prerequisite, although segmentation is itself a challenging and critical task in TDT research. In this paper, we propose a Dirichlet Process Mixture Model (DPMM) based topic identification method that handles topic segmentation, topic detection, and tracking in a unified model and achieves reasonable results for short stories. We first present the DPMM and its application to the topic identification task. We then discuss two solutions specifically designed to address the sparseness problem associated with short stories. The first concerns the design of the algorithm flow: the processing unit of topic identification is first converted from a single short text to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. We then extend the DPMM to identify topics in spontaneous text streams by managing topic segmentation, topic detection, and tracking simultaneously. An attractive advantage of the DPMM is that the number of mixture components need not be fixed in advance and that no prior knowledge about the number or content of topics is required. Compared with other existing methods, it is therefore more suitable for streaming topic identification. Our empirical results on the TDT3 evaluation data verify that the DPMM is valid for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM. © 2011 IEEE.
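To illustrate why the number of mixture components does not have to be fixed in advance, the following is a minimal sketch of the Chinese restaurant process, a standard way to describe the clustering prior behind a Dirichlet process mixture. It is not the method of the paper: it models only the prior over topic assignments and omits the word likelihoods, the session-based processing, and the word-dependency extension described in the abstract. The function names `crp_assign` and `crp_partition` and the concentration parameter `alpha` are hypothetical names chosen for this example.

```python
import random
from typing import List, Tuple


def crp_assign(counts: List[int], alpha: float, rng: random.Random) -> int:
    """Sample a topic index for a new item under a Chinese restaurant process prior.

    An existing topic k is chosen with probability proportional to counts[k];
    a new topic is opened with probability proportional to alpha.
    Returning len(counts) means "open a new topic".
    """
    total = sum(counts) + alpha
    r = rng.random() * total
    cumulative = 0.0
    for k, n_k in enumerate(counts):
        cumulative += n_k
        if r < cumulative:
            return k
    return len(counts)


def crp_partition(n_items: int, alpha: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Assign n_items sequentially, letting the number of topics grow as needed.

    Assumption: items arrive one at a time, as in a stream, and only the
    prior is used (a full DPMM would also weight each topic by how well
    its word distribution explains the item).
    """
    rng = random.Random(seed)
    counts: List[int] = []   # number of items currently assigned to each topic
    labels: List[int] = []   # topic index chosen for each item, in arrival order
    for _ in range(n_items):
        k = crp_assign(counts, alpha, rng)
        if k == len(counts):
            counts.append(1)  # a previously unseen topic is created
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_partition(n_items=20, alpha=1.0, seed=42)
    print("labels:", labels)
    print("topic sizes:", counts)
```

In this sketch a larger `alpha` makes new topics more likely, so the number of topics is an outcome of the data stream rather than a parameter set beforehand.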

Wang, C., Yuan, C., Wang, X., & Xue, W. (2011). Dirichlet process mixture models based topic identification for short text streams. In NLP-KE 2011 - Proceedings of the 7th International Conference on Natural Language Processing and Knowledge Engineering (pp. 80–87). https://doi.org/10.1109/NLPKE.2011.6138173
