Dirichlet process mixture models based topic identification for short text streams
Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work is mostly based on regular news text, and the problem of scaling to short stories remains largely open. Moreover, prior work performs topic identification on already-separated stories, treating story segmentation as a prerequisite, although segmentation is itself a challenging and critical task in TDT research. In this paper, we propose a topic identification method based on the Dirichlet Process Mixture Model (DPMM), which handles topic segmentation, topic detection, and topic tracking in a unified model and achieves reasonable results on short stories. We first present the DPMM and its application to the topic identification task. We then discuss two solutions specifically designed to address the sparseness problem associated with short stories. The first concerns the design of the algorithm flow: the processing unit for topic identification is first changed from a single short text to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. Next, we extend the DPMM to identify topics in spontaneous text streams by handling topic segmentation, topic detection, and topic tracking simultaneously. An attractive advantage of the DPMM is that the number of mixture components need not be fixed in advance, and no prior knowledge about the number or content of topics is required. Compared with existing methods, it is therefore better suited to streaming topic identification. Our empirical results on the TDT3 evaluation data show that the DPMM is effective for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM.
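To illustrate why a DPMM does not require the number of topics in advance, the following is a minimal sketch, not the method evaluated in this paper. It assumes a Chinese restaurant process prior with concentration `alpha`, a Dirichlet-smoothed unigram model per topic with parameter `beta`, and a greedy one-pass hard assignment in place of proper posterior inference. All function names, parameters, and the toy sessions are hypothetical, and word dependency (the extended DPMM) is not modeled.

```python
import math
from collections import Counter


def crp_prior(topic_sizes, alpha):
    """Chinese restaurant process prior for the next session.

    Returns one probability per existing topic, followed by the
    probability of opening a new topic.
    """
    total = sum(topic_sizes) + alpha
    return [n / total for n in topic_sizes] + [alpha / total]


def log_predictive(words, topic_counts, vocab_size, beta):
    """Log probability of a word sequence under a Dirichlet-multinomial topic.

    Words are scored one at a time, and each scored word is added to a
    local copy of the counts before the next word is scored.
    """
    counts = Counter(topic_counts)  # local copy; the topic itself is not modified
    total = sum(counts.values())
    logp = 0.0
    for w in words:
        logp += math.log((counts[w] + beta) / (total + vocab_size * beta))
        counts[w] += 1
        total += 1
    return logp


def assign_stream(sessions, vocab_size, alpha=1.0, beta=0.1):
    """Greedily assign each session (a list of tokens) to a topic.

    Assumption: the highest-scoring topic is chosen at each step instead of
    sampling, which keeps the sketch deterministic.
    """
    topics = []  # word counts per topic
    sizes = []   # number of sessions per topic
    labels = []
    for words in sessions:
        prior = crp_prior(sizes, alpha)
        candidates = topics + [Counter()]  # the last candidate is a brand-new topic
        scores = [
            math.log(p) + log_predictive(words, c, vocab_size, beta)
            for p, c in zip(prior, candidates)
        ]
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(topics):  # a new topic was chosen, so create it
            topics.append(Counter())
            sizes.append(0)
        topics[k].update(words)
        sizes[k] += 1
        labels.append(k)
    return labels


toy_sessions = [
    ["goal", "match", "team"],
    ["match", "team", "score"],
    ["vote", "election", "party"],
    ["election", "party", "poll"],
]
toy_labels = assign_stream(toy_sessions, vocab_size=8)
```

On these toy sessions the sketch opens a second topic when the vocabulary shifts, so `toy_labels` is `[0, 0, 1, 1]`. The number of topics is never passed in: it grows only when the new-topic candidate scores higher than every existing topic.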