Conference proceedings

Dirichlet process mixture models based topic identification for short text streams

Wang C, Yuan C, Wang X, Xue W

NLP-KE 2011 - Proceedings of the 7th International Conference on Natural Language Processing and Knowledge Engineering (2011) pp. 80-87



Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work is mostly based on regular news text, and the problem of scaling to short stories remains largely open. Moreover, prior work performs topic identification on already separated stories, assuming story segmentation as a prerequisite, which is itself a challenging yet critical task in TDT research. In this paper, we propose a Dirichlet Process Mixture Model (DPMM) based topic identification method that handles topic segmentation, topic detection, and topic tracking in a unified model and achieves reasonable results for short stories. We first present the DPMM and its application to the topic identification task. We then discuss two solutions designed specifically to address the sparseness problem associated with short stories. The first lies in the design of the algorithm flow: the processing unit of topic identification is first converted from a single short text to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. Next, we extend the DPMM to identify topics in spontaneous text streams by managing topic segmentation, topic detection, and topic tracking simultaneously. An attractive advantage of the DPMM is that the number of mixture components need not be fixed in advance, and no prior knowledge about the number or content of topics is required. Compared with other existing methods, it is therefore more suitable for streaming topic identification. Our empirical results on the TDT3 evaluation data verify that the DPMM is valid for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM.
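To illustrate why a Dirichlet process mixture does not need the number of topics fixed in advance, the following is a minimal sketch, not the authors' method. It assumes a Chinese restaurant process prior with concentration `alpha`, a symmetric Dirichlet prior `beta` over a vocabulary of assumed size `vocab_size`, and a greedy single-pass assignment (each incoming text goes to its highest-scoring topic) in place of whatever inference procedure the paper actually uses. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def log_predictive(doc_counts, topic_counts, topic_total, vocab_size, beta):
    """Log Dirichlet-multinomial predictive probability of a bag of words
    given the word counts already assigned to a topic (the multinomial
    coefficient is dropped because it is the same for every topic)."""
    doc_len = sum(doc_counts.values())
    base = topic_total + vocab_size * beta
    score = math.lgamma(base) - math.lgamma(base + doc_len)
    for word, count in doc_counts.items():
        prior = topic_counts.get(word, 0) + beta
        score += math.lgamma(prior + count) - math.lgamma(prior)
    return score


def assign_stream(docs, alpha=1.0, beta=0.1, vocab_size=1000):
    """Greedy one-pass topic assignment for a stream of tokenized texts.

    Existing topic k is scored by log(n_k) + log predictive likelihood;
    a brand-new topic is scored by log(alpha) + log predictive likelihood
    under the prior alone. A new topic is opened whenever it scores best,
    so the number of topics grows with the data.
    """
    topics = []  # each topic: {"n": texts assigned, "counts": Counter, "total": tokens}
    labels = []
    for tokens in docs:
        doc_counts = Counter(tokens)
        # Start with the "open a new topic" option as the current best.
        best_k = len(topics)
        best = math.log(alpha) + log_predictive(doc_counts, {}, 0, vocab_size, beta)
        for k, topic in enumerate(topics):
            score = math.log(topic["n"]) + log_predictive(
                doc_counts, topic["counts"], topic["total"], vocab_size, beta
            )
            if score > best:
                best, best_k = score, k
        if best_k == len(topics):
            topics.append({"n": 0, "counts": Counter(), "total": 0})
        chosen = topics[best_k]
        chosen["n"] += 1
        chosen["counts"].update(doc_counts)
        chosen["total"] += len(tokens)
        labels.append(best_k)
    return labels


stream = [
    ["earthquake", "japan", "tsunami"],
    ["election", "vote", "senate"],
    ["tsunami", "japan", "warning"],
    ["senate", "vote", "bill"],
]
print(assign_stream(stream))  # → [0, 1, 0, 1]
```

In this toy stream the second text shares no words with the first, so opening a new topic scores higher than joining the existing one; the third and fourth texts then join the topics whose vocabulary they overlap. Sampling the assignment instead of taking the maximum would be closer to standard DPMM inference, at the cost of nondeterministic output.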

Author-supplied keywords

  • DPMM
  • Dirichlet Process Mixture Model
  • data streams
  • extended DPMM
  • static short text
  • topic identification


Authors

  • Chan Wang
  • Caixia Yuan
  • Xiaojie Wang
  • Wenwei Xue
