Researchers who work with social media data want to understand the discussions occurring in and about their respective fields. These domain experts often turn to topic models to see the entire landscape of the conversation, but unsupervised topic models frequently produce topic sets that miss topics the experts expect or want to see. To solve this problem, we propose the Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large domain-specific social media data sets in mind. The input to GTM is a set of topics of interest to the user, each with a small number of words or phrases that belong to it. These seed topics guide the topic generation process and can be augmented interactively: the user expands the seed word list as the model surfaces new relevant words for each topic. GTM uses a novel initialization and a new sampling algorithm, Generalized Polya Urn (GPU) seed word sampling, to produce a topic set that includes expanded seed topics as well as new unsupervised topics. We demonstrate the robustness of GTM on open-ended responses from a public opinion survey and four domain-specific Twitter data sets.
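To make the seed-topic input and the Polya urn idea concrete, here is a minimal Python sketch. It is not the GTM algorithm from the paper: it only illustrates the general generalized Polya urn principle, in which counting one word for a topic also adds a fractional count for related words. The seed topics, the `gpu_update` helper, and the `promotion` weight are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical seed topics: topic name -> a few seed words (illustrative only).
seed_topics = {
    "vaccines": ["vaccine", "booster", "dose"],
    "schools": ["school", "teacher", "classroom"],
}

def gpu_update(topic_word_counts, topic, word, seed_topics, promotion=0.3):
    """Add one count for `word` in `topic`. If `word` is a seed word of that
    topic, also add `promotion` to every other seed word of the same topic.
    The promotion weight is an assumed value, not one taken from the paper."""
    topic_word_counts[topic][word] += 1.0
    seeds = seed_topics.get(topic, [])
    if word in seeds:
        for other in seeds:
            if other != word:
                topic_word_counts[topic][other] += promotion
    return topic_word_counts

# Nested counts: topic -> word -> (possibly fractional) count.
counts = defaultdict(lambda: defaultdict(float))
gpu_update(counts, "vaccines", "booster", seed_topics)      # seed word: promotes the other seeds
gpu_update(counts, "vaccines", "appointment", seed_topics)  # non-seed word: plain count only
```

In this sketch, a seed word assignment pulls the topic's other seed words along with it, which is one plausible way seed words could keep a guided topic coherent, while non-seed words are counted as in an ordinary sampler.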
Churchill, R., Singh, L., Ryan, R., & Davis-Kean, P. (2022). A Guided Topic-Noise Model for Short Texts. In WWW 2022 - Proceedings of the ACM Web Conference 2022 (pp. 2870–2878). Association for Computing Machinery, Inc. https://doi.org/10.1145/3485447.3512007