Optimizing semantic coherence in topic models

David Mimno; Hanna M. Wallach; Edmund Talley; Miriam Leenders; Andrew McCallum

Conference Proceedings

EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (2011) 262-272

Abstract

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH). © 2011 Association for Computational Linguistics.

Citation (APA)

Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 262–272).
