Document clustering meets topic modeling with word embeddings

Gianni Costa; Riccardo Ortale

Conference Proceedings (Open Access)

Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020 (2020) 244-252

DOI: 10.1137/1.9781611976236.28

12 Citations

21 Readers

Abstract

We propose a new statistical-learning approach that marries topic modeling and document clustering. In particular, we develop a Bayesian generative model of text collections in which the two tasks are incorporated as coupled latent factors that govern document wording. The wording itself consists of word embeddings, so as to capture the semantic and syntactic regularities among words. Collapsed Gibbs sampling is derived mathematically and implemented algorithmically, along with parameter estimation, with the aim of jointly performing topic modeling and document clustering through Bayesian reasoning. Comparative tests on benchmark real-world corpora reveal the effectiveness of the devised approach in clustering collections of text documents and coherently recovering their semantics.

Cite

Costa, G., & Ortale, R. (2020). Document clustering meets topic modeling with word embeddings. In Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020 (pp. 244–252). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611976236.28
