Analyses of Multi-collection Corpora via Compound Topic Modeling

Clint P. George; Wei Xia; George Michailidis

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019) 11943 LNCS 205-218

DOI: 10.1007/978-3-030-37599-7_18

Abstract

Popular probabilistic topic models have typically centered on a single text collection, which is insufficient for comparative text analyses. We consider a setting where the corpus can be partitioned into subcollections. All subcollections share a single set of topics, but topic proportions vary across collections. We propose the compound latent Dirichlet allocation (cLDA) model, which encourages generalizability, depends less on user-input parameters, and incorporates any prior knowledge of the corpus organization structure. For parameter estimation, we study Markov chain Monte Carlo (MCMC) and variational inference approaches extensively and suggest an efficient MCMC method. We evaluate cLDA on both synthetic and real-world corpora, where it shows superior performance over state-of-the-art models.
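The setting described in the abstract (one shared set of topics, with topic proportions that vary by collection) can be illustrated with a small generative sketch. The abstract does not spell out the model's hierarchy, so everything below is an assumption made for illustration: the function name `sample_corpus`, the hyperparameter names `alpha`, `gamma`, and `eta`, and the choice to draw each document's topic proportions from a Dirichlet centered on its collection's mixture. It is one plausible reading of the setting, not the authors' specification, and it does not cover the inference methods.

```python
import numpy as np

def sample_corpus(n_collections=2, docs_per_collection=3, doc_len=20,
                  n_topics=4, vocab_size=30,
                  alpha=1.0, gamma=50.0, eta=0.5, seed=0):
    """Sample a toy multi-collection corpus (hypothetical sketch)."""
    rng = np.random.default_rng(seed)

    # Topics are shared by every collection: one word distribution per topic.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    corpus = []
    for j in range(n_collections):
        # Assumed: each collection has its own mixture over the shared topics.
        collection_mix = rng.dirichlet(np.full(n_topics, alpha))
        for _ in range(docs_per_collection):
            # Assumed: document proportions concentrate around the
            # collection mixture; larger gamma means less variation.
            theta = rng.dirichlet(gamma * collection_mix + 1e-12)
            # Draw a topic for each token, then a word from that topic.
            z = rng.choice(n_topics, size=doc_len, p=theta)
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            corpus.append((j, words))
    return topics, corpus

if __name__ == "__main__":
    topics, corpus = sample_corpus()
    for collection_id, words in corpus:
        print(collection_id, words[:10])
```

In this sketch, documents from the same collection tend to have similar topic proportions while all documents draw words from the same topics, which is the kind of structure a comparative analysis across collections would try to recover.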

Citation (APA)

George, C. P., Xia, W., & Michailidis, G. (2019). Analyses of Multi-collection Corpora via Compound Topic Modeling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11943 LNCS, pp. 205–218). Springer. https://doi.org/10.1007/978-3-030-37599-7_18
