Latent Dirichlet allocation for automatic document categorization

István Bíró; Jácint Szabó

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2009) 5782 LNAI(PART 2) 430-441

DOI: 10.1007/978-3-642-04174-7_28

14 Citations

42 Readers

Abstract

In this paper we introduce and evaluate a technique for applying latent Dirichlet allocation (LDA) to supervised semantic categorization of documents. In our setup, each category is assigned its own collection of topics, and for a labeled training document only topics from its category are sampled. Thus, in contrast to classical LDA, which processes the entire corpus at once, we essentially build a separate LDA model for each category with category-specific topics, and then combine these topic collections into a unified LDA model. For an unseen document, the inferred topic distribution gives an estimate of how well the document fits each category. We use this method for Web document classification. Our key results are a 46% decrease in the 1-AUC value over tf.idf with SVM and a 43% decrease over the plain LDA baseline with SVM. With a careful vocabulary selection method and a heuristic that handles the effect that similar topics may arise in distinct categories, the improvement in 1-AUC is 83% over tf.idf with SVM and 82% over LDA with SVM. © 2009 Springer Berlin Heidelberg.
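The abstract states that an unseen document's inferred topic distribution indicates how well it fits each category, and it reports gains as relative decreases in 1-AUC. The following is a minimal Python sketch of one plausible reading of both ideas, not the authors' implementation: it assumes that a category's score is the sum of the inferred weights of the topics assigned to that category, and that the percentage figures are relative reductions of 1-AUC against a baseline. The function names, category labels, topic weights, and AUC values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def category_scores(topic_dist, topic_to_category):
    """Sum a document's inferred topic weights per category.

    Assumption: in the unified model every topic belongs to exactly one
    category, and the summed weight serves as the category fit score.
    """
    if len(topic_dist) != len(topic_to_category):
        raise ValueError("one category label is required per topic")
    scores = defaultdict(float)
    for weight, category in zip(topic_dist, topic_to_category):
        scores[category] += weight
    return dict(scores)


def relative_decrease_in_one_minus_auc(baseline_auc, new_auc):
    """Relative reduction of the error-like quantity 1-AUC."""
    baseline_err = 1.0 - baseline_auc
    new_err = 1.0 - new_auc
    return (baseline_err - new_err) / baseline_err


# Hypothetical unified model: 2 topics for "arts", 3 topics for "science".
topic_to_category = ["arts", "arts", "science", "science", "science"]

# Hypothetical inferred topic distribution of one unseen document.
topic_dist = [0.05, 0.10, 0.40, 0.30, 0.15]

scores = category_scores(topic_dist, topic_to_category)

# Hypothetical AUC values, used only to show how the ratio is computed.
reduction = relative_decrease_in_one_minus_auc(baseline_auc=0.90, new_auc=0.95)
```

Under this reading, the per-category scores can be used directly as classifier outputs or passed as features to a downstream classifier. Because the improvement is measured on 1-AUC rather than on AUC itself, a large relative decrease can correspond to a small absolute change in AUC when the baseline AUC is already high.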

Cite

Bíró, I., & Szabó, J. (2009). Latent Dirichlet allocation for automatic document categorization. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5782 LNAI, pp. 430–441). https://doi.org/10.1007/978-3-642-04174-7_28
