Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise, in terms of both topic quality and classification accuracy.
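The robustness experiments described above require corpora corrupted at controlled noise levels. As a minimal sketch (not the paper's actual pipeline), OCR-style noise can be simulated by randomly substituting characters in tokens at a given rate; the helper below is hypothetical and for illustration only:

```python
import random

# Hypothetical helper (not from the paper): simulate OCR noise by
# substituting alphabetic characters with random letters at a given rate.
def add_ocr_noise(tokens, noise_rate, seed=0):
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    noisy = []
    for tok in tokens:
        chars = list(tok)
        for i, c in enumerate(chars):
            if c.isalpha() and rng.random() < noise_rate:
                chars[i] = rng.choice(alphabet)  # random substitution
        noisy.append("".join(chars))
    return noisy

clean = ["latent", "dirichlet", "allocation"]
print(add_ocr_noise(clean, noise_rate=0.2))
```

A topic model (LDA, Gaussian LDA, or ETM) can then be trained on corpora corrupted at increasing noise rates, and topic quality or downstream classification accuracy compared against the clean baseline.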
Zosa, E., Mutuvi, S., Granroth-Wilding, M., & Doucet, A. (2021). Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13133 LNCS, pp. 392–400). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-91669-5_30