Evaluating the impact of OCR errors on topic modeling

Stephen Mutuvi; Antoine Doucet; Moses Odeo; Adam Jatowt

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 11279 LNCS 3-14

DOI: 10.1007/978-3-030-04257-8_1

17 Citations

5 Readers

Abstract

Historical documents are challenging for character recognition for several reasons: fonts differ across materials, orthographic standards are lacking so that the same word may be spelled in different ways, the quality of the material varies, and lexicons of known historical spelling variants are unavailable. As a result, optical character recognition (OCR) of those documents often yields unsatisfactory accuracy, rendering the digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus of historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms, and we quantify the strength of this impact.

Cite

Mutuvi, S., Doucet, A., Odeo, M., & Jatowt, A. (2018). Evaluating the impact of OCR errors on topic modeling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11279 LNCS, pp. 3–14). Springer Verlag. https://doi.org/10.1007/978-3-030-04257-8_1
