Evaluating the impact of OCR errors on topic modeling

Abstract

Historical documents pose a challenge for character recognition for several reasons: font disparities across different materials, a lack of orthographic standards (the same words are spelled differently), material quality, and the unavailability of lexicons of known historical spelling variants. As a result, optical character recognition (OCR) of those documents often yields unsatisfactory accuracy, rendering digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus of text from historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms, and we quantify the strength of this impact.

Cite

Mutuvi, S., Doucet, A., Odeo, M., & Jatowt, A. (2018). Evaluating the impact of OCR errors on topic modeling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11279 LNCS, pp. 3–14). Springer Verlag. https://doi.org/10.1007/978-3-030-04257-8_1
