The Other C: Correcting OCR Words in the Presence of Diacritical Marks


Abstract

We propose a lexicon-based method for correcting a word recognized by an OCR engine (a classifier). This post-processing method was originally designed for languages that use diacritical marks, such as Portuguese. Because the classifier can confuse these marks with noise, keeping only the top hypothesis for each glyph of the original image can lead to wrong predictions. To cope with this, our method uses a filtering strategy to select the best hypotheses for each glyph, which are then combined to produce candidate queries. The best query is selected according to its confidence rate and its edit distance to the word. A similarity search over the lexicon, using the best query, then suggests a correction. Experiments show that the method considerably improves prediction accuracy when correcting Portuguese words.
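The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `keep` and `min_conf` filtering parameters, the confidence values, and the toy lexicon are all hypothetical. The sketch also makes two simplifying assumptions: the best query is the candidate closest (by Levenshtein distance) to some lexicon entry, with ties broken by higher confidence, and the similarity search is a plain linear scan of the lexicon.

```python
from itertools import product
import unicodedata


def nfc(text):
    # Compose base letters and diacritics so "ç" counts as one character.
    return unicodedata.normalize("NFC", text)


def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def filter_hypotheses(glyphs, keep=2, min_conf=0.1):
    # Hypothetical filter: keep up to `keep` hypotheses per glyph whose
    # confidence is at least `min_conf`; always keep the top one.
    filtered = []
    for hyps in glyphs:
        ranked = sorted(hyps, key=lambda h: h[1], reverse=True)
        kept = [h for h in ranked[:keep] if h[1] >= min_conf] or ranked[:1]
        filtered.append(kept)
    return filtered


def candidate_queries(filtered):
    # Every combination of kept hypotheses is one candidate query;
    # its confidence is assumed to be the product of glyph confidences.
    for combo in product(*filtered):
        conf = 1.0
        for _, c in combo:
            conf *= c
        yield nfc("".join(ch for ch, _ in combo)), conf


def nearest_word(query, lexicon):
    # Linear-scan similarity search (ties broken alphabetically).
    return min(lexicon, key=lambda w: (edit_distance(query, w), w))


def correct(glyphs, lexicon):
    lexicon = [nfc(w) for w in lexicon]
    queries = candidate_queries(filter_hypotheses(glyphs))
    # Assumed ranking: smallest distance to the lexicon, then confidence.
    best_query, _ = min(
        queries,
        key=lambda qc: (edit_distance(qc[0], nearest_word(qc[0], lexicon)),
                        -qc[1]),
    )
    return nearest_word(best_query, lexicon)


# Toy input: per-glyph (character, confidence) hypotheses for "ação".
glyphs = [
    [("a", 0.90)],
    [("c", 0.60), ("ç", 0.35)],
    [("a", 0.50), ("ã", 0.45)],
    [("o", 0.95)],
]
lexicon = ["ação", "acaso", "arco", "aço"]

print(nearest_word("acao", lexicon))  # top hypotheses only -> acaso
print(correct(glyphs, lexicon))       # with filtered hypotheses -> ação
```

In this toy case the top hypothesis per glyph spells "acao", whose nearest lexicon entry is "acaso". Keeping the second-ranked hypotheses for the two ambiguous glyphs lets the query "ação" be formed, which matches the lexicon exactly.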

Cite

Mergen, S. L. S., & de Abreu Schmidt, L. (2018). The Other C: Correcting OCR Words in the Presence of Diacritical Marks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11122 LNAI, pp. 222–230). Springer Verlag. https://doi.org/10.1007/978-3-319-99722-3_23
