The Other C: Correcting OCR Words in the Presence of Diacritical Marks

Sérgio Luís Sardi Mergen; Leonardo de Abreu Schmidt

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 11122 LNAI 222-230

DOI: 10.1007/978-3-319-99722-3_23

Abstract

We propose a lexicon-based method for correcting a word recognized by an OCR engine (a classifier). This post-processing method was originally designed for languages that use diacritical marks, such as Portuguese. Since the classifier can confuse these special marks with noise, wrong predictions can result if only the top hypothesis per glyph of the original image is preserved. To cope with this, our method uses a filtering strategy to select the best hypotheses for each glyph, which are then combined to produce candidate queries. The best query is selected in terms of confidence rate and edit distance to the word. A similarity search over the best query then suggests a correction. Experiments show that the method considerably improves prediction accuracy when correcting Portuguese words.
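
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep_ratio` and `max_keep` filtering parameters, the mean-confidence score, and the choice to rank queries by (edit distance to the nearest lexicon word, then confidence) are all assumptions made for illustration. The glyph hypotheses and confidences in the example are invented.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points (strings assumed NFC-normalized)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def filter_hypotheses(glyph, keep_ratio=0.5, max_keep=3):
    """Keep hypotheses close to the top one (hypothetical filtering rule)."""
    ranked = sorted(glyph, key=lambda h: h[1], reverse=True)
    top_conf = ranked[0][1]
    kept = [h for h in ranked if h[1] >= keep_ratio * top_conf]
    return kept[:max_keep]


def correct_word(glyphs, lexicon, keep_ratio=0.5, max_keep=3):
    """Return a lexicon word for a non-empty list of per-glyph hypotheses.

    glyphs: one list per glyph of (character, confidence) pairs.
    lexicon: non-empty collection of valid words.
    """
    filtered = [filter_hypotheses(g, keep_ratio, max_keep) for g in glyphs]
    best = None
    # Each combination of surviving hypotheses is one candidate query.
    for combo in product(*filtered):
        query = "".join(ch for ch, _ in combo)
        confidence = sum(conf for _, conf in combo) / len(combo)
        # Similarity search: nearest lexicon word (ties broken alphabetically).
        word = min(lexicon, key=lambda w: (edit_distance(query, w), w))
        # Assumed ranking: smaller distance first, then higher confidence.
        key = (edit_distance(query, word), -confidence)
        if best is None or key < best[0]:
            best = (key, query, word)
    return best[2]


# Invented example: the classifier slightly prefers the unaccented letters.
glyphs = [
    [("m", 0.98)],
    [("a", 0.95)],
    [("c", 0.60), ("ç", 0.55)],
    [("a", 0.58), ("ã", 0.52)],
]
lexicon = ["maçã", "mata", "casa"]

print(correct_word(glyphs, lexicon))              # maçã
print(correct_word(glyphs, lexicon, max_keep=1))  # mata
```

With `max_keep=1` only the top hypothesis per glyph survives, so the single query is "maca" and the nearest lexicon word is "mata". Keeping the runner-up hypotheses lets the accented query "maçã" be generated and matched exactly, which is the failure mode the abstract attributes to diacritical marks being mistaken for noise. The number of candidate queries grows multiplicatively with the hypotheses kept per glyph, so a tight filter matters for longer words.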

Citation (APA)

Mergen, S. L. S., & de Abreu Schmidt, L. (2018). The Other C: Correcting OCR Words in the Presence of Diacritical Marks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11122 LNAI, pp. 222–230). Springer Verlag. https://doi.org/10.1007/978-3-319-99722-3_23
