We propose a lexicon-based method for correcting words recognized by an OCR engine (a classifier). This post-processing method was originally designed for languages that use diacritical marks, such as Portuguese. Because the classifier can confuse these marks with noise, keeping only the top hypothesis for each glyph of the original image can lead to wrong predictions. To cope with this, our method uses a filtering strategy to select the best hypotheses for each glyph, and these are combined to produce candidate queries. The best query is selected according to its confidence rate and its edit distance to the word. A similarity search over the best query then suggests a correction. Experiments show that the method considerably improves prediction accuracy for Portuguese word correction.
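The pipeline described above (filter the hypotheses per glyph, build candidate queries, pick a best query, run a similarity search over the lexicon) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the abstract does not give their filtering rule or scoring function, so everything concrete here is an assumption. The names `filter_hypotheses`, `candidate_queries` and `correct`, the `top_k` and `min_conf` thresholds, the use of a confidence product, the choice to rank queries by Levenshtein distance to the nearest lexicon entry (ties broken by confidence), and the toy glyph confidences and lexicon are all hypothetical.

```python
import math
import unicodedata
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    # Normalize so that 'ç' is one code point regardless of how it was typed.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def filter_hypotheses(glyph, top_k=2, min_conf=0.10):
    """Keep up to top_k hypotheses above min_conf (assumed filtering rule)."""
    ranked = sorted(glyph, key=lambda h: h[1], reverse=True)
    kept = [h for h in ranked[:top_k] if h[1] >= min_conf]
    return kept or ranked[:1]  # always keep at least the top hypothesis


def candidate_queries(glyphs):
    """Combine the filtered hypotheses into (text, confidence) candidate queries."""
    filtered = [filter_hypotheses(g) for g in glyphs]
    return [
        ("".join(ch for ch, _ in combo), math.prod(c for _, c in combo))
        for combo in product(*filtered)
    ]


def correct(glyphs, lexicon):
    """Pick the best query, then return its nearest lexicon word.

    Assumed scoring: smallest distance to the lexicon first, then highest
    confidence. The paper's actual criterion may differ.
    """
    def nearest(text):
        return min((edit_distance(text, w), w) for w in lexicon)

    best_text, _ = min(
        candidate_queries(glyphs),
        key=lambda q: (nearest(q[0])[0], -q[1]),
    )
    return nearest(best_text)[1]


# Toy classifier output for the Portuguese word "ação": the diacritics
# receive lower confidence than their plain counterparts (made-up numbers).
GLYPHS = [
    [("a", 0.90)],
    [("c", 0.60), ("ç", 0.35)],
    [("a", 0.55), ("ã", 0.40)],
    [("o", 0.95)],
]
LEXICON = ["ação", "acho", "caso", "aço"]

# Baseline: keep only the top hypothesis per glyph, which yields "acao".
top1 = "".join(max(g, key=lambda h: h[1])[0] for g in GLYPHS)
baseline = min(LEXICON, key=lambda w: edit_distance(top1, w))
corrected = correct(GLYPHS, LEXICON)
```

In this toy setting the top-hypothesis string "acao" is closest to "acho", while keeping the second hypothesis for the two ambiguous glyphs lets the query "ação" match the lexicon exactly. Note that the number of candidate queries grows exponentially with the number of ambiguous glyphs, which is one reason a filtering step is needed before combining hypotheses.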
CITATION STYLE
Mergen, S. L. S., & de Abreu Schmidt, L. (2018). The Other C: Correcting OCR Words in the Presence of Diacritical Marks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11122 LNAI, pp. 222–230). Springer Verlag. https://doi.org/10.1007/978-3-319-99722-3_23