Learning string distance with smoothing for OCR spelling correction



Abstract

Large databases of scanned documents (medical records, legal texts, historical documents) require natural language processing for retrieval and structured information extraction. Errors introduced by the optical character recognition (OCR) system increase the ambiguity of the recognized text and degrade the performance of natural language processing. The paper proposes an OCR post-correction system with a parametrized string distance metric. The correction system learns specific error patterns from incorrect words and common sequences of correct words. A smoothing technique is proposed to assign non-zero probability to edit operations not present in the training corpus. Spelling correction accuracy is measured on a database of OCR-processed legal documents in English. The language model and the learned string metric with smoothing improve the Viterbi-based search for the best sequence of corrections and increase the performance of the spelling correction system.
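The idea of a learned string distance with smoothing can be illustrated with a minimal sketch. It is not the method from the paper: the abstract does not specify the smoothing technique, so plain additive (Laplace) smoothing is used here as a stand-in, and the alphabet, the counts, and the names `smoothed_costs` and `weighted_distance` are assumptions made for illustration only. Each edit operation gets a cost equal to the negative log of its smoothed probability, so operations never seen in training remain possible but expensive.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase ASCII words only
EPS = ""  # empty symbol: (a, EPS) is a deletion, (EPS, b) is an insertion


def smoothed_costs(op_counts, alpha=1.0):
    """Turn edit-operation counts into costs (-log probability).

    Additive smoothing (an assumption, not the paper's technique) gives
    every possible operation a non-zero probability, seen or not.
    """
    symbols = list(ALPHABET) + [EPS]
    all_ops = [(a, b) for a in symbols for b in symbols if (a, b) != (EPS, EPS)]
    total = sum(op_counts.get(op, 0) for op in all_ops) + alpha * len(all_ops)
    return {op: -math.log((op_counts.get(op, 0) + alpha) / total) for op in all_ops}


def weighted_distance(source, target, cost):
    """Weighted edit distance computed by dynamic programming."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost[(source[i - 1], EPS)]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost[(EPS, target[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + cost[(source[i - 1], EPS)],  # delete
                d[i][j - 1] + cost[(EPS, target[j - 1])],  # insert
                d[i - 1][j - 1] + cost[(source[i - 1], target[j - 1])],  # match or substitute
            )
    return d[n][m]


# Hypothetical training counts: characters are usually read correctly,
# and the recognizer sometimes confuses "c" with "e".
counts = Counter({(ch, ch): 100 for ch in ALPHABET})
counts[("c", "e")] = 20
costs = smoothed_costs(counts)
```

With these made-up counts, `weighted_distance("cat", "eat", costs)` is smaller than `weighted_distance("cat", "xat", costs)`, because the c-to-e confusion was observed while c-to-x was not. The unseen operation still has a finite cost, so a search over candidate corrections never rules a candidate out entirely.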

Hládek, D., Staš, J., Ondáš, S., Juhár, J., & Kovács, L. (2017). Learning string distance with smoothing for OCR spelling correction. Multimedia Tools and Applications, 76(22), 24549–24567. https://doi.org/10.1007/s11042-016-4185-5
