Survey of Post-OCR Processing Approaches

Abstract

Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. While OCR engines perform well on modern text, their accuracy drops significantly on historical materials. Additionally, many texts have already been processed with various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying how that quality affects information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions in this field.

Cite

Nguyen, T. T. H., Jatowt, A., Coustaty, M., & Doucet, A. (2022, July 31). Survey of Post-OCR Processing Approaches. ACM Computing Surveys. Association for Computing Machinery. https://doi.org/10.1145/3453476
