A generative probabilistic OCR model for NLP applications

Abstract

In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model’s ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
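The noisy channel idea mentioned in the abstract can be illustrated with a minimal sketch: choose the candidate true word W that maximizes P(W) · P(O | W) for an observed OCR token O. This is not the authors' finite-state implementation. The lexicon, the probabilities, and the names `WORD_PRIOR`, `CONFUSION`, `channel_log_prob`, and `correct` are all hypothetical, and the channel here handles only same-length character substitutions, whereas a full model would also need to cover segmentation errors, insertions, and deletions.

```python
import math

# Hypothetical toy source model: unigram probabilities P(W) over a tiny lexicon.
WORD_PRIOR = {"modern": 0.6, "modem": 0.3, "model": 0.1}

# Hypothetical channel model: P(observed char | true char) for known confusions.
CONFUSION = {("l", "1"): 0.2, ("o", "0"): 0.2}
MATCH_PROB = 0.9          # assumed probability that a character is read correctly
DEFAULT_SUB_PROB = 0.001  # assumed probability of any unlisted substitution


def channel_log_prob(observed: str, true_word: str) -> float:
    """Return log P(observed | true_word), allowing substitutions only."""
    if len(observed) != len(true_word):
        return float("-inf")  # this toy channel cannot insert or delete
    total = 0.0
    for true_char, obs_char in zip(true_word, observed):
        if true_char == obs_char:
            prob = MATCH_PROB
        else:
            prob = CONFUSION.get((true_char, obs_char), DEFAULT_SUB_PROB)
        total += math.log(prob)
    return total


def correct(observed: str) -> str:
    """Pick the lexicon word maximizing log P(W) + log P(observed | W)."""
    return max(
        WORD_PRIOR,
        key=lambda w: math.log(WORD_PRIOR[w]) + channel_log_prob(observed, w),
    )


if __name__ == "__main__":
    for token in ["mode1", "m0del", "modem"]:
        print(token, "->", correct(token))
```

With these made-up numbers, a token such as `mode1` is mapped to `model` even though `modem` has the higher prior, because the channel considers an l-to-1 confusion far more likely than an m-to-1 confusion.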

Cite

Kolak, O., Byrne, W., & Resnik, P. (2003). A generative probabilistic OCR model for NLP applications. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003. Association for Computational Linguistics (ACL). https://doi.org/10.3115/1073445.1073463
