Improved typesetting models for historical OCR

Taylor Berg-Kirkpatrick; Dan Klein

Conference Proceedings, Open Access

52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (2014), Vol. 2, pp. 118–123

DOI: 10.3115/v1/p14-2020

Citations: 15

Readers: 103

Abstract

We present richer typesetting models that extend the unsupervised historical document recognition system of Berg-Kirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to accurately track italic and non-italic portions of documents. Richer models complicate inference, so we present a new, streamlined procedure that is over 25× faster than the method used by Berg-Kirkpatrick et al. (2013). Our final system achieves a relative word error reduction of 22% compared to state-of-the-art results on a dataset of historical newspapers. © 2014 Association for Computational Linguistics.

Citation (APA)

Berg-Kirkpatrick, T., & Klein, D. (2014). Improved typesetting models for historical OCR. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (Vol. 2, pp. 118–123). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/p14-2020
