From the paft to the fiiture: A fully automatic NMT and word embeddings method for OCR post-correction

Abstract

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used during digitization. Correcting these errors manually is time-consuming, and most automatic approaches have relied on rules or supervised machine learning. We present a fully automatic, unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to perform OCR error correction.
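The abstract does not describe how the parallel data is extracted, so the following is only a minimal sketch of one plausible approach, not the authors' implementation. It assumes that a word embedding model has already produced, for each known-correct word, a list of nearest neighbours (the hypothetical `NEIGHBOURS` mapping below stands in for that step), and that a lexicon of valid word forms is available (the hypothetical `LEXICON`). Neighbours that are absent from the lexicon but within a small edit distance of the correct word are treated as likely OCR errors and paired with it. The function names, the `max_distance` threshold, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the usual two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def extract_pairs(neighbours: Dict[str, List[str]],
                  lexicon: Set[str],
                  max_distance: int = 2) -> List[Tuple[str, str]]:
    """Pair likely OCR errors with the correct word they resemble.

    `neighbours` is assumed to come from an embedding model; this function
    only filters it. `max_distance` is an illustrative threshold.
    """
    pairs = []
    for correct, candidates in neighbours.items():
        if correct not in lexicon:
            continue  # only trust words the lexicon confirms
        for cand in candidates:
            if cand in lexicon:
                continue  # a valid word is a synonym, not an OCR error
            if 0 < edit_distance(cand, correct) <= max_distance:
                pairs.append((cand, correct))
    return pairs


def to_char_level(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Space-separate characters so a sequence model treats each as a token."""
    return [(" ".join(src), " ".join(tgt)) for src, tgt in pairs]


# Toy data standing in for embedding neighbours and a lexicon (assumptions).
NEIGHBOURS = {
    "past": ["paft", "history"],
    "future": ["fiiture", "tomorrow"],
}
LEXICON = {"past", "future", "history", "tomorrow"}

PAIRS = extract_pairs(NEIGHBOURS, LEXICON)
CHAR_PAIRS = to_char_level(PAIRS)
```

Here `CHAR_PAIRS` holds source/target strings in the space-separated character format that character-based sequence-to-sequence toolkits commonly accept as training input. A real pipeline would need a trained embedding model and a much larger lexicon in place of the toy data.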

Cite

Hämäläinen, M., & Hengchen, S. (2019). From the paft to the fiiture: A fully automatic NMT and word embeddings method for OCR post-correction. In International Conference Recent Advances in Natural Language Processing, RANLP (Vol. 2019-September, pp. 431–436). Incoma Ltd. https://doi.org/10.26615/978-954-452-056-4_051
