Abstract
Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages, a common feature of 16th-century texts. Additionally, many of these documents precede consistent orthographic conventions, making the task even harder. We extend the state-of-the-art historical OCR model of Berg-Kirkpatrick et al. (2013) to handle word-level code-switching between multiple languages. Further, we enable our system to handle spelling variability, including now-obsolete shorthand systems used by printers. Our results show average relative character error reductions of 14% across a variety of historical texts.
Cite
Garrette, D., Alpert-Abrams, H., Berg-Kirkpatrick, T., & Klein, D. (2015). Unsupervised code-switching for multilingual historical document transcription. In NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 1036–1041). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/n15-1109