Whitespace errors are common in digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized collection of newspapers from the nineteenth-century United States. Our technique identifies and corrects most whitespace errors while introducing minimal oversegmentation: it achieves 77% recall at a false positive rate of less than 1%, and 91% recall at a false positive rate of less than 3%.
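The core decision, as the abstract describes it, is a likelihood ratio test between two readings of a token: left as-is, or split into two words, scored under an interpolated n-gram language model. The sketch below illustrates this idea with a bigram/unigram interpolation and hypothetical toy counts standing in for Google n-gram statistics; the function names, counts, and interpolation weight are illustrative assumptions, not the paper's implementation.

```python
from math import log

# Hypothetical toy counts standing in for Google n-gram statistics
# (illustration only; the paper uses real Google n-gram counts and
# a trigram/bigram interpolation).
UNIGRAM = {"the": 500, "river": 40, "bank": 60, "riverbank": 2, "of": 300}
BIGRAM = {("the", "river"): 20, ("river", "bank"): 15, ("the", "riverbank"): 1}
TOTAL = sum(UNIGRAM.values())

LAMBDA = 0.7  # interpolation weight on the bigram estimate (assumed value)

def p_unigram(w):
    # Add-one smoothing so unseen words get nonzero probability.
    return (UNIGRAM.get(w, 0) + 1) / (TOTAL + len(UNIGRAM) + 1)

def p_bigram(w, prev):
    count_prev = UNIGRAM.get(prev, 0)
    if count_prev == 0:
        return p_unigram(w)
    return BIGRAM.get((prev, w), 0) / count_prev

def p_interp(w, prev):
    # Interpolated conditional probability: a bigram/unigram mix here,
    # analogous to the paper's trigram/bigram interpolation.
    return LAMBDA * p_bigram(w, prev) + (1 - LAMBDA) * p_unigram(w)

def log_likelihood(tokens):
    # Sum of interpolated log-probabilities over the token sequence.
    ll, prev = 0.0, None
    for w in tokens:
        ll += log(p_interp(w, prev) if prev else p_unigram(w))
        prev = w
    return ll

def should_split(prev, token, left, right, threshold=0.0):
    """Likelihood ratio test: split `token` into `left right` when the
    split reading is sufficiently more probable than the merged one."""
    merged = log_likelihood([prev, token])
    split = log_likelihood([prev, left, right])
    return (split - merged) > threshold

# Under these toy counts, "the riverbank" is better read as "the river bank".
print(should_split("the", "riverbank", "river", "bank"))  # → True
```

Raising the `threshold` trades recall for a lower false positive rate, which is how the abstract's operating points (77% recall at <1% FPR, 91% at <3% FPR) would be selected on a tuning set.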
Soni, S., Klein, L. F., & Eisenstein, J. (2019). Correcting whitespace errors in digitized historical texts. In LaTeCH@NAACL-HLT 2019 - 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Proceedings (pp. 98–103). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-2513