Whitespace errors are common in digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized collection of newspapers from the nineteenth-century United States. Our technique identifies and corrects most whitespace errors while introducing minimal oversegmentation: it achieves 77% recall at a false positive rate of less than 1%, and 91% recall at a false positive rate of less than 3%.
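The core decision, as the abstract describes it, is a likelihood ratio test between two readings of a token: left as-is, or split into two words, scored under an interpolated n-gram language model. The sketch below illustrates this idea with a bigram/unigram interpolation and hypothetical toy counts standing in for Google n-gram statistics; the function names, counts, and interpolation weight are illustrative assumptions, not the paper's implementation.

```python
from math import log

# Hypothetical toy counts standing in for Google n-gram statistics
# (illustration only; the paper uses real Google n-gram counts and
# a trigram/bigram interpolation).
UNIGRAM = {"the": 500, "river": 40, "bank": 60, "riverbank": 2, "of": 300}
BIGRAM = {("the", "river"): 20, ("river", "bank"): 15, ("the", "riverbank"): 1}
TOTAL = sum(UNIGRAM.values())

LAMBDA = 0.7  # interpolation weight on the bigram estimate (assumed value)

def p_unigram(w):
    # Add-one smoothing so unseen words get nonzero probability.
    return (UNIGRAM.get(w, 0) + 1) / (TOTAL + len(UNIGRAM) + 1)

def p_bigram(w, prev):
    count_prev = UNIGRAM.get(prev, 0)
    if count_prev == 0:
        return p_unigram(w)
    return BIGRAM.get((prev, w), 0) / count_prev

def p_interp(w, prev):
    # Interpolated conditional probability: a bigram/unigram mix here,
    # analogous to the paper's trigram/bigram interpolation.
    return LAMBDA * p_bigram(w, prev) + (1 - LAMBDA) * p_unigram(w)

def log_likelihood(tokens):
    # Sum of interpolated log-probabilities over the token sequence.
    ll, prev = 0.0, None
    for w in tokens:
        ll += log(p_interp(w, prev) if prev else p_unigram(w))
        prev = w
    return ll

def should_split(prev, token, left, right, threshold=0.0):
    """Likelihood ratio test: split `token` into `left right` when the
    split reading is sufficiently more probable than the merged one."""
    merged = log_likelihood([prev, token])
    split = log_likelihood([prev, left, right])
    return (split - merged) > threshold

# Under these toy counts, "the riverbank" is better read as "the river bank".
print(should_split("the", "riverbank", "river", "bank"))  # → True
```

Raising the `threshold` trades recall for a lower false positive rate, which is how the abstract's operating points (77% recall at <1% FPR, 91% at <3% FPR) would be selected on a tuning set.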
Soni, S., Klein, L. F., & Eisenstein, J. (2019). Correcting whitespace errors in digitized historical texts. In LaTeCH@NAACL-HLT 2019 - 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Proceedings (pp. 98–103). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-2513