This paper compares two basic post-processing algorithms for correcting optical character recognition (OCR) errors in Swedish text. The first, a lexical filter, relies on language knowledge and manual correction; the second, a generic filter, uses a generic algorithm with limited language knowledge to perform corrections. Both methods aim to improve the quality of OCR-generated Swedish patent text. Tests are conducted on 7,721 randomly selected patent claims generated by different OCR software tools. The OCR-generated texts and the automatically corrected texts (produced by the lexical or the generic filter) are compared with manually corrected ground truth. The preliminary results indicate that the OCR tools are biased toward different characters when generating text, and that the language knowledge available to the post-correction step influences the final results.
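The comparison against ground truth can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `char_confusions` and the sample strings are assumptions made for illustration, and the alignment uses Python's standard `difflib.SequenceMatcher` rather than any method described in the paper. The idea is to align an OCR output with its manually corrected counterpart and count which ground-truth characters were replaced by which OCR characters, which is one way a character bias of an OCR tool could be made visible.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(ocr_text: str, truth_text: str) -> Counter:
    """Count (truth_char, ocr_char) substitutions between two texts.

    Hypothetical helper for illustration only. Only equal-length
    'replace' spans are counted, so insertions and deletions are ignored.
    """
    confusions = Counter()
    # autojunk=False keeps frequent characters from being treated as junk.
    matcher = SequenceMatcher(None, truth_text, ocr_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for truth_char, ocr_char in zip(truth_text[i1:i2], ocr_text[j1:j2]):
                confusions[(truth_char, ocr_char)] += 1
    return confusions


# Assumed example: the ground truth reads 'låstunga' (security tab),
# while the OCR tool produced 'lästunga' (reading tongue).
print(char_confusions("lästunga", "låstunga"))
```

Summing such counters over all claims for each OCR tool would give a per-tool confusion table. A lexical filter could then target the most frequent confusions, although whether the paper's filters work this way is not stated in the abstract.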
Andersson, L., Rastas, H., & Rauber, A. (2014). Post OCR correction of Swedish patent text: The difference between reading tongue ‘lästunga’ and security tab ‘låstunga.’ Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8849, 1–9. https://doi.org/10.1007/978-3-319-12979-2_1