Post OCR correction of Swedish patent text: The difference between reading tongue ‘lästunga’ and security tab ‘låstunga’

Linda Andersson; Helena Rastas; Andreas Rauber

Journal Article

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014) 8849 1-9

DOI: 10.1007/978-3-319-12979-2_1

Abstract

This paper compares two basic post-processing algorithms for correcting optical character recognition (OCR) errors in Swedish text. The first, the lexical filter, is based on language knowledge and manual correction; the second, the generic filter, is a generic algorithm that uses only limited language knowledge to perform corrections. Both methods aim to improve the quality of OCR-generated Swedish patent text. Tests are conducted on 7,721 randomly selected patent claims generated by different OCR software tools. The OCR-generated texts and the texts automatically corrected by the lexical or generic filter are compared with a manually corrected ground truth. The preliminary results indicate that the OCR tools are biased toward different characters when generating text, and that the language knowledge available to the post-correction step influences the final results.
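The abstract does not say which measure is used to compare OCR output with the ground truth. As an illustration only, the sketch below assumes a character error rate based on Levenshtein edit distance, a common choice for this kind of evaluation and not necessarily the one used in the paper. The function names `levenshtein` and `character_error_rate` are hypothetical. The example uses the word pair from the title, where a single misread character changes the meaning.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    # Normalise so that 'å' and 'ä' are single code points in both strings.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, ground_truth: str) -> float:
    """Edit distance divided by the length of the ground truth (assumed metric)."""
    truth = unicodedata.normalize("NFC", ground_truth)
    if not truth:
        raise ValueError("ground truth must be non-empty")
    return levenshtein(hypothesis, truth) / len(truth)


# Word pair from the title: the OCR tool reads 'ä' where the source has 'å'.
ocr_output = "lästunga"    # 'reading tongue'
ground_truth = "låstunga"  # 'security tab'
print(levenshtein(ocr_output, ground_truth))           # 1
print(character_error_rate(ocr_output, ground_truth))  # 0.125
```

Under this assumed metric, a post-correction filter would count as an improvement if the corrected text has a lower error rate against the ground truth than the raw OCR output.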

Cite

Andersson, L., Rastas, H., & Rauber, A. (2014). Post OCR correction of Swedish patent text: The difference between reading tongue ‘lästunga’ and security tab ‘låstunga’. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8849, 1–9. https://doi.org/10.1007/978-3-319-12979-2_1
