Language modelling for the needs of OCR of medical texts

Maciej Piasecki; Grzegorz Godlewski

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2006) 4345 LNBI 273-284

DOI: 10.1007/11946465_25

Abstract

This paper discusses different methods of constructing language models for a corpus of medical texts written in an inflective language, namely Polish. The main result is a proposed method of language modelling that sequentially combines tri-grams of morphological base forms with tri-grams of words. Introducing base-form tri-grams improved the overall performance of the combined model, measured as the improvement in the accuracy of handwriting OCR, as well as its ability to generalise. The latter was shown by using corpora of two different types as the training and test sets. Detailed results of tests run on a large corpus of real-life medical language are discussed. An experimental system for OCR of handwritten epicrises that uses the proposed model is presented. The proposed language model decreases the overall error of the system by 64.2% (51% when the training and test corpora are of different types). © Springer-Verlag Berlin Heidelberg 2006.
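The abstract does not spell out how the two tri-gram models are chained, so the following is only a minimal sketch of one possible reading: a base-form tri-gram model first shortlists OCR candidate sentences, and a word tri-gram model then picks among the shortlist. Everything named here is an assumption made for illustration and is not taken from the paper: the toy lemma table `LEMMA`, the add-one smoothing, the `keep` cut-off, and the helpers `TrigramModel`, `lemmatize` and `rescore`.

```python
from collections import Counter
from math import log

# Hypothetical toy lemma table (word form -> base form); a real system
# would use a morphological analyser for Polish instead.
LEMMA = {
    "pacjenta": "pacjent", "pacjent": "pacjent",
    "przyjęto": "przyjąć", "przebywa": "przebywać",
    "szpitala": "szpital", "szpitalu": "szpital",
}

def lemmatize(tokens):
    """Map each word form to its base form; unknown forms are kept as-is."""
    return [LEMMA.get(t, t) for t in tokens]

def trigrams(tokens):
    """Return the tri-grams of a sentence padded with boundary symbols."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

class TrigramModel:
    """Tri-gram model with add-one smoothing (the smoothing is an assumption)."""

    def __init__(self, sentences):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.bi = Counter()    # counts of the (w1, w2) histories
        self.vocab = {"</s>"}
        for sentence in sentences:
            self.vocab.update(sentence)
            for t in trigrams(sentence):
                self.tri[t] += 1
                self.bi[t[:2]] += 1

    def logprob(self, tokens):
        """Smoothed log-probability of a token sequence."""
        v = len(self.vocab)
        return sum(log((self.tri[t] + 1) / (self.bi[t[:2]] + v))
                   for t in trigrams(tokens))

def rescore(candidates, base_model, word_model, keep=2):
    """Stage 1: shortlist by base-form score. Stage 2: choose by word score."""
    shortlist = sorted(candidates,
                       key=lambda c: base_model.logprob(lemmatize(c)),
                       reverse=True)[:keep]
    return max(shortlist, key=word_model.logprob)

# Toy training corpus and OCR candidates (illustrative data only).
corpus = [
    ["pacjenta", "przyjęto", "do", "szpitala"],
    ["pacjent", "przebywa", "w", "szpitalu"],
]
word_model = TrigramModel(corpus)
base_model = TrigramModel([lemmatize(s) for s in corpus])

candidates = [
    ["pacjenta", "przyjęto", "da", "szpitala"],   # misread "do"
    ["pacjenta", "przyjęto", "do", "szpitalu"],   # wrong inflection
    ["pacjenta", "przyjęto", "do", "szpitala"],   # intended reading
]
best = rescore(candidates, base_model, word_model)
```

In this toy setup the base-form model cannot tell the two inflected variants apart, because both map to the same sequence of base forms, while it penalises the misread candidate. That is the sense in which base forms may help generalisation over inflected word forms; the word-level model is then needed to settle the inflection.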

Cite

Piasecki, M., & Godlewski, G. (2006). Language modelling for the needs of OCR of medical texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4345 LNBI, pp. 273–284). Springer Verlag. https://doi.org/10.1007/11946465_25
