Combining trigram and winnow in Thai OCR error correction

Surapant Meknavin; Boonserm Kijsirikul; Ananlada Chotimongkol; Cholwich Nuttee

Conference Proceedings (Open Access)

Proceedings of the Annual Meeting of the Association for Computational Linguistics (1998), Vol. 2, pp. 836-842

DOI: 10.3115/980691.980707

7 Citations

76 Readers

Abstract

For languages without explicit word boundaries, such as Thai, Chinese, and Japanese, correcting words in text is harder than in English because locating the erroneous words introduces additional ambiguity. The traditional method handles this by hypothesizing that every substring of the input sentence could be an erroneous word and trying to correct all of them. In this paper, we propose reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. The boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate candidate correction words, we use a modified edit distance that reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.
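
As a rough illustration of two building blocks named in the abstract, the sketch below shows (1) an edit distance whose substitution cost can be lowered for character pairs that an OCR engine tends to confuse, and (2) the textbook Winnow learner with multiplicative weight updates. The abstract does not specify how the authors modified the edit distance or how they configured Winnow, so the function and class names, the cost values, the confusable pair, and the threshold choice are all assumptions made for illustration and are not the paper's method.

```python
from typing import Dict, Optional, Set, Tuple


def weighted_edit_distance(
    source: str,
    target: str,
    sub_cost: Optional[Dict[Tuple[str, str], float]] = None,
    default_sub: float = 1.0,
    indel: float = 1.0,
) -> float:
    """Levenshtein-style distance with per-pair substitution costs.

    sub_cost maps (source_char, target_char) to a cost; pairs that are
    visually confusable can be given a cost below default_sub.
    (Hypothetical cost scheme, not the one used in the paper.)
    """
    sub_cost = sub_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel          # delete all of source[:i]
    for j in range(1, cols):
        dist[0][j] = j * indel          # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            cost = 0.0 if a == b else sub_cost.get((a, b), default_sub)
            dist[i][j] = min(
                dist[i - 1][j] + indel,      # deletion
                dist[i][j - 1] + indel,      # insertion
                dist[i - 1][j - 1] + cost,   # match or substitution
            )
    return dist[-1][-1]


class Winnow:
    """Textbook Winnow over binary features (set of active indices)."""

    def __init__(self, n_features: int, alpha: float = 2.0,
                 threshold: Optional[float] = None) -> None:
        self.weights = [1.0] * n_features
        self.alpha = alpha
        # Using the feature count as threshold is a common default.
        self.threshold = float(n_features) if threshold is None else threshold

    def predict(self, active: Set[int]) -> bool:
        return sum(self.weights[i] for i in active) > self.threshold

    def update(self, active: Set[int], label: bool) -> None:
        """Mistake-driven update: promote or demote active features only."""
        if self.predict(active) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for i in active:
            self.weights[i] *= factor


# Hypothetical confusable pair; the cost 0.25 is an arbitrary example value.
confusable = {("ค", "ด"): 0.25}
score = weighted_edit_distance("คน", "ดน", confusable)
```

In a pipeline like the one the abstract outlines, a distance of this kind would rank candidate corrections for a dubious area, and a learner such as Winnow would supply context-based evidence to weigh alongside the trigram score. How those scores are actually combined is not stated in the abstract and is left open here.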

Cite

Meknavin, S., Kijsirikul, B., Chotimongkol, A., & Nuttee, C. (1998). Combining trigram and winnow in Thai OCR error correction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 2, pp. 836–842). Association for Computational Linguistics (ACL). https://doi.org/10.3115/980691.980707
