Comparative Study of Models Trained on Synthetic Data for Ukrainian Grammatical Error Correction

Citations: 1
Readers (Mendeley): 14

Abstract

The task of Grammatical Error Correction (GEC) has been extensively studied for the English language. However, its application to low-resource languages, such as Ukrainian, remains an open challenge. In this paper, we develop sequence tagging and neural machine translation models for the Ukrainian language, as well as a set of algorithmic correction rules to augment those systems. We also develop synthetic data generation techniques for Ukrainian that produce high-quality, human-like errors. Finally, we determine the best combination of synthetically generated data to augment the existing UA-GEC corpus and achieve a state-of-the-art F0.5 score of 0.663 on the newly established UA-GEC benchmark. The code and trained models will be made publicly available on GitHub and HuggingFace.
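
For context, the F0.5 metric reported above weights precision twice as heavily as recall, the standard evaluation choice for GEC. A minimal sketch of its computation in Python (the precision and recall values in the example are placeholders, not figures from the paper):

# F-beta score with beta = 0.5: favors precision over recall,
# the conventional metric for grammatical error correction.
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example with placeholder values (not the paper's actual precision/recall):
print(round(f_beta(0.65, 0.50), 3))  # 0.613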

Citation (APA)

Bondarenko, M., Yushko, A., Shportko, A., & Fedorych, A. (2023). Comparative Study of Models Trained on Synthetic Data for Ukrainian Grammatical Error Correction. In EACL 2023 - 2nd Ukrainian Natural Language Processing Workshop, UNLP 2023 - Proceedings of the Workshop (pp. 103–113). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.unlp-1.13
