Abstract
Highly accurate machine translation systems are important in societies and countries where multilingualism is common and English often does not suffice. The Indian subcontinent (South Asia) is such a region, with all of its Indic languages currently under-represented in the NLP ecosystem. It is essential to thoroughly explore techniques for improving translation performance on these low-resource languages, at least using openly available data, which is itself little explored in the Indic ecosystem. In this work, we study how to improve performance for very-low-resource South Asian languages, especially those of countries other than India. Specifically, we propose how unified models can be built to exploit data from comparatively resource-rich languages of the same region. We propose strategies for unifying different, previously unexplored types of scripts, especially Perso-Arabic and Indic scripts, so that multilingual models can be built for all South Asian languages despite the script barrier. We also study how augmentation techniques such as back-translation can be used to build unified models from openly available raw data alone, to understand what level of improvement can be expected for these Indic languages.
Cite
Gokul, N. C. (2022). Unified NMT models for the Indian subcontinent, transcending script-barriers. In DeepLo 2022 - 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, Proceedings of the DeepLo Workshop (pp. 227–236). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.deeplo-1.23