Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well-supported by these models, leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both tasks. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.
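The abstract only outlines the pipeline, so the following is a minimal, hypothetical sketch of the bootstrapping step it mentions: extracting word translation pairs from two monolingual embedding spaces via mutual nearest-neighbour retrieval. The toy vocabularies, embeddings, and variable names are illustrative assumptions, not the authors' implementation; in practice the embeddings would be trained on the monolingual corpora and mapped into a shared space before this step.

```python
# Illustrative sketch (not the paper's exact procedure): bootstrap word
# translation pairs from two monolingual embedding spaces using mutual
# nearest-neighbour retrieval over cosine similarities.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabularies and (already shared-space) embeddings.
src_vocab = ["haus", "wasser", "buch", "stadt"]
tgt_vocab = ["house", "water", "book", "city"]
dim = 32
src_emb = rng.normal(size=(len(src_vocab), dim))
# Toy target space: a noisy copy, so true translations are near neighbours.
tgt_emb = src_emb + 0.1 * rng.normal(size=(len(tgt_vocab), dim))

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

src_n, tgt_n = normalize(src_emb), normalize(tgt_emb)
sim = src_n @ tgt_n.T  # cosine similarity matrix, shape (|src|, |tgt|)

# Keep only mutual nearest neighbours as bootstrapped translation pairs.
src2tgt = sim.argmax(axis=1)
tgt2src = sim.argmax(axis=0)
pairs = [
    (src_vocab[i], tgt_vocab[j])
    for i, j in enumerate(src2tgt)
    if tgt2src[j] == i
]
print(pairs)  # e.g. [('haus', 'house'), ('wasser', 'water'), ...]
```

Such bootstrapped pairs could then serve as weak supervision for an alignment objective, for example by encouraging the multilingual model's representations of paired words to be close during continued training; the specific alignment method used by the authors is described in the paper itself.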
CITATION STYLE
Hangya, V., Saadi, H. S., & Fraser, A. (2022). Improving Low-Resource Languages in Pre-Trained Multilingual Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 11993–12006). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.822