Improving Low-Resource Languages in Pre-Trained Multilingual Language Models


Abstract

Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well-supported by these models, leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both tasks. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.
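
The sketch below is only an illustration of the bootstrapping idea mentioned in the abstract, not the authors' actual pipeline: it mines candidate word translation pairs from two sets of monolingual word vectors by taking mutual nearest neighbours under cosine similarity, a common heuristic for building a bilingual lexicon without parallel data. The vocab sizes, embeddings, and function names here are assumptions for illustration; the paper's real method and alignment objective are described in the EMNLP 2022 publication itself.

```python
# Minimal sketch (assumed, not the paper's implementation): bootstrap word
# translation pairs from monolingual word representations of two languages
# via mutual nearest neighbours under cosine similarity.
import numpy as np


def mutual_nearest_neighbours(src_emb: np.ndarray, tgt_emb: np.ndarray):
    """Return (src_index, tgt_index) pairs whose vectors are each other's
    nearest neighbour -- candidate translation pairs mined without any
    parallel data."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                      # cosine similarity matrix
    fwd = sim.argmax(axis=1)               # best target word for each source word
    bwd = sim.argmax(axis=0)               # best source word for each target word
    return [(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for monolingual word representations of two languages.
    src_vectors = rng.normal(size=(100, 64))
    tgt_vectors = rng.normal(size=(100, 64))
    pairs = mutual_nearest_neighbours(src_vectors, tgt_vectors)
    print(f"bootstrapped {len(pairs)} candidate translation pairs")
```

In the approach described by the abstract, pairs mined this way would then be used as supervision to improve the alignment between languages inside the pre-trained multilingual model; that alignment step is not shown here.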

Citation (APA)

Hangya, V., Saadi, H. S., & Fraser, A. (2022). Improving Low-Resource Languages in Pre-Trained Multilingual Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 11993–12006). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.822
