With the prominence of large pretrained language models, low-resource languages are rarely modelled monolingually and become victims of the “curse of multilinguality” in massively multilingual models. Recently, AfriBERTa showed that training transformer models from scratch on 1GB of data from many unrelated African languages outperforms massively multilingual models on downstream NLP tasks. Here we extend this direction, focusing on the use of related languages. We hypothesize that training on a smaller amount of data drawn from related languages can match the performance of models trained on larger amounts of unrelated data. We test this hypothesis on the Niger-Congo family and its Bantu and Volta-Niger sub-families, pretraining models with data solely from Niger-Congo languages and fine-tuning them on four downstream tasks: named entity recognition (NER), part-of-speech tagging, sentiment analysis, and text classification. We find that models trained on genetically related languages achieve equal downstream performance on low-resource languages despite using less training data. We recommend selecting pretraining data based on language relatedness when building language models for low-resource languages.
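As an illustrative sketch of the data-selection idea summarized above (not code from the paper), the snippet below filters a multilingual corpus down to languages that share a (sub-)family with a target language before pretraining. The family table, ISO 639-3 codes, record format, and function name are hypothetical placeholders chosen for the example.

```python
# Hedged sketch: selecting pretraining data by genetic relatedness.
# The family membership table and corpus format below are illustrative
# assumptions, not artifacts released with the paper.

from typing import Dict, Iterable, List

# Hypothetical (sub-)family membership table, keyed by family name,
# with ISO 639-3 language codes as members.
LANGUAGE_FAMILIES: Dict[str, List[str]] = {
    "bantu": ["swh", "kin", "zul", "xho", "nya"],   # Swahili, Kinyarwanda, Zulu, Xhosa, Chichewa
    "volta-niger": ["yor", "ibo", "fon"],           # Yoruba, Igbo, Fon
}

def select_related_corpus(
    corpus: Iterable[Dict[str, str]],               # each record: {"lang": iso_code, "text": ...}
    target_lang: str,
    families: Dict[str, List[str]] = LANGUAGE_FAMILIES,
) -> List[str]:
    """Keep only text whose language shares a (sub-)family with the target language."""
    related = {
        code
        for members in families.values()
        if target_lang in members
        for code in members
    }
    return [record["text"] for record in corpus if record["lang"] in related]

# Usage example: build a Bantu-only pretraining set for Swahili ("swh").
toy_corpus = [
    {"lang": "swh", "text": "Habari za asubuhi."},
    {"lang": "zul", "text": "Sawubona."},
    {"lang": "eng", "text": "Good morning."},       # unrelated language; dropped
]
print(select_related_corpus(toy_corpus, "swh"))     # texts from Bantu languages only
```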
Ògúnrèmí, T., Jurafsky, D., & Manning, C. D. (2023). Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection. In Findings of the Association for Computational Linguistics: EACL 2023 (pp. 1221–1236). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-eacl.93