Abstract
This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores 1.64 points higher and chrF scores 0.0749 points higher than the baseline.
Zheng, F., Reid, M., Marrese-Taylor, E., & Matsuo, Y. (2021). Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining. In Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021 (pp. 234–240). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.americasnlp-1.26