Abstract
This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores 1.64 points higher and chrF scores 0.0749 points higher than the baseline.
Zheng, F., Reid, M., Marrese-Taylor, E., & Matsuo, Y. (2021). Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining. In Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021 (pp. 234–240). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.americasnlp-1.26