Abstract
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models’ pretraining data and target language varieties.
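As a rough illustration of the adaptation recipe the abstract describes (vocabulary augmentation followed by continued language-specific pretraining), the sketch below uses the HuggingFace Transformers API. The model name, added tokens, and training setup are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch: adapting a multilingual BERT checkpoint to a new
# language variety by augmenting its vocabulary, then continuing masked-LM
# pretraining on a small unlabeled corpus. Names below are placeholders.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Vocabulary augmentation: add frequent wordpieces from the target variety
# that the multilingual vocabulary lacks (illustrative list only).
new_pieces = ["##ɨ", "##ʉ", "examplepiece"]
tokenizer.add_tokens(new_pieces)

# Grow the embedding matrix so the new tokens receive trainable vectors.
model.resize_token_embeddings(len(tokenizer))

# Language-specific pretraining would then continue the masked-LM objective
# on the small target-language corpus (e.g. with Trainer and
# DataCollatorForLanguageModeling), before fine-tuning a dependency parser
# on the small treebank.
```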
Cite
Chau, E. C., Lin, L. H., & Smith, N. A. (2020). Parsing with multilingual BERT, a small corpus, and a small treebank. In Findings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020 (pp. 1324–1334). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.findings-emnlp.118