Abstract
Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRLs). However, due to limited model capacity, the large difference between the sizes of the monolingual corpora available for high web-resource languages (HRLs) and for LRLs leaves little scope for co-embedding the LRL with the HRL, hurting downstream task performance on LRLs. In this paper, we argue that relatedness among languages in a language family, along the dimension of lexical overlap, can be leveraged to overcome some of the corpus limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm that enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This improves zero-shot transfer from related HRLs to LRLs without reducing HRL representation or accuracy. Unlike previous studies that dismissed the importance of token overlap, we show that token overlap matters in the low-resource related-language setting: synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.
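The abstract describes OBPE only at a high level. As a rough illustration of the idea, and not the authors' published implementation, the toy Python sketch below replaces standard BPE's global pair-count ranking with a generalized mean of per-language relative pair frequencies, so that merges forming tokens shared by the HRL and the LRL corpora are preferred over merges that are frequent in the HRL alone. The helper names, the exponent p = -2, and the toy corpora are all assumptions made for this example.

```python
# Illustrative sketch of an overlap-promoting BPE merge loop.
# With exponent p < 1 in the generalized mean, candidate pairs that occur
# in BOTH corpora score higher than pairs frequent in only one language.
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs over a corpus of {symbol tuple: freq}."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def generalized_mean(values, p):
    """Power mean; p = 1 is the arithmetic mean, p -> -inf approaches min."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def best_pair(corpora, p=-2.0, eps=1e-9):
    """Score each candidate pair by the generalized mean of its relative
    frequency in every language's corpus; return the highest-scoring pair."""
    per_lang = []
    for words in corpora.values():
        counts = pair_counts(words)
        total = sum(counts.values()) or 1
        per_lang.append({pair: c / total for pair, c in counts.items()})
    candidates = set().union(*per_lang)
    return max(
        candidates,
        key=lambda pair: generalized_mean(
            [lang.get(pair, eps) for lang in per_lang], p
        ),
    )

def merge_pair(words, pair):
    """Apply one BPE merge: replace each occurrence of `pair` by a new symbol."""
    merged = {}
    new_sym = "".join(pair)
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy usage: one high-resource and one low-resource corpus of related languages.
corpora = {
    "hrl": {tuple("pani"): 50, tuple("paniya"): 30, tuple("roti"): 40},
    "lrl": {tuple("pani"): 5, tuple("rotla"): 3},
}
for _ in range(3):
    pair = best_pair(corpora, p=-2.0)
    corpora = {lang: merge_pair(words, pair) for lang, words in corpora.items()}
    print("merged:", pair)
```

On the toy data, pairs from "pani", which appears in both corpora, outrank pairs from the HRL-only word "roti", even though the latter are frequent in the HRL; with p = 1 the ranking would collapse back to a frequency-weighted average, closer to vanilla BPE.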
Citation
Patil, V., Talukdar, P., & Sarawagi, S. (2022). Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 219–233). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.18