Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, on low-resource tasks, MoE models severely over-fit. In this work, we introduce effective regularization strategies: (1) dropout-style techniques for MoE layers, namely Expert Output Masking (EOM) and Final Output Masking (FOM); (2) Conditional MoE Routing (CMR), which learns which tokens require the extra capacity of MoE layers; and (3) curriculum learning methods that introduce low-resource pairs at later stages of training. All of these methods prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies yield an improvement of about +1 chrF++ on very low-resource language pairs.
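To make the two MoE-specific regularizers concrete, below is a minimal PyTorch-style sketch of Expert Output Masking and Conditional MoE Routing as described above. All module and parameter names (`EOMWrapper`, `CMRLayer`, `p_eom`, `cmr_gate`, etc.), the inverted-dropout rescaling, and the sigmoid gate are illustrative assumptions of this sketch, not the authors' implementation.

```python
# Hedged sketch of EOM and CMR on top of a generic token-level MoE feed-forward block.
# Names and details are assumptions for illustration, not the paper's code.

import torch
import torch.nn as nn


class EOMWrapper(nn.Module):
    """Expert Output Masking (EOM): during training, zero the combined expert
    output for a random fraction p_eom of tokens; the residual connection
    (applied outside this module) still carries those tokens forward."""

    def __init__(self, moe_ffn: nn.Module, p_eom: float = 0.2):
        super().__init__()
        self.moe_ffn = moe_ffn
        self.p_eom = p_eom

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        y = self.moe_ffn(x)
        if self.training and self.p_eom > 0:
            keep = (torch.rand(x.size(0), 1, device=x.device) >= self.p_eom).to(y.dtype)
            # Inverted-dropout-style rescaling (a design choice of this sketch).
            y = y * keep / (1.0 - self.p_eom)
        return y


class CMRLayer(nn.Module):
    """Conditional MoE Routing (CMR): a learned per-token gate mixes a shared
    dense FFN branch with the MoE branch, so only tokens that need the extra
    capacity rely heavily on the experts."""

    def __init__(self, moe_ffn: nn.Module, shared_ffn: nn.Module, d_model: int):
        super().__init__()
        self.moe_ffn = moe_ffn
        self.shared_ffn = shared_ffn
        self.cmr_gate = nn.Linear(d_model, 1)  # scalar gate per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.cmr_gate(x))  # (tokens, 1), in [0, 1]
        return (1.0 - g) * self.shared_ffn(x) + g * self.moe_ffn(x)


if __name__ == "__main__":
    d = 16
    # Stand-ins for a shared dense FFN and a sparsely gated MoE FFN.
    dense = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
    fake_moe = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
    x = torch.randn(8, d)
    print(EOMWrapper(fake_moe, p_eom=0.2).train()(x).shape)  # torch.Size([8, 16])
    print(CMRLayer(fake_moe, dense, d)(x).shape)             # torch.Size([8, 16])
```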
Citation
Elbayad, M., Sun, A., & Bhosale, S. (2023). Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 14237–14253). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.897