On the Benefits of Learning to Route in Mixture-of-Experts Models

Nishanth Dikkala; Nikhil Ghosh; Raghu Meka; Rina Panigrahy; Nikhil Vyas; Xin Wang

Conference Proceedings (Open Access)

EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (2023) 9376-9396

DOI: 10.18653/v1/2023.emnlp-main.583

Abstract

Mixture-of-Experts (MoE) models, such as the Switch Transformer, allow us to scale model sizes while keeping the amount of compute time fixed. Prior work has established the computational benefits of MoE models. We investigate whether they offer benefits other than scaling up. A core component of these models is a router that routes input tokens to different experts in a layer. We present theoretical and empirical evidence that the router's ability to route intelligently confers a significant advantage to MoE models. We study synthetic settings where the input data is distributed in clusters and show theoretically and empirically that the router learns the cluster structure. We then perform experiments on real data using the T5X library, where we observe that a trainable router confers a non-trivial benefit over a non-trainable router.
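
To make the router's role concrete, here is a minimal sketch of top-1 routing in the style of the Switch Transformer: a linear router scores each expert for a token, the highest-scoring expert processes the token, and its output is scaled by the router probability. This is an illustration under stated assumptions, not the paper's code or the T5X implementation; the names `route_top1` and `moe_layer`, the toy router weights, and the two toy experts are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for a single token.

    router_weights holds one weight vector per expert (hypothetical layout).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_layer(token, router_weights, experts):
    """Send the token to its top-1 expert and scale the output by the gate."""
    expert, gate = route_top1(token, router_weights)
    return [gate * y for y in experts[expert](token)]


# Hypothetical toy setup: two experts, two-dimensional tokens.
router_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    lambda x: [2.0 * v for v in x],  # expert 0 doubles its input
    lambda x: [-v for v in x],       # expert 1 negates its input
]

print(route_top1([3.0, 0.0], router_weights))
print(moe_layer([0.0, 3.0], router_weights, experts))
```

In this sketch a trainable router corresponds to updating `router_weights` during training, while a non-trainable router would leave them fixed at their initial values.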

Citation (APA)

Dikkala, N., Ghosh, N., Meka, R., Panigrahy, R., Vyas, N., & Wang, X. (2023). On the Benefits of Learning to Route in Mixture-of-Experts Models. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 9376–9396). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.583
