Scaling models to larger sizes to improve performance has become a trend in deep learning, and the sparsely activated Mixture-of-Experts (MoE) architecture is a promising way to scale models. However, training MoE models in existing systems is expensive, mainly due to the All-to-All communication between layers. All-to-All communication stems from the expert-centric paradigm: keeping experts in place and exchanging intermediate data to feed them. We propose a novel data-centric paradigm: keeping data in place and moving experts between GPUs. Since the experts can be smaller than the data, the data-centric paradigm can reduce the communication workload. Based on this insight, we develop Janus. First, Janus supports fine-grained asynchronous communication, which overlaps computation and communication, and it implements hierarchical communication to further reduce cross-node traffic by sharing fetched experts within the same machine. Second, when scheduling "fetching expert" requests, Janus uses a topology-aware priority strategy to utilize intra-node and inter-node links efficiently. Finally, Janus allows experts to be prefetched, so that downstream computation can start immediately once the previous step completes. Evaluated on a 32-A100 cluster, Janus reduces traffic by up to 16× and achieves up to 2.06× speedup compared with current MoE training systems.
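The following is a minimal sketch of the data-centric idea the abstract describes: instead of shipping activations to remote experts via All-to-All, each GPU keeps its data local, fetches the (smaller) expert weights it needs, and prefetches the next layer's experts while the current layer computes. All names here (fetch_expert, compute_layer, train_step) are illustrative assumptions for this sketch, not the Janus API.

```python
# Sketch of data-centric expert fetching with prefetch overlap (assumed names,
# not the actual Janus implementation).
from concurrent.futures import ThreadPoolExecutor

def fetch_expert(layer: int, expert_id: int) -> bytes:
    # Stand-in for pulling expert weights over intra-node / inter-node links.
    return f"weights(layer={layer}, expert={expert_id})".encode()

def compute_layer(layer: int, experts: list) -> None:
    # Stand-in for running the locally kept tokens through the fetched experts.
    print(f"layer {layer}: computed with {len(experts)} locally fetched experts")

def train_step(num_layers: int, experts_per_layer: int) -> None:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Prefetch layer 0's experts before computation starts.
        pending = [pool.submit(fetch_expert, 0, e) for e in range(experts_per_layer)]
        for layer in range(num_layers):
            experts = [f.result() for f in pending]  # wait for this layer's experts
            if layer + 1 < num_layers:
                # Overlap: start fetching the next layer's experts now, so the
                # downstream computation can begin as soon as this layer ends.
                pending = [pool.submit(fetch_expert, layer + 1, e)
                           for e in range(experts_per_layer)]
            compute_layer(layer, experts)

if __name__ == "__main__":
    train_step(num_layers=3, experts_per_layer=2)
```

In this sketch the fetch for layer k+1 is issued before layer k's compute runs, which mirrors the prefetching and computation/communication overlap the abstract attributes to Janus; the hierarchical sharing of fetched experts within a machine and the topology-aware priority scheduling are not modeled here.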
Liu, J., Wang, J. H., & Jiang, Y. (2023). Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models. In SIGCOMM 2023 - Proceedings of the ACM SIGCOMM 2023 Conference (pp. 486–498). Association for Computing Machinery, Inc. https://doi.org/10.1145/3603269.3604869