NV-group: Link-efficient reduction for distributed deep learning on modern dense GPU systems

Abstract

Advanced fabrics such as NVIDIA NVLink are enabling the deployment of dense Graphics Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large-scale GPU-enabled systems for distributed deep learning (DL) training, it is vital to design efficient communication primitives, such as the Allreduce operation, to achieve near-ideal speedup at scale. In this paper, we propose a link-efficient scheme based on NVLink-aware cooperative reduction kernels to significantly accelerate Allreduce operations for distributed deep learning applications. By overlapping computation and communication and maximizing utilization of all available NVLinks between CPU and GPU, as well as among GPUs, we demonstrate a 1.8X performance improvement of Allreduce on 1,536 GPUs compared to state-of-the-art GPU-aware MPI and NVIDIA NCCL libraries. Finally, we demonstrate 93.9% and 89.7% scaling efficiency (i.e., 15X and 172X speedup) for training ResNet-50 models using TensorFlow on a 16-GPU DGX-2 node and on 192 GPUs of the Summit system, respectively. To the best of our knowledge, this is the first study that achieves near-ideal scaling efficiency for distributed DL training and provides designs tailored for cutting-edge systems like DGX-2 and Summit clusters.
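
For context, the sketch below shows the baseline communication pattern the paper targets: a CUDA-aware MPI_Allreduce over a GPU-resident gradient buffer. This is an illustrative example, not the paper's NV-group implementation; the buffer size, data type, and round-robin device selection are assumptions.

/*
 * Minimal sketch (assumption: a CUDA-aware MPI library such as MVAPICH2-GDR
 * or Open MPI with CUDA support): summing a GPU-resident gradient buffer
 * across all ranks with MPI_Allreduce, one GPU per MPI rank.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Select one GPU per rank, round-robin over the node's devices (assumption). */
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);

    const size_t count = 1 << 20;   /* 1M floats, e.g. one gradient shard (illustrative) */
    float *d_grad = NULL;
    cudaMalloc((void **)&d_grad, count * sizeof(float));

    /* Fill the device buffer with this rank's "gradient" (all ones here). */
    float *h_buf = (float *)malloc(count * sizeof(float));
    for (size_t i = 0; i < count; i++) h_buf[i] = 1.0f;
    cudaMemcpy(d_grad, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);

    /* With CUDA-aware MPI, device pointers are passed directly; the library
       moves data over NVLink, PCIe, or the interconnect as appropriate. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Verify: every element should now equal the number of ranks. */
    cudaMemcpy(h_buf, d_grad, sizeof(float), cudaMemcpyDeviceToHost);
    if (rank == 0)
        printf("reduced value = %.1f (expected %d)\n", h_buf[0], size);

    cudaFree(d_grad);
    free(h_buf);
    MPI_Finalize();
    return 0;
}

In distributed DL training, this Allreduce is issued once per iteration over the model gradients, which is why its efficiency directly determines the scaling behavior reported in the abstract.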

Citation (APA)

Chu, C. H., Kousha, P., Awan, A. A., Khorassani, K. S., Subramoni, H., & Panda, D. K. (2020). NV-group: Link-efficient reduction for distributed deep learning on modern dense GPU systems. In Proceedings of the International Conference on Supercomputing. Association for Computing Machinery. https://doi.org/10.1145/3392717.3392771
