NV-group: Link-efficient reduction for distributed deep learning on modern dense GPU systems

Abstract

Advanced fabrics such as NVIDIA NVLink are enabling the deployment of dense Graphics Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large-scale GPU-enabled systems for distributed deep learning (DL) training, it is vital to design efficient communication primitives, such as the Allreduce operation, to achieve near-ideal speedup at scale. In this paper, we propose a link-efficient scheme based on NVLink-aware cooperative reduction kernels to significantly accelerate Allreduce operations for distributed deep learning applications. By overlapping computation and communication and maximizing utilization of all available NVLinks between CPU and GPU, as well as among GPUs, we demonstrate a 1.8X performance improvement of Allreduce on 1,536 GPUs compared to state-of-the-art GPU-aware MPI and NVIDIA NCCL libraries. Finally, we demonstrate 93.9% and 89.7% scaling efficiency (i.e., 15X and 172X speedup) for training ResNet-50 models using TensorFlow on a 16-GPU DGX-2 node and on 192 GPUs of the Summit system, respectively. To the best of our knowledge, this is the first study that achieves near-ideal scaling efficiency for distributed DL training and provides designs tailored for cutting-edge systems like DGX-2 and Summit clusters.
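
For context, the sketch below shows the baseline communication pattern the paper targets: a CUDA-aware MPI_Allreduce over a GPU-resident gradient buffer. This is an illustrative example, not the paper's NV-group implementation; the buffer size, data type, and round-robin device selection are assumptions.

/*
 * Minimal sketch (assumption: a CUDA-aware MPI library such as MVAPICH2-GDR
 * or Open MPI with CUDA support): summing a GPU-resident gradient buffer
 * across all ranks with MPI_Allreduce, one GPU per MPI rank.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Select one GPU per rank, round-robin over the node's devices (assumption). */
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);

    const size_t count = 1 << 20;   /* 1M floats, e.g. one gradient shard (illustrative) */
    float *d_grad = NULL;
    cudaMalloc((void **)&d_grad, count * sizeof(float));

    /* Fill the device buffer with this rank's "gradient" (all ones here). */
    float *h_buf = (float *)malloc(count * sizeof(float));
    for (size_t i = 0; i < count; i++) h_buf[i] = 1.0f;
    cudaMemcpy(d_grad, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);

    /* With CUDA-aware MPI, device pointers are passed directly; the library
       moves data over NVLink, PCIe, or the interconnect as appropriate. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Verify: every element should now equal the number of ranks. */
    cudaMemcpy(h_buf, d_grad, sizeof(float), cudaMemcpyDeviceToHost);
    if (rank == 0)
        printf("reduced value = %.1f (expected %d)\n", h_buf[0], size);

    cudaFree(d_grad);
    free(h_buf);
    MPI_Finalize();
    return 0;
}

In distributed DL training, this Allreduce is issued once per iteration over the model gradients, which is why its efficiency directly determines the scaling behavior reported in the abstract.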

Citation (APA)

Chu, C. H., Kousha, P., Awan, A. A., Khorassani, K. S., Subramoni, H., & Panda, D. K. (2020). NV-group: Link-efficient reduction for distributed deep learning on modern dense GPU systems. In Proceedings of the International Conference on Supercomputing. Association for Computing Machinery. https://doi.org/10.1145/3392717.3392771
