Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters


Abstract

As more High-Performance Computing (HPC) and Deep Learning (DL) applications scale out using GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the MPI operations used by such applications, All-to-All is one of the most communication-intensive and often becomes the bottleneck when scaling to larger GPU systems. Over the last decade, most research has focused on optimizing large GPU-resident data transfers. However, even in state-of-the-art GPU-aware MPI libraries, MPI_Alltoall communication for large GPU-resident data still suffers from poor performance due to the throughput limitations of commodity networks. The emergence of GPU-based compression algorithms with high throughput offers a way to reduce the volume of data transferred, and recent research on point-to-point online compression with these algorithms has shown promise on modern GPU clusters. In this paper, we redesign an MPI library to enable efficient collective-level online compression with an optimized host-staging scheme for All-to-All communication. We demonstrate that the proposed design achieves benefits at both the microbenchmark and application levels. At the microbenchmark level, the proposed design reduces All-to-All communication latency by up to 87%. For PSDNS, a traditional HPC application, it reduces All-to-All communication latency and total runtime by up to 29.2% and 21.8%, respectively, while ensuring data validation and not affecting application convergence time. For Microsoft’s DeepSpeed, a DL optimization library, the proposed design reduces MPI_Alltoall runtime by up to 26.4% compared to a state-of-the-art MPI library with point-to-point compression, again while ensuring data validation. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate MPI_Alltoall communication for HPC and DL applications.
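To make the collective-level idea concrete, below is a minimal, hypothetical C/MPI sketch of the general scheme the abstract describes: compress each per-destination block on the GPU once at the collective level, exchange the compressed sizes, transfer only the smaller payloads with non-blocking point-to-point operations, and decompress on arrival. This is an illustration under stated assumptions, not the authors' implementation: gpu_compress and gpu_decompress are identity stubs standing in for a high-throughput GPU codec, and host staging, CUDA streams, compression headroom, and error handling are all omitted.

/*
 * Illustrative sketch of collective-level online compression for
 * MPI_Alltoall. NOT the paper's actual design: gpu_compress() and
 * gpu_decompress() are placeholder identity "codecs"; a real design
 * would invoke GPU compression kernels and stage data through
 * pinned host buffers.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder codec (hypothetical): identity copy, returns "compressed"
 * size. A real codec may need output headroom larger than blk. */
static size_t gpu_compress(const void *src, size_t n, void *dst)
{ memcpy(dst, src, n); return n; }

static void gpu_decompress(const void *src, size_t n, void *dst)
{ memcpy(dst, src, n); }

static void alltoall_compressed(const char *sendbuf, char *recvbuf,
                                size_t blk, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    char   *cbuf = malloc((size_t)size * blk);   /* compressed send blocks */
    char   *rbuf = malloc((size_t)size * blk);   /* compressed recv blocks */
    size_t *clen = malloc(size * sizeof *clen);  /* sizes we send */
    size_t *rlen = malloc(size * sizeof *rlen);  /* sizes we will receive */

    /* Compress each per-destination block once, at the collective level. */
    for (int i = 0; i < size; i++)
        clen[i] = gpu_compress(sendbuf + (size_t)i * blk, blk,
                               cbuf + (size_t)i * blk);

    /* Exchange compressed sizes so each receiver knows what to expect. */
    MPI_Alltoall(clen, (int)sizeof *clen, MPI_BYTE,
                 rlen, (int)sizeof *rlen, MPI_BYTE, comm);

    /* Non-blocking pairwise exchange of the compressed payloads. */
    MPI_Request *reqs = malloc(2 * size * sizeof *reqs);
    for (int i = 0; i < size; i++) {
        MPI_Irecv(rbuf + (size_t)i * blk, (int)rlen[i], MPI_BYTE,
                  i, 0, comm, &reqs[i]);
        MPI_Isend(cbuf + (size_t)i * blk, (int)clen[i], MPI_BYTE,
                  i, 0, comm, &reqs[size + i]);
    }
    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

    /* Decompress each received block into its slot of recvbuf. */
    for (int i = 0; i < size; i++)
        gpu_decompress(rbuf + (size_t)i * blk, rlen[i],
                       recvbuf + (size_t)i * blk);

    free(cbuf); free(rbuf); free(clen); free(rlen); free(reqs);
}

The contrast with point-to-point compression is that the codec is driven once per All-to-All block at the collective layer, rather than independently inside every send/receive path, which is what lets the collective schedule and the compression pipeline be co-optimized.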

Citation (APA)

Zhou, Q., Kousha, P., Anthony, Q., Shafie Khorassani, K., Shafi, A., Subramoni, H., & Panda, D. K. (2022). Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13289 LNCS, pp. 3–25). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-07312-0_1
