Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters


Abstract

As more High-Performance Computing (HPC) and Deep Learning (DL) applications scale out using GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the MPI operations used by such applications, All-to-All is one of the most communication-intensive and often becomes the bottleneck when scaling to larger GPU systems. Over the last decade, most research has focused on optimizing large GPU-resident data transfers. However, even in state-of-the-art GPU-aware MPI libraries, MPI_Alltoall communication for large GPU-resident data still suffers from poor performance due to the throughput limitations of commodity networks. The emergence of GPU-based compression algorithms with high throughput offers a way to reduce the volume of data transferred, and recent research on point-to-point online compression with these algorithms has shown promise on modern GPU clusters. In this paper, we redesign an MPI library to enable efficient collective-level online compression with an optimized host-staging scheme for All-to-All communication. We demonstrate that the proposed design achieves benefits at both the microbenchmark and application levels. At the microbenchmark level, the proposed design reduces All-to-All communication latency by up to 87%. For PSDNS, a traditional HPC application, it reduces All-to-All communication latency and total runtime by up to 29.2% and 21.8%, respectively, while ensuring data validation and not affecting application convergence time. For Microsoft’s DeepSpeed, a DL optimization library, the proposed design reduces MPI_Alltoall runtime by up to 26.4% compared to a state-of-the-art MPI library with point-to-point compression, again while ensuring data validation. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate MPI_Alltoall communication for HPC and DL applications.
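To make the collective-level idea concrete, below is a minimal, hypothetical C/MPI sketch of the general scheme the abstract describes: compress each per-destination block on the GPU once at the collective level, exchange the compressed sizes, transfer only the smaller payloads with non-blocking point-to-point operations, and decompress on arrival. This is an illustration under stated assumptions, not the authors' implementation: gpu_compress and gpu_decompress are identity stubs standing in for a high-throughput GPU codec, and host staging, CUDA streams, compression headroom, and error handling are all omitted.

/*
 * Illustrative sketch of collective-level online compression for
 * MPI_Alltoall. NOT the paper's actual design: gpu_compress() and
 * gpu_decompress() are placeholder identity "codecs"; a real design
 * would invoke GPU compression kernels and stage data through
 * pinned host buffers.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder codec (hypothetical): identity copy, returns "compressed"
 * size. A real codec may need output headroom larger than blk. */
static size_t gpu_compress(const void *src, size_t n, void *dst)
{ memcpy(dst, src, n); return n; }

static void gpu_decompress(const void *src, size_t n, void *dst)
{ memcpy(dst, src, n); }

static void alltoall_compressed(const char *sendbuf, char *recvbuf,
                                size_t blk, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    char   *cbuf = malloc((size_t)size * blk);   /* compressed send blocks */
    char   *rbuf = malloc((size_t)size * blk);   /* compressed recv blocks */
    size_t *clen = malloc(size * sizeof *clen);  /* sizes we send */
    size_t *rlen = malloc(size * sizeof *rlen);  /* sizes we will receive */

    /* Compress each per-destination block once, at the collective level. */
    for (int i = 0; i < size; i++)
        clen[i] = gpu_compress(sendbuf + (size_t)i * blk, blk,
                               cbuf + (size_t)i * blk);

    /* Exchange compressed sizes so each receiver knows what to expect. */
    MPI_Alltoall(clen, (int)sizeof *clen, MPI_BYTE,
                 rlen, (int)sizeof *rlen, MPI_BYTE, comm);

    /* Non-blocking pairwise exchange of the compressed payloads. */
    MPI_Request *reqs = malloc(2 * size * sizeof *reqs);
    for (int i = 0; i < size; i++) {
        MPI_Irecv(rbuf + (size_t)i * blk, (int)rlen[i], MPI_BYTE,
                  i, 0, comm, &reqs[i]);
        MPI_Isend(cbuf + (size_t)i * blk, (int)clen[i], MPI_BYTE,
                  i, 0, comm, &reqs[size + i]);
    }
    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

    /* Decompress each received block into its slot of recvbuf. */
    for (int i = 0; i < size; i++)
        gpu_decompress(rbuf + (size_t)i * blk, rlen[i],
                       recvbuf + (size_t)i * blk);

    free(cbuf); free(rbuf); free(clen); free(rlen); free(reqs);
}

The contrast with point-to-point compression is that the codec is driven once per All-to-All block at the collective layer, rather than independently inside every send/receive path, which is what lets the collective schedule and the compression pipeline be co-optimized.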

Citation (APA)

Zhou, Q., Kousha, P., Anthony, Q., Shafie Khorassani, K., Shafi, A., Subramoni, H., & Panda, D. K. (2022). Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13289 LNCS, pp. 3–25). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-07312-0_1
