Designing high performance communication runtime for GPU managed memory: Early experiences

Abstract

Graphics Processing Units (GPUs) have become mainstream accelerators due to their low power footprint and massive parallelism. From CUDA 6.0 onward, NVIDIA has provided the Managed Memory capability, which unifies host and device memory into a single allocation and removes the need for explicit memory transfers between the two. Applications, particularly irregular ones, can benefit immensely from managed memory because of the high programming productivity it offers: data management and movement require minimal effort. The MVAPICH2 library uses runtime designs such as CUDA Inter-Process Communication (IPC) and GPUDirect RDMA (GDR) under the CUDA-Aware concept to offer high productivity and programmability with MPI on modern clusters. However, integrating managed memory with these features raises challenges for efficient small- and large-message communication. In this study, we present an initial evaluation of the managed memory capability and its interaction with the existing high-performance designs and features in the MVAPICH2 library. We propose new designs that enable efficient communication between managed memory buffers, and we fine-tune the transfers between managed memories residing on GPUs. To the best of our knowledge, this is the first evaluation and study of managed memory and its interaction with MPI runtimes. We present a detailed evaluation and analysis of the performance of the proposed designs. The Stencil2D communication kernel from the SHOC suite was redesigned to enable managed memory support; the evaluation shows a 4x improvement in stencil exchange times on 16 GPU nodes.
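The sketch below is a minimal illustration (not taken from the paper) of the usage pattern the abstract describes: a buffer allocated with cudaMallocManaged is handed directly to a CUDA-aware MPI call, with no explicit host/device copies. The kernel, message size, and pairwise exchange are illustrative placeholders, not the paper's Stencil2D benchmark.

```c
/* Sketch: managed-memory buffer passed directly to a CUDA-aware MPI call.
 * Assumes a CUDA-aware MPI library such as MVAPICH2. The message size and
 * neighbor-exchange pattern are illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void init(double *buf, int n, int rank) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (double)rank;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;            /* illustrative message size */
    double *buf;
    /* Single allocation visible to both host and device: no explicit
     * cudaMemcpy between separate host and device copies is needed. */
    cudaMallocManaged(&buf, n * sizeof(double));

    init<<<(n + 255) / 256, 256>>>(buf, n, rank);
    cudaDeviceSynchronize();

    /* The managed pointer is passed to MPI directly; the runtime decides
     * how to move the data (e.g. IPC, GDR, or host staging). */
    int peer = rank ^ 1;              /* simple pairwise exchange */
    if (peer < size) {
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0) printf("buf[0] = %f\n", buf[0]);
    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```

The productivity benefit noted in the abstract is visible here: the same pointer is used by the kernel, the MPI call, and host-side printing, leaving the placement and movement decisions to the CUDA driver and the MPI runtime.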

Citation (APA)

Banerjee, D. S., Hamidouche, K., & Panda, D. K. (2016). Designing high performance communication runtime for GPU managed memory: Early experiences. In 9th Workshop on General Purpose Processing using GPUs, GPGPU 2016 - Proceedings (pp. 82–91). Association for Computing Machinery, Inc. https://doi.org/10.1145/2884045.2884050
