Implementing the Himeno benchmark with CUDA on GPU clusters

Everett H. Phillips; Massimiliano Fatica

Conference Proceedings

Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010 (2010)

DOI: 10.1109/IPDPS.2010.5470394

89 Citations

27 Readers

Abstract

This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance. © 2010 IEEE.
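
The figures in the abstract can be cross-checked with a little arithmetic. The following is a minimal sketch, not taken from the paper: it treats the quoted "over 50 GFlops" and "over 800 GFlops" as exact values, and it assumes a Tesla C1060 peak memory bandwidth of 102 GB/s, a number the abstract does not state. The function names are hypothetical.

```python
# Figures quoted in the abstract (both are stated there as lower bounds).
SINGLE_GPU_GFLOPS = 50.0
CLUSTER_GFLOPS = 800.0
NUM_GPUS = 16
BANDWIDTH_FRACTION = 0.83

# Assumption, not from the abstract: Tesla C1060 peak memory bandwidth in GB/s.
ASSUMED_PEAK_BANDWIDTH_GBS = 102.0


def parallel_efficiency(cluster_gflops, single_gflops, num_gpus):
    """Ratio of measured cluster throughput to ideal linear scaling."""
    return cluster_gflops / (single_gflops * num_gpus)


def sustained_bandwidth(fraction, peak_gbs):
    """Bandwidth implied by a given fraction of the theoretical peak."""
    return fraction * peak_gbs


efficiency = parallel_efficiency(CLUSTER_GFLOPS, SINGLE_GPU_GFLOPS, NUM_GPUS)
bandwidth = sustained_bandwidth(BANDWIDTH_FRACTION, ASSUMED_PEAK_BANDWIDTH_GBS)

print(f"parallel efficiency: {efficiency:.2f}")
print(f"sustained bandwidth: {bandwidth:.1f} GB/s")
```

With these inputs the efficiency comes out at 1.0, which is consistent with the linear scaling the abstract reports, and the implied sustained bandwidth is roughly 85 GB/s under the assumed peak.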

Cite

APA

Phillips, E. H., & Fatica, M. (2010). Implementing the Himeno benchmark with CUDA on GPU clusters. In Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010. https://doi.org/10.1109/IPDPS.2010.5470394
