Implementing the Himeno benchmark with CUDA on GPU clusters

Abstract

This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that uses MPI alongside CUDA streams to overlap GPU execution with data transfers scales linearly and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance. © 2010 IEEE.
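
The link between overlapping transfers and linear scaling can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the authors' method or measurements: the function names and the example timings are hypothetical, and the model assumes that a fully overlapped iteration costs the longer of compute and transfer, while a non-overlapped one costs their sum.

```python
# Toy model (hypothetical, not taken from the paper): per-iteration cost of a
# multi-GPU stencil step with and without overlapping halo transfers.

def step_time(compute_s, transfer_s, overlap):
    """Return the modeled time of one iteration in seconds.

    With overlap, transfers run concurrently with the kernel, so the step
    costs the longer of the two. Without overlap, the costs add up.
    """
    if overlap:
        return max(compute_s, transfer_s)
    return compute_s + transfer_s


def parallel_efficiency(compute_s, transfer_s, overlap):
    """Fraction of ideal (compute-only) scaling retained by one iteration."""
    return compute_s / step_time(compute_s, transfer_s, overlap)


if __name__ == "__main__":
    # Assumed example timings, chosen only for illustration.
    compute_s, transfer_s = 10e-3, 3e-3
    for overlap in (False, True):
        eff = parallel_efficiency(compute_s, transfer_s, overlap)
        print(f"overlap={overlap}: efficiency={eff:.2f}")
```

In this model, efficiency stays at 1.0 as long as the transfer time does not exceed the compute time, which matches the intuition behind hiding communication behind GPU execution.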

Citation (APA)

Phillips, E. H., & Fatica, M. (2010). Implementing the Himeno benchmark with CUDA on GPU clusters. In Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010. https://doi.org/10.1109/IPDPS.2010.5470394
