Modern GPUs open a new field for optimizing embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of optimization challenges, in particular tuning the program for the GPU memory hierarchy, whose organization and performance implications are radically different from those of general-purpose CPUs, and optimizing the program at the instruction level. In this paper we analyze different approaches to optimizing memory usage and access patterns for GPUs and propose a class of memory layout optimizations that can take full advantage of the unique memory hierarchy of NVIDIA CUDA. Furthermore, we analyze some classical optimization techniques and how they affect performance on a GPU. We used the Gravit gravity simulator to demonstrate these optimizations. The final optimized GPU version achieves an 87× speedup over the original CPU version. Almost 30% of this speedup is a direct result of the optimizations discussed in this paper.
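The abstract does not spell out which memory layouts are proposed. As a rough illustration of what a memory layout optimization can look like, the host-side C++ sketch below converts an array of structures (AoS) into a structure of arrays (SoA), a common transformation for making neighbouring GPU threads read neighbouring addresses. The record `ParticleAoS`, its fields, and the helper names are hypothetical and are not taken from Gravit or from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record (assumption: field names are illustrative only).
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure-of-arrays layout: one contiguous array per field.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Convert an array of structures into a structure of arrays.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    out.mass.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
    }
    assert(out.x.size() == in.size());
    return out;
}

// Byte distance between the x fields of particle i and particle i + 1.
// If thread i reads particle i, a smaller stride means the loads of
// neighbouring threads fall closer together in memory.
constexpr std::size_t aos_stride_bytes() { return sizeof(ParticleAoS); }
constexpr std::size_t soa_stride_bytes() { return sizeof(float); }
```

In the AoS layout the `x` values of consecutive particles are one full record apart, while in the SoA layout they are adjacent. Whether this particular transformation matches the layouts evaluated in the paper is an assumption of this sketch.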
Siegel, J., Ributzka, J., & Li, X. (2011). CUDA memory optimizations for large data-structures in the Gravit simulator. In Journal of Algorithms and Computational Technology (Vol. 5, pp. 341–362). https://doi.org/10.1260/1748-3018.5.2.341