The computing requirements of scientific applications have influenced processor design and have motivated the introduction and use of many-core processors, i.e., accelerators, for high performance computing (HPC). Consequently, it is now common for the compute nodes of HPC clusters to comprise multiple computing devices, including accelerators. Although execution time can be used to compare the performance of different computing devices, there exists no standard way to analyze application performance across devices with very different architectural designs and, thus, to understand why one outperforms another. Without this knowledge, a developer is handicapped when attempting to tune application performance effectively, as is a hardware designer when trying to understand how best to improve the design of computing devices. In this paper, we use the LULESH 1.0 proxy application to compare and analyze the performance of three different accelerators: the Intel® Xeon Phi™ and the NVIDIA Fermi and Kepler GPUs. Our study shows that LULESH 1.0 exhibits similar execution-time behavior across the three accelerators, but runs up to 7X faster on the Kepler. Despite the significant architectural differences between the Xeon Phi™ and the GPUs, and the differences in the metrics used to characterize their performance, we were able to quantify why the Kepler outperforms both the Fermi and the Xeon Phi™. To do this, we compared their achieved instructions per cycle and vectorization usage, as well as their memory behavior and their power and energy consumption.
Gallardo, E., Teller, P. J., Argueta, A., & Jaloma, J. (2016). Cross-accelerator performance profiling. In ACM International Conference Proceeding Series (Vol. 17-21-July-2016). Association for Computing Machinery. https://doi.org/10.1145/2949550.2949567