Optimizing memory-bound SYMV kernel on GPU hardware accelerators

7Citations
Citations of this article
6Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Hardware accelerators are becoming ubiquitous high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA), profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs. Due to its inherent memory-bound nature, this kernel is very critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step to calculate the eigenpairs. Using a novel design to address the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x fold speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% fold improvement over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library in single and double precision arithmetics, respectively. © 2013 Springer-Verlag.

Cite

CITATION STYLE

APA

Abdelfattah, A., Dongarra, J., Keyes, D., & Ltaief, H. (2013). Optimizing memory-bound SYMV kernel on GPU hardware accelerators. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7851 LNCS, pp. 72–79). https://doi.org/10.1007/978-3-642-38718-0_10

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free