Auto-tuning dense vector and matrix-vector operations for Fermi GPUs

Hans Henrik Brandenborg Sørensen

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7203 LNCS(PART 1) 619-629

DOI: 10.1007/978-3-642-31464-3_63

Abstract

In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of the level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display 25% to 100% better performance than the current CUBLAS 3.2 library. © 2012 Springer-Verlag.
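
For readers unfamiliar with the two routines named in the abstract, the following is a minimal serial sketch of what SNRM2 and SGEMV compute, together with a toy tuning loop. It is not the paper's CUDA code. The names `snrm2_blocked` and `autotune` and the `block_size` parameter are hypothetical, chosen only to illustrate the idea of timing several variants of a kernel and keeping the fastest. Python floats are double precision, so this does not reproduce single-precision rounding, and it omits the scaling that production BLAS implementations use to avoid overflow in the norm.

```python
import math
import time


def snrm2(x):
    """Euclidean norm sqrt(sum of x_i^2) of a vector (reference version)."""
    return math.sqrt(sum(v * v for v in x))


def sgemv(alpha, A, x, beta, y):
    """Return alpha*A*x + beta*y, with A given as a list of m rows of length n."""
    return [
        alpha * sum(a * b for a, b in zip(row, x)) + beta * yi
        for row, yi in zip(A, y)
    ]


def snrm2_blocked(x, block_size):
    """Same norm, accumulated as per-chunk partial sums followed by a final sum.

    Hypothetical stand-in for a kernel variant whose work per block is tunable.
    """
    partial = [
        sum(v * v for v in x[i:i + block_size])
        for i in range(0, len(x), block_size)
    ]
    return math.sqrt(sum(partial))


def autotune(x, block_sizes, repeats=3):
    """Time each candidate block size and return the fastest one (assumed scheme)."""
    best_size, best_time = None, float("inf")
    for bs in block_sizes:
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            snrm2_blocked(x, bs)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_size, best_time = bs, elapsed
    return best_size
```

The reference functions can serve as a correctness check for any tuned variant: every candidate that the tuning loop times should return the same result as `snrm2` up to rounding.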

Cite

Sørensen, H. H. B. (2012). Auto-tuning dense vector and matrix-vector operations for Fermi GPUs. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7203 LNCS, pp. 619–629). https://doi.org/10.1007/978-3-642-31464-3_63
