Auto-tuning dense vector and matrix-vector operations for Fermi GPUs

Abstract

In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of the level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately exploiting the fine-grained parallelism of the GPU. Our tuned kernels achieve 25-100% better performance than the current CUBLAS 3.2 library. © 2012 Springer-Verlag.
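To illustrate the kind of kernel being tuned, below is a minimal sketch of a block-wise SNRM2 reduction in CUDA. It is not the paper's auto-tuned kernel: the block size `BLOCK`, the grid cap, and the grid-stride accumulation pattern are illustrative assumptions, whereas the paper treats such launch parameters as tuning variables to be searched automatically.

```cuda
// Minimal SNRM2 sketch: per-block shared-memory reduction of sum(x[i]^2),
// followed by an atomic accumulation and a host-side square root.
// BLOCK is an assumed fixed block size; auto-tuning would vary it.
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed thread-block size (power of two)

__global__ void snrm2_partial(const float *x, int n, float *result)
{
    __shared__ float cache[BLOCK];
    float sum = 0.0f;

    // Grid-stride loop: each thread accumulates the squares of its elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += x[i] * x[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic add per block of its partial sum (float atomics
    // require compute capability 2.0, i.e. Fermi or newer).
    if (threadIdx.x == 0)
        atomicAdd(result, cache[0]);
}

int main()
{
    const int n = 1 << 20;
    float *d_x, *d_result, h_result = 0.0f;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_result, sizeof(float));
    cudaMemset(d_result, 0, sizeof(float));

    // Fill x with ones so the expected norm is sqrt(n).
    float *h_x = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    if (blocks > 1024) blocks = 1024;  // cap the grid; the stride loop covers the rest

    snrm2_partial<<<blocks, BLOCK>>>(d_x, n, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    // Host takes the final square root: ||x||_2 = sqrt(sum x_i^2).
    printf("SNRM2 = %f (expected %f)\n", sqrtf(h_result), sqrtf((float)n));

    free(h_x);
    cudaFree(d_x);
    cudaFree(d_result);
    return 0;
}
```

An auto-tuner in the spirit of the paper would benchmark variants of such a kernel over candidate block sizes, grid sizes, and per-thread workloads, then select the fastest configuration for the target Fermi GPU.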

Citation (APA)

Sørensen, H. H. B. (2012). Auto-tuning dense vector and matrix-vector operations for Fermi GPUs. In Lecture Notes in Computer Science (Vol. 7203 LNCS, pp. 619–629). Springer. https://doi.org/10.1007/978-3-642-31464-3_63
