Self-Adapting Linear Algebra Algorithms and Software
Proceedings of the IEEE (2005)
- ISSN: 00189219
- DOI: 10.1109/JPROC.2004.840848
Available from ieeexplore.ieee.org
Abstract
One of the main obstacles to the efficient solution of scientific problems is the problem of tuning software, both to the available architecture and to the user problem at hand. We describe approaches for obtaining tuned high-performance kernels and for automatically choosing suitable algorithms. Specifically, we describe the generation of dense and sparse Basic Linear Algebra Subprograms (BLAS) kernels, and the selection of linear solver algorithms. However, the ideas presented here extend beyond these areas, which can be considered proof of concept.
JAMES DEMMEL, FELLOW, IEEE, JACK DONGARRA, FELLOW, IEEE, VICTOR EIJKHOUT,
ERIKA FUENTES, ANTOINE PETITET, RICHARD VUDUC, R. CLINT WHALEY, AND
KATHERINE YELICK, MEMBER, IEEE
Invited Paper
Keywords: Adaptive methods, Basic Linear Algebra Subprograms (BLAS), dense kernels, iterative methods, linear systems, matrix–matrix product, matrix–vector product, performance optimization, preconditioners, sparse kernels.
Manuscript received February 9, 2004; revised October 15, 2004.
J. Demmel is with the Computer Science Division, Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720 USA (e-mail: demmel@cs.berkeley.edu).
J. Dongarra is with the Computer Science Department, University of Tennessee, Knoxville, TN 37996 USA and also with the Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831 and the Computer Science Department, Rice University, Houston, TX USA.
V. Eijkhout and E. Fuentes are with the Innovative Computing Laboratory, University of Tennessee, Knoxville, TN 37996 USA (e-mail: eijkhout@cs.utk.edu; efuentes@cs.utk.edu).
A. P. Petitet is with Sun Microsystems, Paris 75016, France (e-mail: antoine.petitet@sun.com).
R. Vuduc is with the Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94551.
R. C. Whaley is with the Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530 USA (e-mail: whaley@cs.fsu.edu).
K. Yelick is with the Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720 USA (e-mail: yelick@cs.berkeley.edu).
I. INTRODUCTION
Speed and portability are conflicting objectives in the design of numerical software libraries. While the basic notion of confining most of the hardware dependencies in a small number of heavily used computational kernels stands, optimized implementation of these kernels is rapidly growing infeasible. As processors, and machine architectures in general, grow ever more complicated, a library consisting of reference implementations will lag far behind achievable performance; however, optimization for any given architecture is a considerable investment in time and effort, to be repeated for each new processor to which the library is ported.
For any given architecture, customizing a numerical kernel’s source code to optimize performance requires a comprehensive understanding of the exploitable hardware resources of that architecture. This primarily includes the memory hierarchy and how it can be utilized to maximize data reuse, as well as the functional units and registers and how these hardware components can be programmed to generate the correct operands at the correct time. Clearly, the size of the various cache levels, the latency of floating-point instructions, the number of floating-point units (FPUs), and other hardware constants are essential parameters that must be taken into consideration as well. Since this time-consuming customization process must be repeated whenever a slightly different target architecture is available, or even when a new version of the compiler is released, the relentless pace of hardware innovation makes the tuning of numerical libraries a constant burden.
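To make the role of such hardware-dependent parameters concrete, the following is a minimal sketch of a cache-blocked (tiled) matrix product in which the tile size is an explicit tuning parameter. The function name `blocked_matmul` and the parameter `nb` are assumptions introduced for illustration and are not taken from ATLAS or any other library. In pure Python the blocking brings no real speedup; the sketch only shows the loop structure that a compiled kernel would use to keep a tile of each operand in cache while it is reused.

```python
def blocked_matmul(a, b, nb=32):
    """Return the product of two list-of-lists matrices, computed tile by tile.

    nb is a hypothetical tuning parameter: the edge length of the square
    tiles. On real hardware it would be chosen so that the tiles being
    worked on fit in a given cache level.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must agree"
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, nb):
        for p0 in range(0, k, nb):
            for j0 in range(0, n, nb):
                # Update one tile of C using one tile of A and one tile of B.
                for i in range(i0, min(i0 + nb, m)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + nb, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + nb, n)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Any positive `nb` gives the same result; only the order in which memory is touched changes, which is why the best value depends on the machine rather than on the mathematics.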
In this paper we present two software systems that use heuristic search strategies for exploring the architecture parameter space: Automatically Tuned Linear Algebra Software (ATLAS) for dense linear algebra kernels and the BeBOP Optimized Sparse Kernel Interface (OSKI) for sparse linear algebra kernels. The resulting optimized kernels achieve a considerable speedup over the reference algorithms on all architectures tested.
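The general idea of searching a parameter space empirically can be sketched as follows. This is not the search procedure of ATLAS or OSKI, which the text describes only as heuristic; here a plain exhaustive timing over a short candidate list stands in for it. All names (`time_kernel`, `empirical_search`, `chunked_sum`) are hypothetical, and the toy kernel merely provides something parameterized to time.

```python
import time


def time_kernel(kernel, param, repeats=3):
    """Return the best wall-clock time of kernel(param) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(param)
        best = min(best, time.perf_counter() - start)
    return best


def empirical_search(kernel, candidates, measure=time_kernel):
    """Measure every candidate parameter value and return the cheapest one.

    measure(kernel, param) returns a cost; it defaults to wall-clock timing
    but can be replaced, for example by a cost model.
    """
    timings = {param: measure(kernel, param) for param in candidates}
    best = min(timings, key=timings.get)
    return best, timings


def chunked_sum(chunk, data=tuple(range(10_000))):
    """Toy kernel (illustration only): sum data in slices of length chunk."""
    total = 0
    for start in range(0, len(data), chunk):
        total += sum(data[start:start + chunk])
    return total


if __name__ == "__main__":
    best, timings = empirical_search(chunked_sum, [16, 64, 256, 1024])
    print("selected chunk size:", best)
```

Passing the measurement function as an argument keeps the selection logic separate from the timing, so the same driver can be exercised with a deterministic cost model when testing.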
In addition to the problem of optimizing kernels across architectures, there are often several formulations of the same operation to choose from. The variations can be the choice of data structure, as in OSKI, or the
0018-9219/$20.00 © 2005 IEEE
PROCEEDINGS OF THE IEEE, VOL. 93, NO. 2, FEBRUARY 2005 293