Sign up & Download
Sign in

Automatic Performance Tuning of Sparse Matrix Kernels

by Richard Wilson Vuduc
Matrix (2003)

Abstract

This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical-search, can by contrast achieve up to 31% of peak machine speed, and can be up to 4 faster. Given a matrix, kernel, and machine; our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly-tuned implementations of the SpMV kernel y y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of ATA x (or AAT x) and A x. We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate our performance with respect to such bounds, finding that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)

Cite this document (BETA)

Available from citeseerx.ist.psu.edu
Page 1
hidden

Automatic Performance Tuning of Sparse Matrix Kernels

Automatic Performance Tuning of Sparse Matrix Kernels
by
Richard Wilson Vuduc
B.S. (Cornell University) 1997
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Computer Science
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor James W. Demmel, Chair
Professor Katherine A. Yelick
Professor Sanjay Govindjee
Fall 2003
Page 2
hidden
The dissertation of Richard Wilson Vuduc is approved:
Chair Date
Date
Date
University of California, Berkeley
Fall 2003
Page 3
hidden
Automatic Performance Tuning of Sparse Matrix Kernels
Copyright 2003
by
Richard Wilson Vuduc
Page 4
hidden
1Abstract
Automatic Performance Tuning of Sparse Matrix Kernels
by
Richard Wilson Vuduc
Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor James W. Demmel, Chair
This dissertation presents an automated system to generate highly ecient, platform-
adapted implementations of sparse matrix kernels. These computational kernels lie at the
heart of diverse applications in scienti c computing, engineering, economic modeling, and
information retrieval, to name a few. Informally, sparse kernels are computational oper-
ations on matrices whose entries are mostly zero, so that operations with and storage of
these zero elements may be eliminated. The challenge in developing high-performance im-
plementations of such kernels is choosing the data structure and code that best exploits
the structural properties of the matrix|generally unknown until application run-time|for
high-performance on the underlying machine architecture (e.g., memory hierarchy con-
guration and CPU pipeline structure). We show that conventional implementations of
important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run
at 10% or less of peak machine speed on cache-based superscalar architectures. Our imple-
mentations of SpMV, automatically tuned using a methodology based on empirical-search,
can by contrast achieve up to 31% of peak machine speed, and can be up to 4 faster.
Given a matrix, kernel, and machine, our approach to selecting a fast implemen-
tation consists of two steps: (1) we identify and generate a space of reasonable implemen-
tations, and then (2) search this space for the fastest one using a combination of heuristic
models and actual experiments (i.e., running and timing the code). We build on the Spar-
sity system for generating highly-tuned implementations of the SpMV kernel y y+Ax,
where A is a sparse matrix and x; y are dense vectors. We extend Sparsity to support
tuning for a variety of common non-zero patterns arising in practice, and for additional
Page 5
hidden
2kernels like sparse triangular solve (SpTS) and computation of ATAx (or AATx) and A

x.
We develop new models to compute, for particular data structures and kernels, the
best absolute performance (e.g., M
op/s) we might expect on a given matrix and machine.
These performance upper bounds account for the cost of memory operations at all levels of
the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We
evaluate our performance with respect to such bounds, nding that the generated and tuned
implementations of SpMV and SpTS achieve up to 75% of the performance bound. This
nding places limits on the e ectiveness of additional low-level tuning (e.g., better instruc-
tion selection and scheduling). Instances in which we are further from the bounds (e.g., for
ATAx) indicate new opportunities to close the gap by applying existing automatic low-level
tuning technology. We also use these bounds to assess (partially) what architectures are
good for kernels like SpMV. Among other conclusions, we nd that performance improve-
ments may be possible for SpMV (and other streaming applications) by ensuring strictly
increasing cache line sizes in multi-level memory hierarchies.
The costs and steps of tuning imply changes to the design of sparse matrix libraries.
We propose extensions to the recent standardized interface, the Sparse Basic Linear Algebra
Subroutines (SpBLAS). We argue that such an extended interface complements existing
approaches to sparse code generation, and furthermore is a suitable building block for
widely-used higher-level scienti c libraries and systems (e.g., PETSc and MATLAB) to
provide users with high-performance sparse kernels.
Looking toward future tuning systems, we consider an aspect of the tuning problem
that is common to all current systems: the problem of search. Speci cally, we pose two
search-related problems. First, we consider the problem of stopping an exhaustive search
while providing approximate bounds on the probability that an optimal implementation
has been found. Second, we consider the problem of choosing at run-time one from among
several possible implementations based on the run-time input. We formalize both problems
in a manner amenable to attack by statistical modeling techniques. Our methods may
potentially apply broadly to tuning systems for as yet unexplored domains.
Professor James W. Demmel
Dissertation Committee Chair
Page 6
hidden
iTo my bringer of wing zings.
Page 7
hidden
ii
Contents
List of Figures vi
List of Tables xi
List of Symbols xiv
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Problem Context and History . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Hardware and software trends in sparse kernel performance . . . . . 6
1.2.2 Emergence of standard library interfaces . . . . . . . . . . . . . . . . 8
1.3 The Case for Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Summary, Scope, and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Basic Sparse Matrix Data Structures 18
2.1 Survey of Basic Data Structures . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Example: Contrasting dense and sparse storage . . . . . . . . . . . . 20
2.1.2 Dense storage formats . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.3 Sparse vector formats . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.4 Sparse matrix formats . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Experimental Comparison of the Basic Formats . . . . . . . . . . . . . . . . 39
2.2.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.2 Results on the Sparsity matrix benchmark suite . . . . . . . . . . . 41
2.3 Note on Recursive Storage Formats . . . . . . . . . . . . . . . . . . . . . . . 45
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Improved Register-Level Tuning of Sparse Matrix-Vector Multiply 51
3.1 Register Blocked Sparse Matrix-Vector Multiply . . . . . . . . . . . . . . . 53
3.1.1 Register blocking overview . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.2 Surprising performance behavior in practice . . . . . . . . . . . . . . 58
3.2 An Improved Heuristic for Block Size Selection . . . . . . . . . . . . . . . . 67
3.2.1 A ll ratio estimation algorithm . . . . . . . . . . . . . . . . . . . . 68
3.2.2 Tuning the ll estimator: cost and accuracy trade-o s . . . . . . . . 71
3.3 Evaluation of the Heuristic: Accuracy and Costs . . . . . . . . . . . . . . . 77
Page 8
hidden
iii
3.3.1 Accuracy of the Sparsity Version 2 heuristic . . . . . . . . . . . . . 78
3.3.2 The costs of register block size tuning . . . . . . . . . . . . . . . . . 85
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4 Performance Bounds for Sparse Matrix-Vector Multiply 93
4.1 A Performance Bounds Model for Register Blocking . . . . . . . . . . . . . 95
4.1.1 Modeling framework and summary of key assumptions . . . . . . . . 96
4.1.2 A latency-based model of execution time . . . . . . . . . . . . . . . . 97
4.1.3 Lower (and upper) bounds on cache misses . . . . . . . . . . . . . . 98
4.2 Experimental Evaluation of the Bounds Model . . . . . . . . . . . . . . . . 100
4.2.1 Determining the model latencies . . . . . . . . . . . . . . . . . . . . 102
4.2.2 Cache miss model validation . . . . . . . . . . . . . . . . . . . . . . 109
4.2.3 Key results: observed performance vs. the bounds model . . . . . . 121
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5 Advanced Data Structures for Sparse Matrix-Vector Multiply 143
5.1 Splitting variable-block matrices . . . . . . . . . . . . . . . . . . . . . . . . 144
5.1.1 Test matrices and a motivating example . . . . . . . . . . . . . . . . 147
5.1.2 Altering the non-zero distribution of blocks using ll . . . . . . . . . 149
5.1.3 Choosing a splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.1.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2 Exploiting diagonal structure . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.2.1 Test matrices and motivating examples . . . . . . . . . . . . . . . . 165
5.2.2 Row segmented diagonal format . . . . . . . . . . . . . . . . . . . . 166
5.2.3 Converting to row segmented diagonal format . . . . . . . . . . . . . 169
5.2.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.3 Summary and overview of additional techniques . . . . . . . . . . . . . . . . 179
6 Performance Tuning and Bounds for Sparse Triangular Solve 184
6.1 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.1.1 Improving register reuse: register blocking . . . . . . . . . . . . . . . 188
6.1.2 Using the dense BLAS: switch-to-dense . . . . . . . . . . . . . . . . 189
6.1.3 Tuning parameter selection . . . . . . . . . . . . . . . . . . . . . . . 189
6.2 Performance Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
6.2.1 Review of the latency-based execution time model . . . . . . . . . . 192
6.2.2 Cache miss lower and upper bounds . . . . . . . . . . . . . . . . . . 193
6.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.3.1 Validating the cache miss bounds . . . . . . . . . . . . . . . . . . . . 194
6.3.2 Evaluating optimized SpTS performance . . . . . . . . . . . . . . . . 195
6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Page 9
hidden
iv
7 Higher-Level Sparse Kernels 202
7.1 Automatically Tuning ATA x for the Memory Hierarchy . . . . . . . . . . . 203
7.2 Upper Bounds on ATA x Performance . . . . . . . . . . . . . . . . . . . . . 205
7.2.1 A latency-based execution time model . . . . . . . . . . . . . . . . . 207
7.2.2 A lower bound on cache misses . . . . . . . . . . . . . . . . . . . . . 207
7.3 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 209
7.3.1 Validation of the cache miss model . . . . . . . . . . . . . . . . . . . 209
7.3.2 Performance evaluation of our ATA x implementations . . . . . . . 215
7.4 Matrix Powers: A

 x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7.4.1 Basic serial sparse tiling algorithm . . . . . . . . . . . . . . . . . . . 220
7.4.2 Preliminary results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
8 Library Design and Implementation 238
8.1 Rationale and Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
8.2 Overview of the Sparse BLAS interface . . . . . . . . . . . . . . . . . . . . . 240
8.3 Tuning Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.3.1 Interfaces for sparse A&AT , ATA x, and AAT x . . . . . . . . . . . 248
8.3.2 Kernel-speci c tune routines . . . . . . . . . . . . . . . . . . . . . . 250
8.3.3 Handle pro le save and restore . . . . . . . . . . . . . . . . . . . . . 252
8.4 Complementary Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
9 Statistical Approaches to Search 258
9.1 Revisiting The Case for Search: Dense Matrix Multiply . . . . . . . . . . . 261
9.1.1 Factors in
uencing matrix multiply performance . . . . . . . . . . . 261
9.1.2 A needle in a haystack: the need for search . . . . . . . . . . . . . . 263
9.2 A Statistical Early Stopping Criterion . . . . . . . . . . . . . . . . . . . . . 265
9.2.1 A formal model and stopping criterion . . . . . . . . . . . . . . . . . 267
9.2.2 Results and discussion using PHiPAC data . . . . . . . . . . . . . . 270
9.3 Statistical Classi ers for Run-time Selection . . . . . . . . . . . . . . . . . . 276
9.3.1 A formal framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
9.3.2 Parametric data model: linear regression modeling . . . . . . . . . . 279
9.3.3 Parametric geometric model: separating hyperplanes . . . . . . . . . 280
9.3.4 Nonparametric geometric model: support vectors . . . . . . . . . . . 281
9.3.5 Results and discussion with PHiPAC data . . . . . . . . . . . . . . . 281
9.4 A Survey of Empirical Search-Based Approaches to Code Generation . . . . 287
9.4.1 Kernel-centric empirical search-based tuning . . . . . . . . . . . . . 290
9.4.2 Compiler-centric empirical search-based tuning . . . . . . . . . . . . 296
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
10 Conclusions and Future Directions 304
10.1 Main Results for Sparse Kernels . . . . . . . . . . . . . . . . . . . . . . . . 305
10.2 Summary of High-Level Themes . . . . . . . . . . . . . . . . . . . . . . . . 306
10.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Page 11
hidden
vi
List of Figures
1.1 SpMV performance trends across architectures and over time . . . . . . . . 9
1.2 Spy plot of sparse matrix raefsky3 . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 The need for search: SpMV performance on raefsky3 across six platforms . 13
2.1 Dense column-major format . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Dense row-major format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Dense block-major format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Dense packed lower triangular (column major) format . . . . . . . . . . . . 24
2.5 Dense band format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Sparse vector example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Compressed sparse row (CSR) format . . . . . . . . . . . . . . . . . . . . . 28
2.8 Compressed sparse column (CSC) format . . . . . . . . . . . . . . . . . . . 29
2.9 Diagonal format (DIAG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Modi ed sparse row (MSR) format . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 ELLPACK/ITPACK format . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.12 Jagged diagonal format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.13 Block compressed sparse row (BCSR) format . . . . . . . . . . . . . . . . . 37
2.14 Variable block row (VBR) format . . . . . . . . . . . . . . . . . . . . . . . . 39
2.15 SpMV performance using baseline formats on Matrix Benchmark Suite #1:
Sun Ultra 2i (top) and Ultra 3 (bottom) platforms . . . . . . . . . . . . . . 47
2.16 SpMV performance using baseline formats on Matrix Benchmark Suite #1:
Intel Pentium III (top) and Pentium III-M (bottom) platforms . . . . . . . . 48
2.17 SpMV performance using baseline formats on Matrix Benchmark Suite #1:
IBM Power3 (top) and Power4 (bottom) platforms . . . . . . . . . . . . . . 49
2.18 SpMV performance using baseline formats on Matrix Benchmark Suite #1:
Intel Itanium 1 (top) and Itanium 2 (bottom) platforms . . . . . . . . . . . 50
3.1 Example C implementations of matrix-vector multiply for dense and sparse
BCSR matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Example of a non-obvious blocking . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 SpMV BCSR Performance Pro les: Sun Platforms . . . . . . . . . . . . . . 62
3.4 SpMV BCSR Performance Pro les: Intel (x86) Platforms . . . . . . . . . . 63
3.5 SpMV BCSR Performance Pro les: IBM Platforms . . . . . . . . . . . . . . 64
Page 16
hidden
xi
List of Tables
1.1 The need for search: Summary of SpMV performance . . . . . . . . . . . . 15
2.1 Summary across platforms of baseline SpMV performance . . . . . . . . . . 43
3.1 Summary of SpMV register pro les (dense matrix) . . . . . . . . . . . . . . 66
3.2 Top 5 predictions compared to actual performance: Matrix 27 on Itanium 1 84
4.1 Machine-speci c parameters for performance model evaluation . . . . . . . 103
4.2 Summary of load and cache miss count accuracy . . . . . . . . . . . . . . . 115
5.1 Best performance and block sizes under register blocking: Matrix 12-raefsky4148
5.2 Variable block test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3 Diagonal test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.4 Comparing storage requirements between row segmented diagonal storage
and register blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.1 Triangular matrix benchmark suite . . . . . . . . . . . . . . . . . . . . . . . 186
7.1 Proof-of-principle results for serial sparse tiled A

x on stencil matrices: Ultra
2i platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.2 Proof-of-principle results for serial sparse tiled A

 x on stencil matrices:
Pentium III platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
8.1 SpBLAS properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
8.2 SpBLAS computational kernels . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.3 Proposed SpBLAS extensions to support tuning . . . . . . . . . . . . . . . . 247
A.1 Historical SpMV data: microprocessors . . . . . . . . . . . . . . . . . . . . 348
A.2 Historical SpMV data: vector processors . . . . . . . . . . . . . . . . . . . . 350
B.1 Hardware platforms (1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
B.2 Hardware platforms (2/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
B.3 Sparsity matrix benchmark suite: Matrices 1{9 ( nite element matrices) . 354
B.4 Sparsity matrix benchmark suite: Matrices 10{17 ( nite element matrices) 355
Page 18
hidden
xiii
E.7 Comparison of register blocked SpMV performance to the upper bound model:
Ultra 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
E.8 Comparison of register blocked SpMV performance to the upper bound model:
Pentium III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
E.9 Comparison of register blocked SpMV performance to the upper bound model:
Pentium III-M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
E.10 Comparison of register blocked SpMV performance to the upper bound model:
Power3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
E.11 Comparison of register blocked SpMV performance to the upper bound model:
Power4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
E.12 Comparison of register blocked SpMV performance to the upper bound model:
Itanium 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
E.13 Comparison of register blocked SpMV performance to the upper bound model:
Itanium 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
G.1 Best unaligned block compressed sparse row splittings on variable block ma-
trices, compared to register blocking: Ultra 2i . . . . . . . . . . . . . . . . . 416
G.2 Best unaligned block compressed sparse row splittings on variable block ma-
trices, compared to register blocking: Pentium III-M . . . . . . . . . . . . . 417
G.3 Best unaligned block compressed sparse row splittings on variable block ma-
trices, compared to register blocking: Power4 . . . . . . . . . . . . . . . . . 418
G.4 Best unaligned block compressed sparse row splittings on variable block ma-
trices, compared to register blocking: Itanium 2 . . . . . . . . . . . . . . . . 419
H.1 Best row segmented diagonal + register blocking performance, compared to
register blocking only: Ultra 2i . . . . . . . . . . . . . . . . . . . . . . . . . 422
H.2 Best row segmented diagonal + register blocking performance, compared to
register blocking only: Pentium III-M . . . . . . . . . . . . . . . . . . . . . 422
H.3 Best row segmented diagonal + register blocking performance, compared to
register blocking only: Power4 . . . . . . . . . . . . . . . . . . . . . . . . . . 423
H.4 Best row segmented diagonal + register blocking performance, compared to
register blocking only: Itanium 2 . . . . . . . . . . . . . . . . . . . . . . . . 423
I.1 Block size summary data for the Sun Ultra 2i platform . . . . . . . . . . . . 426
I.2 Block size summary data for the Intel Pentium III platform . . . . . . . . . 427
I.3 Block size summary data for the IBM Power3 platform . . . . . . . . . . . . 428
I.4 Block size summary data for the Intel Itanium platform . . . . . . . . . . . 431
J.1 Tabulated performance data under serial sparse tiling: Ultra 2i . . . . . . . 432
J.2 Tabulated performance data under serial sparse tiling: Pentium III . . . . . 433
Page 19
hidden
xiv
List of Symbols
BCSR Block compressed sparse row storage format
BLAS Basic Linear Algebra Subroutines
COO Coordinate storage format
CSC Compressed sparse columnw storage format
CSR Compressed sparse row storage format
DIAG Diagonal storage format
ELL ELLPACK/ITPACK storage format
FEM Finite element method
GEMM Dense matrix-matrix multiply BLAS routine
GEMV Dense matrix-vector multiply BLAS routine
JAD Jagged diagonal storage format
MPI Message Passing Interface
MSR Modi ed sparse row storage format
RSDIAG Row segmented diagonal format
SKY Skyline storage format
TBCSR Tiled blocked compressed sparse row format
TCSR Tiled compressed sparse row format
VBR Variable block row storage format
SpA&AT The sparse kernel y y +Ax; z z +ATw
SpATA The sparse kernel y y +ATA x
SpAAT The sparse kernel y y +AAT x
SpBLAS Sparse Basic Linear Algebra Subroutines
SpMM Sparse matrix-multiple vector multiply
Page 21
hidden
xvi
Acknowledgments
First and foremost, I thank Jim Demmel for being a wonderfully supportive and engaging
advisor. He is truly a model of what great scientists can ask, imagine, and achieve. I
have yet to meet anyone with his patience, his attention to detail, or his seemingly in nite
capacity for espresso-based drinks. I will miss our occasional chats when walking between
Soda Hall and Brewed Awakening.
I also thank Kathy Yelick for being so very understanding, not to mention a good
listener|she often understood my incoherent, ill-formed questions and statements long
before I even nished them. I especially appreciate her guidance on and insights into the
systems aspects of computer science.
Among other faculty at Berkeley, I thank Susan Graham, Ching-Hsui Cheng, and
Sanjay Govindjee for taking the time to serve on my quals and dissertation committees.
External to Cal, Zhaojun Bai has always made me feel as though my work were actually
interesting. David Keyes has given me more of his time than I probably deserve, arranging
in particular for access to Power4 and Ultra3 based machines (the latter due also in part to
Barry Smith). For my rst experience with research, I thank Bohdan Balko and Maile Fries
at the Institute for Defense Analyses. I would have been prepared neither to experience the
frequent joy nor the occasional pain of science without their early guidance.
Of my countless helpful colleagues, I especially admire Je Bilmes, Melody Ivory,
and Remzi Arpaci-Dusseau for their amazing humor, intelligence, and hard work. They
were my personal models of graduate student success. The inspiration for this dissertation
speci cally comes from early work by Je Bilmes and Krste Asanovic on PHiPAC, as well
as pioneering work on Sparsity by my friend and colleague, Eun-Jin Im.
For their pleasant distractions (e.g., video games, Cowboy Bebop, snacks, dim
sum, useless informationabout the 1980s, the occasional Birkball experience, and excur-
sions in and around the greater Bay Area including the Lincoln Highway), I am especially
grateful to Andy Begel, Jason Hong, (Honorary Brother) Benjamin Horowitz, Francis Li,
and Jimmy Lin. Jason Riedy reminded me to eat lunch every now and again, introduced me
to the historic Paramount Theater in Oakland, and made delicious cookies when I needed
them most (namely, at my qualifying exam!). For being inquisitive colleagues and under-
standing companions, I o er additional thanks and best wishes to Mark Adams, Sharad
Agarwal, David Bindel, Tzu-Yi Chen and Chris Umans, Inderjit Dhillon, Plamen Koev,
Page 22
hidden
xvii
Osni Marques, Andrew Ng, David Oppenheimer, and Christof Vomel. I learned as much
about academic life from this group as I did by experiencing it myself.
I was very fortunate to have worked alongside a number of incredibly talented
undergraduate researchers. These individuals contributed enormously to the data collected
in this dissertation: Michael deLorimier, Attila Gyulassy, Jen Hsu, Sriram Iyer, Shoaib
Kamil, Ben Lee, Jin Moon, Rajesh Nishtala, and Tuyet-Linh Phan. I also owe much
inspiration for learning and teaching to my rst class of CS 61B students (Fall '97). I
have a particularly deep respect for Sriram Iyer and Andy Atwal, from whom I learned
much about the greeting card business and writing code in \the real world."
The Extended Bu alo Funk Family imparted much wisdom on me. I thank them
for everything since our days on the Hill. And from way, way back in the day, I hope Sam
Z., Ted S., and Kevin C. will forgive me for having not kept in better touch while I whittled
away at this rather bloated tome.
Agata, you are my best friend and the love of my life. I could not have nished
this \ridiculous thing" without your constant emotional support, con dence, understanding,
tolerance, and faith in me. \I choo-choo-choose you."
Most of all, I owe an especially deep gratitude to Mom for her never-ending love,
patience, and encouragement. She created the circumstances that allowed me to pursue my
dreams when there should not have been a choice to do so. I will always remember the
things she sacri ced to make a better life for me possible.
This research was supported in part by the National Science Foundation under NSF
Cooperative Agreement No. ACI-9813362, NSF Cooperative Agreement No. ACI-9619020,
the Department of Energy under DOE Grant No. DE-FC02-01ER25478, and a gift from
Intel. The information presented here does not necessarily re
ect the position or the policy
of the Government and no ocial endorsement should be inferred. Some experiments were
performed at the High Performance Computing Research Facility, Mathematics and Com-
puter Science Division, Argonne National Laboratory. Many experiments were completed
thanks to the friendly sta at Ca e Strada, Brewed Awakening, the Free Speech Movement
Cafe, the Hudson Bay Cafe, and the Co ee Bean. For help with collecting test matrices, I
thank the following individuals: Mark Adams, Robert Taylor, Ali Pinar, Parry Husbands,
Tzu-Yi Chen, Jason Riedy, Xiaoye Li, Pauletto Giorgio, Michel Juillard, Sanjay Govind-
jee, Z. Sroczynski, Osni Marques, Darren Wilkinson, Linda Garside, Ken Pearson, Roman
Geus, David Bindel, and Florin Dobrian.
Page 23
hidden
1Chapter 1
Introduction
Contents
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Problem Context and History . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Hardware and software trends in sparse kernel performance . . . . 6
1.2.2 Emergence of standard library interfaces . . . . . . . . . . . . . . . 8
1.3 The Case for Search . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Summary, Scope, and Outline . . . . . . . . . . . . . . . . . . . . 14
This dissertation presents an automated system to generate highly ecient, platform-
adapted implementations of sparse matrix kernels. These kernels are frequently compu-
tational bottlenecks in diverse applications in scienti c computing, engineering, economic
modeling, and information retrieval (to name a few). However, the task of extracting near-
peak performance from them on modern cache-based superscalar machines has proven to be
extremely dicult. In practice, performance is a complicated function of many factors, in-
cluding the underlying machine architecture, compiler technology, each kernel's instruction
mix and memory access behavior, and the nature of the input data which might be known
only at run-time. The gap in performance between tuned and untuned code can be severe.
We show that one of the central sparse kernels, sparse matrix-vector multiply (SpMV), has
historically run at 10% or less of machine peak. Our implementations, automatically tuned
using a methodology based on empirical search, can by contrast achieve up to 31% of peak
machine speed, and can be up to 4 faster.
Page 27
hidden
5the key intuition behind our approach by presenting the surprising quantitative results of
an experiment in tuning SpMV (Section 1.3): we show instances on modern architectures
in which observed performance behavior does not match what we would reasonably expect,
and worse still, that performance behavior varies dramatically across platforms. These ob-
servations compose the central insight behind our claim that automatic performance tuning
requires a platform-speci c, search-based approach.
1.1 Contributions
Recall that the speci c starting point of this dissertation is Sparsity [167, 164, 166], which
generates tuned implementations of the SpMV kernel, y y + Ax, where A is a sparse
matrix and x; y are dense vectors. We improve and extend this work in the following ways:
 We consider an implementation space for SpMV that includes a variety of data struc-
tures beyond those originally proposed by Sparsity (namely, splitting for multiple
block substructure and diagonals, discussed in Chapter 5). We also present an im-
proved heuristic for the tuning parameter selection for the so-called register blocking
optimization (Chapter 3) [316].
 We apply these techniques to new sparse kernels, including
{ sparse triangular solve (SpTS) (Chapter 6): y T1x, where T is a sparse
triangular matrix [319],
{ multiplication by ATA or AAT (Chapter 7): y ATAx or y AATx [317].
{ applying powers of a matrix, i.e., computing y Akx, where k is an positive
integer.
 We develop new matrix- and architecture-speci c bounds on performance, as a way
to evaluate the quality of code being generated (Chapter 4). For example, sometimes
these bounds show that our implementations are within, say, 75% of \optimal" in a
sense to be made precise in Chapter 4. In short, the bounds guide us in understanding
when we should expect the pay-o s from low-level tuning (e.g., better instruction
scheduling) to be signi cant [316]. Moreover, these bounds partially suggest what
architectures are well-suited to sparse kernels. We also study architectural aspects
Page 28
hidden
6and implications, in particular, nding that strictly increasing line sizes could boost
performance for SpMV, and streaming applications more generally.
 We examine the search problem as a problem in its own right (Chapter 9). We
pose two problems that arise in the search process, and show how these problems
are amenable to statistical modeling techniques [313]. Our techniques complement
existing approaches to search, and will be broadly applicable to future tuning systems.
(Citations refer to earlier versions of this material; this dissertation provides additional
details and updated results on several new architectures.)
1.2 Problem Context and History
Developments in automatic performance tuning have been driven both by trends in hardware
architectures and the emergence of standardization in software libraries for computational
kernels. Below, we provide a brief history of these technologies and trends that are central
to our work. We explore connections to related work more deeply in subsequent chapters.
1.2.1 Hardware and software trends in sparse kernel performance
We begin by arguing that trends in SpMV performance suggest an increasing gap between
what level of performance is possible when one relies solely on improvements in hardware and
compiler technology compared to what is possible with software tuning. This gap motivates
continued innovations in algorithmic and low-level tuning, in the spirit of automatic tuning
systems like the one we are proposing for sparse kernels.
Although Moore's Law suggests that microprocessor transistor capacity|and hence
performance|should double every 1.5{2 years,1 the extent to which applications can realize
the bene ts of these improvements depends strongly on memory access patterns. Analysts
have observed an exponentially increasing gap between the CPU cycle times and memory
access latencies|this phenomenon is sometimes referred to as the memory wall [333], re-
ecting a lack of balanced machine designs [216, 65]. However, Ertl notes that simultaneous
improvements in memory system design have for the time being still hidden this memory
1At least until physical (e.g., thermal and atomic) barriers are encountered [229]: current projections
suggest Moore's Law can be maintained at least until 2010 [188].
Page 29
hidden
7wall e ect for at least some widely used applications [113]. Still, few argue with the idea
that the gap exists and is worsening.
Figure 1.1 (top) shows where SpMV performance stands relative to Moore's Law.
Speci cally, we show SpMV speeds in M
op/s over time based on studies conducted on a
variety of architectures since 1987 [266, 51, 23, 88, 301, 326, 52, 129, 293, 197, 223, 167,
221, 323, 316]. (The tabulated data and remarks on methodology appear in Appendix A.
Data points taken from the NAS CG benchmark [23] are handled specially, and marked in
Figure 1.1 by an 'N'. See Appendix A.) We distinguish between vector processors (shown
with solid red triangles) and microprocessors (shown with blue squares and green aster-
isks), since Moore's Law applies to microprocessors. Furthermore, for microprocessors we
separate performance results into \reference" (or \untuned") implementations (shown by
green asterisks), and \tuned" implementations (shown by hollow blue squares)|in most
studies, authors report performance both before and after application of some proposed
data structure or optimization technique.2 Finally, through each set of points we show
performance trend lines of the form p(t) = p02
t
 , where t is time (in years since 1987), and
p0;  are chosen by a linear regression t of the data to log2p(t). In this model,  is the
doubling-time, i.e., the period of time after which performance doubles. Below, we answer
the question of how the doubling-time  compares between untuned implementations, tuned
implementations, and theoretical peak performance (Moore's Law).
First, observe that the untuned performance doubles approximately every two
years (2.07), which is consistent with Moore's Law. Indeed, if one examines the doubling-
time of the peak speeds for the machines shown in the plot, one nds that peak performance
doubles every 1.94 years. SpMV is memory-bound since there are only two
oating point
operations (
ops) worth of work for every
oating point operand fetched from main memory.
Thus, one possibly surprising aspect of the trend in untuned performance is that it scales
according to Moore's Law. In fact, SpMV represents one type of workload, which we later
show has a memory access pattern that is largely like streaming applications (Chapters 3{
4). It may be that increasingly aggressive cache and memory system designs (e.g., hardware
prefetching, longer cache line sizes, support for larger numbers of outstanding misses) have
helped to mask the e ective latency of memory access for SpMV workloads, thereby helping
SpMV scale with processor performance improvements [113].
2We do not separate by tuning in the vector processor case due to a lack of consistently separately
reported performance results.
Page 32
hidden
10
developed on top of the BLAS in the widely-used LAPACK library [14]. Applications that
can be expressed in terms of calls to the BLAS or LAPACK bene t both in performance,
as well as in reduced costs of development and porting.
Motivated by the cost of vendor libraries and the increasing complexity of tuning
even dense matrix multiply on a rapidly growing list of machine architectures, Bilmes,
et al., developed the PHiPAC system for automatically generating dense matrix multiply
routines tuned to a given architecture [46]. PHiPAC originally proposed (1) a set of coding
conventions, using C as a kind of high-level assembly language, to expose instruction-level
parallelism and scheduling opportunities to the compiler, (2) various ways to write matrix
multiply using these conventions, and (3) a prototype system to search over the space of
these implementations. Whaley and Dongarra extended the scope of these ideas to the
entire BLAS and to new architectures (notably, Intel x86 machines) in their ATLAS system
[324], which is at present included in the commercial engineering package, MATLAB. Both
systems report performance that is comparable, and often exceeding, that of hardware
vendor-tuned libraries.
A number of libraries and interfaces have been developed for sparse kernels [267,
258, 116]. Indeed, the most recent revision of the BLAS standard speci es a Sparse Basic
Linear Algebra Subroutines (SpBLAS) interface [108, 109]. Unlike prior proposals, SpBLAS
hides the data structure from the user, thereby allowing the library implementation to
make a decision about what data structure or particular implementation to use. The idea
of automatic tuning as proposed in PHiPAC, as well as the then on-going development of
SpBLAS provided some of the motivation for the Sparsity system. As mentioned before,
the system proposed in this dissertation extends Sparsity. An important question which
we later address is what implications the results of Sparsity and of this dissertation have
for the design of sparse kernel library interfaces like the SpBLAS (Chapter 8).
Automatic tuning systems have been developed to support the growing list of
performance-critical computational kernels. In the domain of signal and image processing,
these systems include FFTW (developed at the same time as PHiPAC, and also presently
distributed with MATLAB) [123, 122], SPIRAL [255], and UHFFT [224]. Automatic tuning
is a sensible approach for signal processing due to the large number of embedded platforms|
and, therefore, varied architectures|on which these kernels are likely to run. The Message
Passing Interface (MPI) API [126] provides an interface for application level distributed
parallel communications. As with the BLAS, MPI has been widely adopted and hardware
Page 34
hidden
12
0 2656 5312 7936 10592 13248 15904 18560 21216
0
2656
5312
7936
10592
13248
15904
18560
21216
1.49 million non−zeros
Matrix 02−raefsky3
0 8 16 24 32 40 48 56 64 72 80
0
8
16
24
32
40
48
56
64
72
80
Matrix 02−raefsky3: 8x8 blocked submatrix (1:80, 1:80)
Figure 1.2: Spy plot of sparse matrix raefsky3. (Left) Sparse matrix raefsky3,
arising from a nite element discretization of an object in a
uid
ow simulation. This
matrix has dimension 21216 and contains approximately 1.5 million non-zeros. (Right)
Matrix raefsky3 consists entirely of 88 dense blocks, uniformly aligned as shown in this
8080 submatrix.
Most application developers expect this choice of storage format and corresponding SpMV
implementation to be optimal for this kind of matrix.
In practice, performance behavior can be rather surprising. Consider an experi-
ment in which we measure the performance in M
op/s of the blocked SpMV implementa-
tion described, coded in C, for all rc formats that would seem sensible for this matrix:
r; c 2 f1; 2; 4; 8g, for a total of 16 implementations in all. Figure 1.3 shows the observed
performance on six di erent cache-based superscalar microprocessor platforms, where we
have used the recent compilers and the most aggressive compilation options (the experi-
mental setup is described in Appendix B). For each platform, each rc implementation
is both shaded by its performance in M
op/s and labeled by its speedup relative to the
conventional unblocked (or 11) case. We make the following observations.
 As we argue in more detail in Section 3.1, we might reasonably expect the 88 perfor-
mance to be the best, with performance increasing smoothly as rc increases. How-
ever, this behavior is only nearly exhibited on the Sun Ultra 2i platform [Figure 1.3
(top-left)], and, to a lesser extent, on the Pentium III-M [Figure 1.3 (top-right)].
Instead, 88 performance is roughly the same as or slower than 11 performance:
Page 36
hidden
14
1:10 faster on the Power4 [Figure 1.3 (bottom-left)], but 9% worse on the Power3
[Figure 1.3 (middle-left)]. This behavior is not readily explained by register pressure
issues: the Power3 and Power4 both have 32
oating point registers but the smallest
88 speedups, while the Pentium III-M and Ultra 2i have the fewest registers (8 and
16, respectively) but the best 88 speedups.
 Choosing a block size other than 88 can yield considerable performance improve-
ments. For instance, 42 blocking on Itanium 2 is 2:6 faster than 88 blocking.
Considerable gains over the 11 performance are possible by choosing just the right
block size|here, from 1:22 up to 4:07, or up to 31% of peak on the Itanium 2.
 Furthermore, the fraction of peak with just the right blocking can exceed the 5{10%
of peak which is typical at 11.
 Performance can be a very irregular function of rc, and varies across platforms. It
is not immediately obvious whether there is a simple analytical model that can cap-
ture this behavior. Furthermore, though not explicitly shown here, the performance
depends on the structure of the matrix as well.
The characteristic irregularity appears to become worse over time, roughly speaking. The
platforms in Figure 1.3 are arranged from left-to-right, top-to-bottom, by best SpMV per-
formance achieved over all block sizes for this matrix. Furthermore, they happen to be
arranged in nearly chronological order by year of release as shown in Table 1.1. Though
we have argued that careful tuning is necessary to maintain performance growth similar to
that of Moore's Law, the problem of tuning|even in a seemingly straightforward case|is
a considerable and worsening challenge.
1.4 Summary, Scope, and Outline
Our central claim is that achieving and maintaining high performance over time for applica-
tion-critical computational kernels, given the current trends in architecture and compiler
development, requires a platform-speci c, search-based approach. The idea of generating
parameterized spaces of reasonable implementations, and then searching those spaces, is
modeled on what practitioners do when hand-tuning code. Automating this process has
proved enormously successful for dense linear algebra and signal processing. The intuition
Page 41
hidden
19
Saad's SPARSKIT library [267]. In subsequent chapters, performance in CSR format is the
baseline against which we compare our tuned implementations. Readers familiar with the
basic storage formats may wish to proceed directly to Section 2.2.
Throughout this chapter, we consider the SpMV operation y y + Ax, where A
is an mn sparse matrix with k non-zero elements, and x; y are dense vectors. We refer
to x as the source vector and y as the destination vector. Algorithmically, SpMV can be
de ned as follows, where ai;j denotes the element of A at position (i; j):
8ai;j 6= 0 : yi yi + ai;j  xj (2.1)
According to Equation (2.1), SpMV simply enumerates the non-zero elements of A, updating
corresponding elements of y. Each ai;j is touched exactly once, and the only reuse possible
on cache-based machines occurs in the accesses to the elements of x and y. If x and y|
but not A|were completely contained in cache, then the time to execute SpMV would be
dominated by the time to move the elements of A. Futhermore, there is very little work
(only 2
ops per matrix element) to hide the cost of reading A. Thus, it becomes extremely
important to reduce any overheads associated with storing the matrix.
Typical sparse matrix formats incur storage and instruction overheads per non-
zero element, since information is needed to keep track of which non-zero values have been
stored. One aim in selecting a data structure is to minimize these overheads. Roughly
speaking, the way to reduce these overheads is to recognize patterns in the arrangement
and numerical structure of non-zeros. Many of the basic formats surveyed in Section 2.1, as
well as the aggressive structure-exploiting techniques of this dissertation, reduce the data
structure overheads by making assumptions about the non-zero patterns.
The remainder of this chapter speci cally discusses formats included in Saad's
SPARSKIT library [267]. Our survey (Section 2.1) presents these formats, summarizing for
each data structure both the total storage and the most natural way in which to enumerate
the non-zeros for SpMV. In practice, users consider additional factors in choosing a sparse
storage format, including the following:
 What operations need to be performed on the data structure? For example, sparse
triangular solve (SpTS) has a particular value-dependency structure which requires
enumerating non-zeros in a certain order. Sparse LU factorization requires a dynamic
data structure that can support fast insertions of new non-zeros, to handle ll-in of
elements created during the factorization.
Page 50
hidden
28
Figure 2.7: Compressed sparse row (CSR) format. The elements of each row of A
are shaded using the same color. Each row of A is stored as a sparse vector, and all rows
(i.e., all sparse vectors) are stored contiguously in val and ind. The ptr array indicates
where each sparse vector begins in val and ind.
of a given row. An implementation of sparse matrix-vector multiply (SpMV) using this
format is as follows:
type val : real[k]
type ind : int[k]
type ptr : int[m+ 1]
1 foreach row i do
2 for l = ptr[i] to ptr[i+ 1] 1 do
3 y[i] y[i] + val[l]  x[ind[l]]
In the limit of k  m, there is only 1 integer index per non-zero instead of the 2 in the
COO implementation.
This implementation exposes the potential reuse of elements of y, since y[i] can be
kept in a register during the execution of the inner-most loop. In addition, since val and ind
are accessed with unit stride, it is possible to prefetch their values.3 However, other loop-
level transformations are more dicult to apply e ectively since the loop bounds cannot be
predicted statically. For instance, it is possible to unroll either loop, but the right unrolling
depth will depend on the number of non-zeros ptr[i+ 1] ptr[i] in each row i. Moreover,
to tile accesses to x for registers or caches requires knowledge of the run-time values of ind.
CSC is similar to CSR, except we store each column as a sparse vector as shown
Figure 2.8. The corresponding SpMV code is as follows:
3Indeed, the IBM and Intel compilers listed in Appendix B insert prefetch instructions on elements of
these arrays.
Page 51
hidden
29
Figure 2.8: Compressed sparse column (CSC) format. The elements of each column
of A are shaded using the same color. Each column of A is stored as a sparse vector, and
all columns (i.e., all sparse vectors) are stored contiguously in val and ind. The ptr array
indicates where each sparse vector begins in val and ind.
type val : real[k]
type ind : int[k]
type ptr : int[n+ 1]
1 foreach column j do
2 for l = ptr[j] to ptr[j + 1] 1 do
3 y[ind[l]] y[ind[l]] + val[l]  x[j]
The CSC implementation contains dependencies among accesses to y, which complicates
static analyses to detect parallelism.
The compressed sparse stripe formats can be generalized to store sparse vectors
along diagonals as well, but we do not know of any actual implementations in use.
Diagonal format
The diagonal (DIAG) format is designed for the important class of sparse matrices consisting
of some number of full non-zero diagonals.4 Since each diagonal is assumed to be full, we only
need to store one index for each non-zero diagonal, and no indices for the individual non-
zero elements. Furthermore, common operations with diagonals are amenable to ecient
implementation on vector architectures, provided the diagonals stored are suciently long.
DIAG generalizes the dense band format (Section 2.1.2) by allowing arbitrary diagonals to
be speci ed, not just diagonals adjacent to the main diagonal.
We number diagonals according to the following convention. A non-zero element
4A common source of such matrices arise in stencil calculations.
Page 54
hidden
32
Figure 2.11: ELLPACK/ITPACK format. Non-zero values of A are stored by row in
an ms array val, where s is the maximum number of non-zeros in any row of A. For each
val[i; j], the corresponding column index is given by ind[i; j].
on those zeros will be performed. Thus, this format best supports matrices in which the
number of non-zeros in all rows is close to s. SpMV in this format is as follows:
type val : real[m s]
type ind : int[m s]
1 foreach row i do
2 for p = 0 to s 1 do
3 y[i] y[i] + val[i; p]  x[ind[i; p]]
The loops may be interchanged, and on vector architectures with explicit gather support,
vectorization across either rows or columns is possible.
Jagged diagonal format
The jagged diagonal (JAD) format was designed to overcome the problem of variable length
rows/columns in the CSR/CSC formats. The performance of SpMV can be especially poor
on vector architectures in CSR format when the number of non-zeros per row is typically
less than the machine's vector length. The main idea behind JAD format is to reorder rows
of A so as to expose more opporunities to exploit data parallelism [266].
Storing A in JAD format consists of two steps, as illustrated in Figure 2.12. First,
the rows of A are logically permuted in decreasing order of non-zeros per row by a permu-
tation matrix P . In Figure 2.12 (top), the rst element of every row i is labeled by i; in
Page 56
hidden
34
Figure 2.12: Jagged diagonal format. In the jagged diagonal representation, the rows of
A (top) are logically permuted in decreasing order of number of non-zeros per row (bottom).
The permutation information is stored in perm. Elements shaded the same color belong to
the same \jagged diagonal." Each jagged diagonal is then stored as a sequence of sparse
vectors in val; ind as with CSR and CSC formats.
Skyline format
Skyline (SKY) format is a composite format which stores the strictly lower triangle of A
in CSR, the strictly upper triangle in CSC, and the diagonal is stored in an array. This
format was particularly convenient in early implementations for Gaussian elimination, i.e.,
computing the decomposition A = LU , where L is a lower triangular matrix and U is upper
triangular. We will not be interested in SKY in this dissertation, and we therefore refer the
reader to other discussions [107].
Block compressed stripe formats
The class of formats referred to as blocked compressed sparse stripe formats are designed to
exploit naturally occurring dense block structure typical of matrices arising in nite element
Page 57
hidden
35
method (FEM) simulations. For an example of a matrix amenable to block storage, recall
the FEM matrix of Figure 1.2 which consists entirely of dense 88 blocks. Conceptually,
the block compressed sparse stripe formats replace each non-zero in the compressed sparse
stripe format by an rc dense block.5 The case of r = c = 1 is exactly the compressed
stripe storage described in Section 2.1.4.
Here, we describe the block compressed sparse row (BCSR) format. The rc
BCSR format generalizes CSR: A is divided into

m
r

block rows, and each block row is
stored as a sequence of dense rc blocks. Figure 2.13 (top) shows an example of a 69
matrix stored in 23 BCSR. The values of A are stored in an array val of Krcrc elements,
where Krc is the number of non-zero blocks. The blocks are stored consecutively by row,
and each block may be stored in any of the dense formats for general matrices (e.g., row- or
column-major) described in Section 2.1.2. The starting column index of each block is stored
in an array ind, and the o set within ind of the rst index in each block row is stored in
ptr. Each block is treated as a full dense block, which may require lling in explicit zero
elements. We discuss the relationships among ll, the overall size of the data structure, and
performance when we examine the register blocking optimization based on BCSR format in
Chapter 3.
Blockings are not unique, as can be seen by comparing the last block row between
Figure 2.13 (top) and (bottom). Di erent libraries and systems have chosen di erent con-
ventions for selecting blocks. The original Sparsity system chose to always align blocks
so that the rst column j of each block was always chosen so that j mod c = 0 [164].
By contrast, the SPARSKIT library uses a greedy approach which scans each block row
column-by-column, starting a new rc block upon encountering the rst column containing
a non-zero.
The pseudo-code implementing SpMV using BCSR format is as follows, where we
have assumed that r divides m and c divides n:
5All implementations of which we are aware treat only square blocks, i.e., r = c. We present the
straightforward generalization, particularly in light of the experimental results of Section 1.3.
Page 58
hidden
36
type val : real[Krc  r  c]
type ind : int[Krc]
type ptr : int[mr + 1]
1 foreach block row I do
2 i0 I  r /* starting row */
3 Let y^ yi0:(i0+r1) /* Can store in registers */
4 for L = ptr[I] to ptr[I + 1] 1 do
5 j0 ind[L]  c /* starting column */
6 Let x^ xj0:(j0+c1) /* Can store in registers */
7 Let A^ ai0:(i0+r1);j0:(j0+c1)
/* A^ = block of A stored in val[(L  r  c) : ((L+ 1)  r  c 1)] */
8 Perform rc block multiply, y^ y^ + A^  x^
9 Store y^
where a : b denotes a closed range of integers from a to b inclusive. Since r and c are xed,
the block multiply in line 8 can be fully unrolled, and the elements of x and y can be reused
by keeping them in registers (lines 3 and 6). We discuss implementations of SpMV using
BCSR in more detail in Chapter 3.
Variable block row format
The variable block row (VBR) format generalizes the BCSR format by allowing block rows
and columns to have variable sizes. This format is more complex than the preceeding
formats. Moreover, SpMV in this format is dicult to implement eciently because, unlike
BCSR format, the block size changes in the inner-most loop, requiring a branch operation
if we wish to unroll block multiplies and keep elements of x and y in registers as in BCSR.
Indeed, we know of no implementations of SpMV in this format which are as fast as any of
the formats described in this chapter on our evaluation platforms.6 However, VBR serves
as a useful intermediate format in a technique for exploiting blocks of multiple sizes, as
discussed in Chapter 5.
We illustrate VBR format in Figure 2.14, where we show a mn=68 matrix A
containing k = 19 non-zeros. Consider a partitioning of this matrix into M = 3 block rows
6The two libraries implementing this format are SPARSKIT and the NIST Sparse BLAS [267, 258].
Page 59
hidden
37
Figure 2.13: Block compressed sparse row (BCSR) format. (Top) In a 23 BCSR
format, A is divided into

m
r

= 3 block rows, and each row is stored as a sequence of
23 blocks in an array val. There are K = 6 blocks total in this example. The elements
of a given block have been shaded the same color, and solid black dots indicate structural
non-zeros. To ll all blocks, explicit zeros have been lled in (e.g., the (0; 9) element).
Each block may be stored in any of the dense formats (e.g., row-major, column-major; see
Section 2.1.2). The column index of the (0; 0) element of each block is stored in the array
ind. The element ptr[I] is the o set in ind of the rst block of block row I. (Bottom)
Blockings are not unique. Here, we show a di erent blocking of the same matrix A (top).
and N = 4 block columns as shown, yielding K = 6 blocks, each shaded with a di erent
color. The VBR data structure is composed of the following 6 arrays:
 brow (length M + 1): starting row positions in A of each block row. The Ith block
row starts at row brow[I] of A, ends at brow[I + 1] 1, and brow[M ] = m.
 bcol (length N + 1): starting column positions in A of each block column. The J th
block column starts at column bcol[J ] of A, ends at bcol[J + 1]1, and brow[N ] = n.
 val (length k): non-zero values, stored block-by-block. Blocks are laid out by row.
Page 60
hidden
38
 val ptr (length K+1): starting o sets of each block within val. The bth block starts
at position val ptr[b] in the array val. The last element val ptr[K] = k.
 ind (length K): block column indices. The bth block begins at column bcol[ind[b]].
 ptr (length M + 1): starting o sets of each block row within ind. The Ith block row
starts at position ptr[I] in ind.
The pseudo-code for SpMV using VBR is as follows:
type brow : int[M + 1]
type bcol : int[N + 1]
type val : real[k]
type val ptr : int[K + 1]
type ind : int[K]
type ptr : int[M + 1]
1 foreach block row I do
2 i0 brow[I] /* starting row index */
3 r brow[I + 1] brow[I] /* row block size */
4 Let y^ yi0:(i0+r1)
5 for b = ptr[I] to ptr[I + 1] 1 do /* blocks within Ith block row */
6 J ind[b] /* block column index */
7 j0 bcol[J ] /* starting column index */
8 c bcol[J + 1] bcol[J ] /* column block size */
9 Let x^ xj0:(j0+c1)
10 Let A^ ai0:(i0+r1);j0:(j0+c1)
/* A^ = block of A stored in val[val ptr[b] : (val ptr[b+ 1] 1)] */
11 Perform rc block multiply, y^ y^ + A^  x^
12 Store y^
Unlike the BCSR code, r and c are not xed throughout the computation, making it dicult
to unroll line 11 in the same way that we can unroll the block computation in the BCSR
code. In particular, we would need to introduce branches to handle di erent xed block
sizes. The implementation in SPARSKIT uses 2-nested loops to perform the block multiply
[267]. We would not expect VBR to perform very well due to the overheads incurred by
these loops.
Page 61
hidden
39
Figure 2.14: Variable block row (VBR) format. We show an example of a sparse
matrix A with k = 19 non-zeros. A is logically partitioned into M = 3 block rows, N = 4
block columns, yielding K = 6 non-zero blocks. The starting positions of each block row
and block column are stored in brow and bcol, respectively. Non-zero values are stored in
val, and the starting positions of each block of values are stored in val ptr. Block column
indices are stored in ind, and the beginning of the indices belonging to a given block row
are stored in ptr.
2.2 Experimental Comparison of the Basic Formats
This section compares implementations of SpMV using a subset of the formats described
in Section 2.1. We sometimes refer to these implementations collectively as the \baseline
implementations." We make our comparisons across a variety of matrices and machine
architectures. The high-level conclusions of these experiments are as follows:
 CSR and MSR formats tend to have the best performance on a wide class of matrices
and on a variety of superscalar architectures, among the basic formats considered (and,
in particular, omitting the BCSR format.) Thus, either of these formats would appear
to be a reasonable default choice if the user knows nothing about the matrix structure.
In subsequent chapters, \reference" performance always refers to CSR performance.
 Comparing across architectures, we show that for the most part, none of the basic
formats yield signi cantly more than 10% of peak machine speed. This observation,
coupled with the results of Section 1.3 arguing for search-based methods, motivate
our aggresive exploitation of matrix structure.
Page 62
hidden
40
We review the experimental setup (Section 2.2.1) before presenting and discussing the ex-
perimental results (Section 2.2.2).
2.2.1 Experimental setup
Our experimental method, as with all the experiments of this dissertation, follows the
discussion in Appendix B. When referring to a \platform," we refer to both a machine and
a compiler. Thus, measurements re
ect both characteristics of the machine architecture
plus the quality of the compiler's code generation. In these experiments, we make an e ort
to use the best compiler
ags, pragmas, and keywords that enable vectorization where
possible, and in the C implementations, to eliminate false dependencies due to aliasing.
The baseline implementations are taken from the SPARSKIT library. We consider
both the Fortran implementations (as written in the original SPARSKIT library) along
with C implementations. (Our C implementations are manual translations of the Fortran-
based SPARSKIT code.) As discussed in Appendix B, all matrix values are stored in IEEE
double-precision (64-bit)
oating point, and all indices are stored as 32-bit integers. We
compare the CSR, CSC, DIAG, ELL, MSR, and JAD formats. We omit the SKY format
since it is functionally equivalent to CSR and CSC. We omit comparison to the COO and
VBR format because these implementations were considerably slower (by roughly up to an
order of magnitude) than the worst of the formats considered. (In the case of VBR, refer
to the discussion about unrolling in Section 2.1.4.)
None of the matrices in this suite consist of only full diagonals. Therefore, our
implementation of the DIAG format actually splits the matrix into the sum A = A1 + A2,
where A1 is stored in diagonal format, A2 is stored in CSR format, and the non-zero
structures of A1 and A2 are disjoint. Our criteria for storing a given diagonal of A in
A1 are that (1) the length diagonal must be no less than 85% of the dimension of A,
and (2) that diagonal itself must be 90% full. The rst condition keeps diagonal storage
manageable. For instance, an nn permutation matrix would require n2 storage in the
worst case in DIAG format were no such condition imposed. The second condition ensures
that replacing a \mostly full" diagonal still leads to a reduction in the total storage. In
particular, a diagonal of length n which is 90% full requires that we store at least (8 bytes+
4 bytes) :9n = 10:8n bytes in a CSR-like format (1 64-bit real + 1 32-bit int per non-zero),
but only approximately 8n bytes in diagonal format. Thus, the CSR-like format requires
Page 63
hidden
41
10:8=8 = 1:35 more storage.
Recall that JAD includes a logical row permutation, and is otherwise roughly
equivalent to CSC and CSR formats (Section 6). The JAD implementation we consider
includes a permutation of y on input and an inverse permutation of y on output. The
cost of these permutations is re
ected in the performance data we report; in practice, if
SpMV is performed many times, these permutations could be moved to occur before the
rst multiplication and after the last, thereby amortizing the permutation cost.
The implementation of the BCSR format is complicated by issues of how to choose
the block size and handle explicit ll. Section 1.3 alludes to these diculties but demon-
strates that it is possible to achieve signi cantly more than 10% of peak machine speed
by choosing an appropriate implementation. We consider such an implementation, which
includes the choice of a possibly non-square block size, to be among our proposed opti-
mizations, especially since non-square block sizes have been only addressed in any depth by
Sparsity[165, 164]. We therefore defer detailed analyses of BCSR performance to Chap-
ter 3.
2.2.2 Results on the Sparsity matrix benchmark suite
We present the observed performance over the matrices and platforms shown in Appendix B
in Figures 2.15{2.18. All gures show both absolute performance (in M
op/s) and the
equivalent performance normalized to machine peak on the y-axis. Matrices which are
small relative to the largest cache on each platform have been omitted (see Appendix B).
For ease of comparison across platforms, the fraction of peak always ranges from
0 to 12.5%. To aid comparisons among matrices, we divide the matrices into ve classes:
1. Matrix 1: A dense matrix stored in sparse format.
2. Matrices 2{9: Matrices from FEM applications. The non-zero structure of these
matrices tends to be dominated by a single square block size, and all blocks are
uniformly aligned.
3. Matrices 10{17: These matrices also arise in FEM applications, and possess block
structure. However, the block structure consists of a larger mix of block sizes than
matrices 2{9, or have irregular alignment of blocks.
Page 66
hidden
44
in Basic Linear Algebra Subroutines (BLAS) lingo [203]). If the matrix is structurally sym-
metric and there are p non-zeros per row/column, then each dot-product will perform 2p
loads, to the row of A and vector x, and 1 store to y; by contrast, the AXPY requires
2p loads of a column of A and elements of y, interleaved with p stores to y. Thus, the
performance di erence could re
ect artifacts of the architecture which cause even cached
stores to incur some performance hit.
The performance of the JAD format is generally worse than that of CSC. Recall
that JAD implementation is similar to the CSC implementation, and that our implemen-
tation of JAD includes the cost of two permutations of the destination vector y. Thus,
the di erence in performance between JAD and CSC may re
ect these permutation costs,
which could be amortized over many SpMV operations. Nevertheless, we would not expect
JAD to be faster than CSC on superscalar architectures.
The ELL implementation generally yields the worst performance of all formats.
Recall that ELL is best suited to matrices with an average number of non-zeros per row
nearly equal to the maximum number of non-zeros per row. This condition is really only
true for the Matrix 1 (dense) and Matrices 2 and 11 (in both, 93% of rows are within
1 non-zero of the maximum number of non-zeros per row). The performance di erence
between ELL performance on Matrices 1, 2, and 11, and performance all other matrices
re
ects this fact. Nevertheless, ELL performance is still worse than CSR/MSR even on the
dense matrix, so at least on superscalar architectures, there is no reason to prefer a pure
ELL implementation over CSR.
While ELL and JAD formats were usually the worst formats, there are a number
of notable exceptions on the Itanium 1 platform. On Matrix 1 (dense), ELL and JAD were
1:38 faster than CSR, and on Matrix 11, ELL was 1:47 faster than CSR. Otherwise, ELL
was only marginally faster than CSR. We do not at present know why ELL performance
was only competitive with CSR in a few cases on the Itanium 1 only, and to a lesser extent
in a few cases on Itanium 2. These ELL results suggest that on platforms like the Itanium
1, there could be an appreciable bene t to a hybrid ELL/CSR format, in which we use
ELL format to store a subset of the non-zeros satisfying the ELL uniform non-zeros per row
assumption, and the remainder of the non-zeros in CSR (or another appropriate) format.
Page 68
hidden
46
processors (Section 2.2). We further conclude that CSR and MSR formats are reasonable
default data structures when nothing else is known about the input matrix structure. In-
deed, CSR storage is the base format in the PETSc scienti c computing library [27], as well
as SPARSKIT [267]. However, our analysis is restricted to platforms based on cache-based
superscalar architectures, while several of the surveyed formats were designed with vector
architectures in mind.
Page 70
hidden
48
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
0.12
fraction of peak
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 18 20 21 23 24 25 26 27 28 29 36 37 40 41 42 440
10
20
30
40
50
60
70
80
90
100
110
120
Matrix No.
Pe
rfor
ma
nce
(Mf
lop/
s)
Comparison of SpMV Performance using Baseline Formats [pentium3−linux−icc]
CSRCSCModified CSRDiagonalEllpack/ItpackJagged diagonal
0
0.0063
0.0125
0.0187
0.025
0.0312
0.0375
0.0437
0.05
0.0563
0.0625
0.0688
0.075
0.0813
0.0875
0.0938
0.1
0.1062
0.1125
0.1187
0.125
fraction of peak
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 18 20 21 23 24 25 26 27 28 29 36 37 40 41 42 440
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
Matrix No.
Pe
rfor
ma
nce
(Mf
lop/
s)
Comparison of SpMV Performance using Baseline Formats [pentium3M−linux−icc7]
CSRCSCModified CSRDiagonalEllpack/ItpackJagged diagonal
Figure 2.16: SpMV performance using baseline formats on Matrix Benchmark
Suite #1: Intel Pentium III (top) and Pentium III-M (bottom) platforms. This
data is also tabulated in Appendix C.
Page 71
hidden
49
0
0.0067
0.0133
0.02
0.0267
0.0333
0.04
0.0467
0.0533
0.06
0.0667
0.0733
0.08
0.0867
0.0933
0.1
0.1067
0.1133
0.12
fraction of peak
1 2 4 5 7 8 9 10 12 13 15 400
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
Matrix No.
Pe
rfor
ma
nce
(Mf
lop/
s)
Comparison of SpMV Performance using Baseline Formats [power3−aix]
CSRCSCModified CSRDiagonalEllpack/ItpackJagged diagonal
0
0.0096
0.0192
0.0288
0.0385
0.0481
0.0577
0.0673
0.0769
0.0865
0.0962
0.1058
0.1154
0.125
fraction of peak
1 8 9 10 400
50
100
150
200
250
300
350
400
450
500
550
600
650
Matrix No.
Pe
rfor
ma
nce
(Mf
lop/
s)
Comparison of SpMV Performance using Baseline Formats [power4−aix]
CSRCSCModified CSRDiagonalEllpack/ItpackJagged diagonal
Figure 2.17: SpMV performance using baseline formats on Matrix Benchmark
Suite #1: IBM Power3 (top) and Power4 (bottom) platforms. This data is also
tabulated in Appendix C.
Page 72
hidden
50
0
0.0063
0.0125
0.0187
0.025
0.0312
0.0375
0.0437
0.05
0.0563
0.0625
0.0688
0.075
0.0813
0.0875
0.0938
0.1
0.1062
0.1125
0.1187
0.125
fraction of peak
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 21 25 27 28 36 40 440
20
40
60
80
100
120
140
160
180
200
220
240
260
280
300
320
340
360
380
400
Matrix No.
SpM
×V
Pe
rfor
ma
nce
(Mf
lop/
s)
Comparison of Reference Sparse Formats on Application Matrices [itanium−linux−ecc7]
CSRCSCModified CSRDiagonalEllpack/ItpackJagged diagonal
0
0.0069
0.0139
0.0208
0.0278
0.0347
0.0417
0.0486
0.0556
0.0625
0.0694
0.0764
0.0833
0.0903
0.0972
0.1042
0.1111
0.1181
0.125
fraction of peak
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 21 25 27 28 36 40 440
25
50
75
100
125
150
175
200
225
250
275
300
325
350
375
400
425
450
Matrix No.
Pe
rfor
ma
nce
(Mf
lop/
s)
Comparison of SpMV Performance using Baseline Formats [itanium2−linux−ecc7]
CSRCSCModified CSRDiagonalEllpack/ItpackJagged diagonal
Figure 2.18: SpMV performance using baseline formats on Matrix Benchmark
Suite #1: Intel Itanium 1 (top) and Itanium 2 (bottom) platforms. This data is
also tabulated in Appendix C.
Page 75
hidden
53
estimated for all rc.
3. Evaluate a performance model that combines the benchmarking data with the ll
ratio estimates to select r and c.
The costs of steps 2 and 3 are particularly important to minimize because they occur at
run-time. Our Version 2 heuristic follows the above procedure, but improves steps 2 and
3 as described in Section 3.2. These improvements lead to increased prediction accuracy
on recent machines like the Itanium 1 and Itanium 2, where Version 1 heuristic chooses
an implementation that can achieve as little as 60% of the best possible performance over
all r and c. The Version 2 heuristic in contrast frequently selects the best block size, and
yielding performance that is nearly always 90% or more of the best. In addition, we show
that the run-time costs of the Version 2 heuristic implementation of steps 2 and 3 is between
1{11 unblocked SpMVs, depending on the platform and matrix, on 8 current platforms and
44 test matrices. This cost is the penalty for evaluating the heuristic if it turns out no
blocking is required, and we discuss ways in which future work could try to reduce these
costs. However, if blocking is bene cial, we show the overall block size selection procedure
including conversion is never more than 42 11 SpMVs on our test machines and matrices,
meaning the cost of evaluating the heuristic is modest compared to the cost of conversion.
As noted by Im, et al., Sparsity's hybrid o -line/run-time tuning methodology
for register block size selection is an instance of a more general framework [165]. Indeed,
we adapt this framework in subsequent chapters to other kernels and optimizations (e.g.,
sparse triangular solve (SpTS) in Chapter 6 and sparse ATA  x (SpATA) in Chapter 7),
thereby demonstrating the e ectiveness and applicability of the approach.
Some of the material in this chapter also appears in recent papers [316, 165]. This
chapter discusses the costs of tuning in more detail.
3.1 Register Blocked Sparse Matrix-Vector Multiply
We begin with a review of register blocked sparse matrix-vector multiply (SpMV), as im-
plemented by the Sparsity system [167, 164]. The Sparsity implementation of register
blocking uses block compressed sparse row (BCSR) format as the base storage format (Sec-
tion 3.1.1). We present simple analyses and experimental results to help build the reader's
intuition for what performance behavior we might reasonably expect in contrast to what
Page 76
hidden
54
we observe in practice. This section revisits in greater detail the introductory argument of
Section 1.3 that motivated the use of empirical search-based methods for tuning.
3.1.1 Register blocking overview
Consider the SpMV operation y y+Ax, where A is an mn sparse matrix stored in BCSR
format. BCSR is generally presented assuming square block sizes, but here we consider the
generalization of early descriptions of BCSR from square block sizes to arbitrary rectangular
rc blocks, as proposed in Sparsity [164] and described in Section 6. BCSR is designed
to exploit naturally occuring dense blocks by reorganizing the matrix data structure into a
sequence of small (enough to t in register) dense blocks.
BCSR logically divides the matrix into mr 
n
c submatrices, where each submatrix is
of size rc. Assume for simplicity that r divides m and that c divides n. Only those blocks
which contain at least one non-zero are stored, and blocks within the same block row are
stored consecutively. We assume the Sparsity convention in which each block is stored in
row-major order. The SpMV computation proceeds block-by-block: for each block, we can
reuse the corresponding c elements of the source vector x and r elements of the destination
vector y by keeping them in registers, assuming a sucient number is available. The case
of r = c = 1 is precisely the case of compressed sparse row (CSR) format.
The C implementation of SpMV using 23 BCSR appears in Figure 3.1 (bottom).
The matrix is represented by three arrays: Aval, Aind, and Aptr. Here, Aval is an array
storing all the values, Aind stores the column indices for each block, and Aptr stores the
starting position within Aind of each block row. For reference, we show a conventional
dense matrix-vector multiply code in Figure 3.1 (top) where A is assumed to be stored in
row-major order, and we use a C-style pointer idiom to traverse the elements of A (A++ in
line D3 and A[0] in line D4). The line numbers for both the dense and sparse codes are
labeled to show the correspondence between the two implementations. For example, where
the dense code loops over each column j within each row i (top, line D3), the sparse code
loops over 23 blocks within each block row beginning at row i, column j (lines S3a{b).
The multiplication by each 23 block in the sparse code is fully unrolled, and the c = 3
corresponding elements of the source vector elements (x0,x1,x2) are each reused r = 2
times for each block. The destination vector elements (y0,y1) can be kept in registers
while processing all of the blocks within a block row, i.e., throughout the execution of the
Page 80
hidden
58
storage to blocked storage, can be approximated as follows:
V1;1 (A)
Vrc (A)

3
2k
kfrc

1 + 12rc
 =
3
2

1
frc

1 + 12rc
 (3.2)
Thus, the maximum compression ratio is 32 if we can choose a suciently large block size
without any ll. A corollary is that in order to maintain the same amount of storage as
the unblocked case, we can tolerate ll ratios of at most 32 . This observation explains why
storage increased only by a modest amount in the example of Figure 3.2.
In Sparsity, register blocking is implemented by (1) a special code generator to
output the rc code shown in Figure 3.1, and (2) a heuristic for selecting r and c, given the
matrix. We defer a discussion of and subsequent improvement to the Sparsity Version 1
heuristic described by Im [164, 167] to Section 3.2.
3.1.2 Surprising performance behavior in practice
By analogy to tiling in the dense case, the most dicult aspect of applying register blocking
is knowing on which matrices to apply it and how to select the block size. This fact
is illustrated for a particular sparse matrix in the example of Section 1.3 (Figure 1.3).
There, we see experimentally the surprising performance behavior as the block size varies,
motivating our use of automated empirical search.
What may be even more surprising is that the irregular behavior occurs even if
we repeat the same experiment on a very \regular" sparse problem: the case of a dense
matrix stored in BCSR format, as we show below. The implication is that irregular mem-
ory access alone does not explain irregularities in performance. Below, we argue that the
performance behavior we might reasonably expect can di er considerably from what we
observe in practice.
In the case of a dense matrix in BCSR format, there is no ll, assuming the
dimensions of the matrix are either suciently large or a multiple of the block size. We could
reasonably expect performance to increase smoothly with increasing r; c for the following
reasons:
 The storage overhead decreases with increasing rc. We only need to store 1 integer
index per block of rc non-zero values.
 The instruction overhead per
op decreases with increasing rc. Because we have
unrolled the rc block multiply, the innermost loop contains a constant number of
Page 83
hidden
61
21.4% of peak on the Pentium III compared to 15.2% on the Pentium III-M. Indeed,
the improvement in peak moving from the Pentium III to the Pentium III-M is not
matched by an equivalent improvement in performance. The peak performance of
the Pentium III-M is 1:6 faster than the Pentium III (Table 3.1), but the ratio of
the maximum performance is only 122=107  1:14 times faster, and the ratio of the
median performance data is 120=88  1:36 times faster.
Also consider di erences between the Power3 and Power4 platforms (Figure 3.5):
the Power3 performance is nearly symmetric with respect to r and c|compare the
upper-left corner, where r > c, to the lower-right corner, where r < c. In contrast,
performance is higher on the Power4 when r > c compared to the case of r < c.
Performance on the Itanium 1 and Itanium 2 platforms (Figure 3.6) is characteris-
tically similar in the sense that (a) performance is best when c is 2 or less and at
particular values of r, (b) there is swath of values of r and c in which performance
is worse than or comparable to the 11 performance, and (c) performance increases
again toward the upper-right corner of the plot (rc & 16). However, choosing the
right block size is much more critical on the Itanium 2, where the maximum speedup
is 4:07 (at 42), compared to 1:55 (at 41) on the Itanium 1.
 Signi cant performance improvements are possible, even compared to tuned dense
matrix-vector multiply (DGEMV). (A list of DGEMV implementations to which we
compare are described in Appendix B.) As shown in Table 3.1, the best SpMV per-
formance is typically close to or in excess of tuned DGEMV preformance, the notable
exception being the Ultra 3. On the other platforms, this observation indicates that
a reasonable but coarse bound on SpMV performance is DGEMV performance, pro-
vided the sparse matrix possesses exploitable dense block structure. The two cases in
which SpMV is faster than DGEMV (Ultra 2i and Pentium III) indicates that these
DGEMV implementations can most likely be better tuned.
On the Ultra 3, DGEMV runs at 17% of machine peak, compared to the best SpMV
performance running at 5% peak. Nevertheless, blocked SpMV performance is about
1:8 faster than the 11 performance, indicating there is some value to tuning.
These results for a dense matrix in sparse format rearm the conclusions of Section 1.3.
The main di erence in the example we have just considered is that we have eliminated the
Page 88
hidden
66
Platform
Processor Dense Matrix Register Pro le
Compiler Peak DGEMM DGEMV 11 Best Median
(no. of regs) M
op/s M
op/s M
op/s M
op/s M
op/s rc M
op/s
Ultra 2i
Sun cc v6 667 425 59 36 73 68 60
(16) [.637] [.088] [.053] [.109] [.089]
Ultra 3
Sun cc v6 1800 1600 311 50 90 1212 82
(32) [.888] [.172] [.027] [.050] [.045]
Pentium III
Intel C v7 500 330 58 42 107 210 88
(8) [.660] [.116] [.084] [.214] [.176]
Pentium III-M
Intel C v7 800 640 147 59 122 812 120
(8) [.800] [.183] [.073] [.152] [.15]
Power3
IBM xlc v7 1500 1300 260 164 256 44 198
(32) [.866] [.173] [.109] [.170] [.132]
Power4
IBM xlc v7 5200 3500 915 595 819 108 712
(32) [.673] [.175] [.114] [.157] [.136]
Itanium 1
Intel C v7 3200 2100 315 161 250 41 178
(128) [.656] [.098] [.050] [.078] [.055]
Itanium 2
Intel C v7 3600 3500 1330 295 1200 42 451
(128) [.972] [.369] [.081] [.333] [.125]
Table 3.1: Summary of SpMV register pro les (dense matrix). We summarize a
few machine characteristics (no. of double-precision
oating point registers, peak machine
speed, and performance of tuned double-precision dense matrix-matrix and matrix-vector
multiply routines, DGEMM and DGEMV) and data from Figures 3.3{3.6. Performance
data are shown in M
op/s, with fraction of peak shown in square brackets.
run-time source of irregular memory access patterns. Thus, the irregularity of performance
behavior as a function of r and c is not due only to irregular memory access. To make a
direct comparison between sparse pro les of Figure 1.3 and the dense pro les of Figures 3.3{
3.6, consider the set of rc values such that r; c 2 f1; 2; 4; 8g. The absolute performance
data is lower in sparse pro les than in the dense. However, the performance relative to the
11 performance in each case is qualitatively similar, though not exactly the same.
Instead, the complexity of performance we have observed most likely re
ects both
Page 89
hidden
67
the overall complexity of the underlying hardware and the diculty of optimal instruction
scheduling. Even for earlier, simpler pipelined RISC architectures, it is well-known that
the o -line (compiler) problem of optimally scheduling a basic block is NP-complete [155].
Thus, any sequence of instructions emitted by the compiler is most likely an approximation
to the best possible schedule (which will in turn depend on load latencies that vary with
where in the memory hierarchy data resides). However, what is encouraging about these
data is that, compared to DGEMV, reasonably good performance for SpMV appears to
nevertheless be possible, provided the right tuning parameters (a good block size) can be
selected.
3.2 An Improved Heuristic for Block Size Selection
Both the Sparsity Version 1 heuristic and Version 2 heuristic are based on the idea that
the data of Figures 3.3{3.6 captures the irregularities of performance as r and c vary, and
that the ll ratio quanti es how many extra
ops will be performed due to explicit zeros.
The Version 2 heuristic is as follows:
1. Once per machine, compute the register (blocking) pro le, or the set of observed SpMV
performance values (in M
op/s) for a dense matrix stored in sparse format, at all block
sizes from 11 to 1212. Denote the register pro le by fPrc (dense) j1  r; c  12g.
2. When the matrix A is known at run-time, compute an estimate f^rc (A; ) of the true
ll ratio frc(A) for all 1  r; c; 12. Here,  is a user-selected parameter ranging from
0 to 1 which controls the accuracy of the estimate, as we describe in Section 3.2.1.
(Our ll estimation procedure ensures that f^rc (A; 1) = frc(A).)
3. Choose r; c that maximizes the following estimate of register blocking performance
P^rc (A; ),
P^rc (A; ) =
Prc (dense)
f^rc (A; )
(3.3)
Although we refer to Equation (3.3) as a performance estimate, our interest is not to predict
performance precisely, but rather to use this quantity to compute a relative ranking of block
sizes.
Page 91
hidden
69
EstimateFill( A, r, cmax,  ):
1 Initialize array Num blocks[1 : cmax] 0 /* no. of blocks at each 1  c  cmax */
2 Initialize nnz visited 0 /* no. of non-zeros visited */
3 repeat max(1; 

m
r

) times
4 Choose a block of r consecutive rows i0 : i0 + r 1 in A with i0 mod r = 0
5 Initialize array Last block index[1 : cmax] 1
6 foreach non-zero A(i; j) 2 A (i0 : i0 + r 1; :) in column-major order do
7 nnz visited nnz visited + 1
8 foreach c 2 1 : cmax do
9 if
j
j
c
k
6= Last block index[c] then
/* A(i; j) is the rst non-zero in a new block */
10 Last block index[c]
j
j
c
k
11 Num blocks[c] Num blocks[c] + 1
12 returnf^rc (A; ) Num blocks[c]  r  c=nnz visited
Figure 3.7: Pseudocode for a ll ratio estimation algorithm. The inputs to the
algorithm are the mn matrix A, what fraction  of the matrix to sample, a particular row
block size r, and a range of column block sizes from 1 to cmax. The output is the ll ratio
estimate f^rc (A; ) for all 1  c  cmax.
e.g., \1 : n" is short-hand for 1; 2; : : : ; n.) To compute the ll ratios for all rmax  cmax block
sizes, we call the procedure for each r between 1 and rmax.
EstimateFill loops over 

m
r

block rows of A (line 3), maintaining an array
Num blocks[1 : cmax] (line 1). Speci cally, Num blocks[c] counts the total number of rc
blocks needed to store the block rows scanned in BCSR format. We discuss block row
selection (line 4) in more detail below. EstimateFill enumerates the non-zeros of each
block row (line 6) in \column-major" order: matrix entries are logically ordered so that
entry (i; j) < (i + 1; j) and (m; j) < (0; j + 1), assuming zero-based indices for A, and
entries are enumerated in increasing order. This ordering ensures that all non-zeros within
the block row in column j are visited before those in column j + 1. To implement this
enumeration eciently, we require that the column indices of A within each row of CSR
format be sorted on input. For each non-zero A(i; j) and block column size c (line 8), if
A(i; j) does not belong to the same block column as the previous non-zero visited (line 9),
Page 92
hidden
70
then we have encountered the rst non-zero in a new rc block (lines 10{11). We compute
and return the ll ratios for all c based on the block counts Num blocks[c] and the total
non-zeros visited (line 12).
There are a variety of ways to choose the block rows (line 4). For instance, block
rows may be chosen uniformly at random, but must be done so without replacement to
ensure that the estimates converge to truth when  = 1. The Version 1 heuristic examined
every

1


-th block row for each r. We adopt the Version 1 heuristic convention in the
Version 2 heuristic to ensure repeatability of the experiments.
The following is an easy way to enumerate the non-zeros of a given block row of
a CSR matrix in column-major order, assuming sorted column indices. We maintain an
integer array Cur index[1 : r] which keeps a pointer to the current column index in each
row, starting with the rst. At each iteration, we perform a linear search of Cur index[1 : r]
to select the non-zero A(i; j) with the smallest column index, and update the corresponding
entry Cur index[i]. The asymptotic cost of selecting a non-zero is O(r).
Since the Version 1 heuristic only needs ll estimates at r1 and 1c block sizes,
a simpler algorithm was used in the original Sparsity implementation. However, the
two algorithms return identical results on these block sizes, given the same convention for
selecting block rows.
Asymptotic costs
The asymptotic cost of executing the procedure EstimateFill shown in Figure 3.7 for all
1  r  rmax is O(krmaxcmax), where k is the number of non-zeros in A. To simplify the
analysis, assume that the number of rows m  k, and that every r  rmax divides m, and
that rmax  O(cmax).
First, consider a single execution of EstimateFill for a xed value of r, and for
simplicity further assume that r divides m. The total cost is dominated by the time to
execute lines 9{11 in the innermost loop, which each have O(1) cost. The outermost loop
in Figure 3.7 (line 3) executes mr times, assuming 
m
r  1. The loop in line 6 executes
approximately r km times on average, assuming
k
m non-zeros per row. Thus, lines 9{11 will
execute cmax  r km  
m
r = kcmax times. To execute EstimateFill for all 1  r  rmax
will therefore incur a cost of O(k  rmaxcmax). This cost is linear with respect to k, and
therefore has the same asymptotic costs as SpMV itself.
Page 95
hidden
73
10−4 10−3 10−2 10−1 100
0.60.65
0.70.75
0.80.85
0.90.95
11.05
Convergence of Prediction: Version 2 Heuristic [Pentium III−M]
Matrix 9
Ref
10−4 10−3 10−2 10−1 100
0.60.65
0.70.75
0.80.85
0.90.95
11.05
Pe
rfo
rm
anc
e (f
ract
ion
of b
est)
Matrix 10
Ref
10−4 10−3 10−2 10−1 100
0.60.65
0.70.75
0.80.85
0.90.95
11.05
Fraction of matrix sampled (σ)
Matrix 40
Ref
10−3 10−2 10−1 100
10−1
100
101
102
Fraction of matrix sampled (σ)
Tim
e (u
nblo
cke
d S
pMV
s)
Cost of Fill Estimation: Version 2 Heuristic [Pentium III−M]
Matrix 9
Matrix 10
Matrix 40
Figure 3.9: Accuracy and cost trade-o example: Matrices 9, 10, and 40 on
Pentium III-M. (Top) Performance of the implementation chosen by the heuristic as 
varies. We show data for three test matrices, where performance (y-axis) is shown as a
fraction of the best performance over all 1  r; c  12. (Bottom) Cost of executing the
heuristic as  varies. Time (y-axis) is shown as multiples of the time to execute a single
unblocked (11) SpMV on the given matrix. These data are tabulated in Appendix D.
Page 99
hidden
77
3. At  = :01, the cost of executing the heuristic is between 1 and 10 SpMVs. This can
be seen by observing the bottom plot of Figures 3.8{3.11. We conclude that this value
of  is likely to have a reasonable cost on most platforms.
4. The predictions have stabilized by  = :01 in all but one instance. The predictions
tend to be the same after this value of . The exception is Matrix 40 on the Itanium
2, where the predictions do not become stable until   :07. Examining the bottom
plots of Figures 3.8{3.11, we see that the cost at this value of  is about 11 SpMVs,
but can range from 20{40 SpMVs on the other three platforms.
There are many ways to address the problem of how to choose  in a platform and matrix-
speci c way. For instance, we could monitor the stability of the predictions as more of the
matrix is sampled, while simultaneously monitoring the elapsed time so as not to exceed a
user-speci ed maximum. (Con dence interval estimation is an example of a statistical tech-
nique which could be used to monitor and make systematic decisions regarding prediction
stability [260].) However, in the remainder of this chapter, we settle on the use of  = :01
on all platforms, where the observations above justify this choice as a reasonable trade-o
between cost and prediction accuracy.
3.3 Evaluation of the Heuristic: Accuracy and Costs
This section evaluates the overall accuracy and total run-time cost of tuning using the
Version 2 heuristic. We implemented the Version 2 heuristic according to the guidelines
and results of Section 3.2, and evaluated the heuristic against exhaustive search on the 8
platforms and 44 test matrices listed in Appendix B. This data leads us to the following
empirical conclusions:
1. The Version 2 heuristic nearly always chooses an implementation within 10% of the
best implementation found by exhaustive search in practice (Section 3.3.1). The sole
exception is Matrix 27 on Itanium 1, for which the heuristic selects an implementation
which is 86% as fast as the best by exhaustive search.
In addition, we nd that even exact knowledge of the ll ratio ( = 1) does not lead
to signi cantly better predictions, con rming that our choice of  = :01 is reasonable.
counters (as well as cache miss statistics), thus providing one portable way to use an accurate timer [60]. In
addition, the most recent revision of the FFTW package (FFTW 3) also contains a standard interface just
for reading the cycle counter, and is available on many additional platforms [123].
Page 100
hidden
78
2. The total cost of tuning, including execution of the heuristic and conversion to blocked
format, is at most 43 unblocked SpMV operations in practice (Section 3.3.2). This
total cost depends on the machine, and can even be as low as 5{6 SpMVs (Ultra 3,
Pentium III, and Pentium III-M).
Our implementation of the heuristic includes a run-time check in which the unblocked SpMV
routine is also executed once to ensure that blocking is pro table. This additional execution
is re
ected in reported costs for the heuristic.
For each platform, we omit matrices which t in the largest cache level, following
the methodology outlined in Appendix B. The matrices are organized as follows:
 Matrix 1 (D): Dense matrix stored in sparse format, shown for reference.
 Matrices 2{9 (FEM 1): Matrices from FEM simulations. The majority of non-zeros in
these matrices are located in blocks of the same size, and these blocks are uniformly
aligned on a grid, as shown by solid black lines in Figure 3.2 (right).
 Matrices 10{17 (FEM 2): Matrices from FEM simulations where a mixture of block
sizes occurs, or the blocks are not uniformly aligned, or both.
 Matrices 18{39 (Other): Matrices from non-FEM applications which tend not to have
much if any regular block structure.
 Matrices 40{44 (LP): Matrices from linear programming applications which also tend
not to have regular block structure.
The structural properties of the matrices are discussed in more detail in Chapter 5 and
Appendix F.
This discussion focuses on the accuracy and cost of the heuristic. We revisit this
performance data when comparing absolute performance to our upper bounds performance
model in Chapter 4.
3.3.1 Accuracy of the Sparsity Version 2 heuristic
Figures 3.12{3.15 summarizes how accurately the Version 2 heuristic predicts the optimal
block size. For each platform and matrix, we show

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

25 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
44% Ph.D. Student
 
8% Researcher (at an Academic Institution)
 
8% Assistant Professor
by Country
 
32% United States
 
12% China
 
8% France