On the Limits of GPU Acceleration

by Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, M Guney, A Shringarpure
Science (2010)

Abstract

This paper throws a small "wet blanket" on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations, namely (a) iterative sparse linear solvers, (b) sparse Cholesky factorization, and (c) the fast multipole method, exhibit complex behavior and vary in computational intensity and memory reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance, but we find that, given at least equal-effort CPU tuning and realistic workloads and calling contexts, two modern quad-core CPU sockets can roughly match one or two GPUs in performance. Our conclusions are not intended to dampen interest in GPU acceleration; on the contrary, they should do the opposite: they partially illuminate the boundary between CPU and GPU performance, and ask architects to consider application contexts in the design of future coupled on-die CPU/GPU processors.

Richard Vuduc†, Aparna Chandramowlishwaran†, Jee Choi, Murat (Efe) Guney, Aashay Shringarpure‡
Georgia Institute of Technology
† School of Computational Science and Engineering
School of Electrical and Computer Engineering
School of Civil and Environmental Engineering
‡ School of Computer Science
{richie,aparna,jee,efe,aashay.shringarpure}@gatech.edu
1 Our Position and Its Limitations
We have over the past year been interested in the anal-
ysis, implementation, and tuning of a variety of irreg-
ular computations arising in computational science and
engineering applications, for both multicore CPUs and
GPGPU platforms [4, 11, 5, 16, 1]. In reflecting on this
experience, the following question arose:
What is the boundary between computations
that can and cannot be effectively accelerated
by GPUs, relative to general-purpose multi-
core CPUs within a roughly comparable power
footprint?
Though we do not claim a definitive answer to this
question, we believe our preliminary findings might sur-
prise the broader community of application development
teams whose charge it is to decide whether and how
much effort to expend on GPGPU code development.
Position. Our central aim is to provoke a more real-
istic discussion about the ultimate role of GPGPU ac-
celerators in applications. In particular, we argue that,
for a moderately complex class of “irregular” compu-
tations, even well-tuned GPGPU accelerated implemen-
tations on currently available systems will deliver per-
formance that is, roughly speaking, only comparable to
well-tuned code for general-purpose multicore CPU sys-
tems, within a roughly comparable power footprint. Put
another way, adding a GPU is equivalent in performance
to simply adding one or perhaps two more multicore
CPU sockets. Thus, one might reasonably ask whether
this level of performance increase is worth the potential
productivity loss from adoption of a new programming
model and re-tuning for the accelerator.
Our discussion considers (a) iterative solvers for
sparse linear systems; (b) direct solvers for sparse linear
systems; and (c) the fast multipole method for particle
systems. These appear in traditional high-performance
scientific computing applications, but are also of increas-
ing importance in graphics, physics-based games, and
large-scale machine learning problems.
Threats to validity. Our conclusions represent our interpretation of the data. By way of full disclosure up front, we acknowledge at least the following three major weaknesses in our position.
• (Threat 1) Our perspective comes from relatively narrow classes of applications. These computations come from traditional HPC applications.
• (Threat 2) Some conclusions are drawn from partial results. Our work is very much ongoing, and we are carefully studying our GPU codes to ensure that we have not missed additional tuning opportunities.
• (Threat 3) Our results are limited to today’s platforms. At the time of this writing, we had access to NVIDIA Tesla C1060/S1070 and GTX285 systems. Our results do not yet include ATI systems or NVIDIA’s new Fermi offerings, which could yield very different conclusions [12, 13]. Also, some of the performance limits we discuss stem in part from the limits of PCIe. If CPUs and GPUs move onto the same die, this limitation may become irrelevant.
Having acknowledged these limitations, we make the
following counter-arguments.
Regarding Threat 1, we claim these classes have three interesting features. First, as stated previously, these computations will have an impact in increasingly sophisticated emerging applications in graphics, gaming, and machine learning. Second, the computations are non-trivial, going beyond just a single “kernel,” like matrix multiply or sparse matrix-vector multiplication. Since they involve additional context, the computations begin to approach larger and more realistic applications. Third, they have a mix of regular and irregular behavior, and may therefore live near the boundaries of what we might expect to run better on a GPU than a CPU.
Regarding Threat 2, we would claim that we achieve
extremely high levels of absolute performance in all our
codes, so it is not clear whether there is much room left
for additional improvement, at least, without resorting to
entirely new algorithms.
Regarding Threat 3, it seems to us that just moving a GPU-like accelerator unit onto the same die as one or more CPU-like cores will not resolve all issues. For example, the high-bandwidth channels available on a GPU board would, we presume, have to be carried over to a future same-die CPU/GPU socket to deliver the same level of performance we enjoy today when the entire problem can reside on the GPU.
2 Iterative Sparse Solvers
We first consider the class of iterative sparse solvers. Given a sparse matrix A, we wish either to solve a linear system (i.e., compute the solution x of Ax = b) or to compute the eigenvalues and/or eigenvectors of A, using an iterative method such as the conjugate gradient or Lanczos algorithms [6]. These algorithms have the same basic structure: they iteratively compute a sequence of approximate solutions that ultimately converge to the solution within a user-specified error tolerance. Each iteration multiplies A by a dense vector, which is a sparse matrix-vector multiply (SpMV) operation. Algorithmically, an SpMV computes y ← A · x, given A and x. To first order, an SpMV is dominated simply by the time to stream the matrix A, and within an iteration, SpMV has no temporal locality. That is, we expect the performance of SpMV, and thus of the solver overall, to be largely memory-bandwidth bound.
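To make the streaming behavior concrete, here is a minimal sketch of SpMV in plain Python. It assumes the compressed sparse row (CSR) layout, which is one common choice and not necessarily the format used in the tuned codes discussed here; the function name `spmv_csr` and the small example matrix are illustrative only.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # The nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Each stored entry of A is read exactly once (streaming access)
            # and costs one multiply and one add; x is read indirectly.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Every entry of `values` is touched once per multiply and never reused, which is why the matrix traffic, rather than arithmetic, tends to set the running time.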
We have with others for many years studied auto-
tuning of SpMV for single- and multicore CPU plat-
forms [16, 14, 10]. The challenge is that although SpMV
is bandwidth bound, a sparse matrix must be stored us-
ing a graph data structure, which will lead to indirect and
irregular memory references to the x and/or y vectors.
Nevertheless, the main cost for typical applications on
cache-based machines is the bandwidth-bound aspect of
reading A.
Thus, GPUs are attractive for SpMV because they deliver much higher raw memory bandwidth than a multi-socket CPU system within a (very) roughly equal power budget. We have extended our autotuning methodologies for CPU tuning [14] to the case of GPUs [5]. We do in fact achieve a considerable 2× speedup over the CPU case, as Figure 1 shows for a variety of finite-element modeling problems (x-axis) in double precision. (This figure is taken from an upcoming book chapter [15].) Our autotuned GPU SpMV on a single NVIDIA GTX285 system achieves a state-of-the-art 12–19 Gflop/s, compared to an autotuned dual-socket quad-core Intel Nehalem implementation that achieves 7–8 Gflop/s, a 1.5–2.3× improvement. This improvement is roughly what we might expect, given that the GTX285’s peak bandwidth is 159 GB/s, which is 3.1× the aggregate peak bandwidth of the dual-socket Nehalem system (51 GB/s).
However, this performance assumes the matrix is already on the GPU. In fact, there will be additional costs for moving the matrix to the GPU, combined with GPU-specific data reorganization. That is, the optimal implementation on the GPU uses a different data structure than either the optimal or the baseline implementation on the CPU. Indeed, this data structure tuning is even more critical on the GPU, due to the performance requirement of coalesced accesses; without it, the GPU provides no advantage over the CPU [2].
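As an illustration of what such a reorganization can look like, the sketch below converts CSR arrays into a padded, column-major ELLPACK-style layout. This is an assumption made for the example: the text does not say which format the tuned GPU code uses, and the names `csr_to_ell` and `spmv_ell` are hypothetical. The point is only that entries handled by neighboring rows (threads) end up at neighboring addresses.

```python
def csr_to_ell(row_ptr, col_idx, values, pad_col=0):
    """Convert CSR arrays to a padded, column-major ELLPACK-style layout."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    # Slot j of every row is stored contiguously: entry (i, j) lives at
    # index j * n_rows + i, so consecutive rows touch consecutive addresses.
    ell_cols = [pad_col] * (width * n_rows)
    ell_vals = [0.0] * (width * n_rows)
    for i in range(n_rows):
        for j, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[j * n_rows + i] = col_idx[k]
            ell_vals[j * n_rows + i] = values[k]
    return width, ell_cols, ell_vals


def spmv_ell(n_rows, width, ell_cols, ell_vals, x):
    """Compute y = A * x from the padded layout; padding contributes zero."""
    y = [0.0] * n_rows
    for j in range(width):
        for i in range(n_rows):  # on a GPU, one thread would handle each row i
            idx = j * n_rows + i
            y[i] += ell_vals[idx] * x[ell_cols[idx]]
    return y


# Same illustrative matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
width, ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, values)
y = spmv_ell(3, width, ell_cols, ell_vals, [1.0, 2.0, 3.0])
```

In this toy case the padded layout stores six values for five nonzeros, and the conversion itself is a full pass over the matrix, which is the kind of one-time cost the text refers to.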
The host-to-GPU copy is also not negligible. To see why, consider the following. Recall that, to first order, SpMV streams the matrix A and performs just 2 flops per matrix entry. If SpMV runs at P Gflop/s in double precision, then the “equivalent” effective bandwidth in double precision is at least (8 bytes) / (2 flops) * P, or 4P GB/s. Now, decompose the GPU solver execution time into three phases: (a) data reorganization, at a rate of $\beta_{\mathrm{reorg}}$ words per second; (b) host-to-GPU data transfer, at $\beta_{\mathrm{transfer}}$ words per second, without increasing the size of A; and finally (c) q iterations of SpMV, at an effective rate of $\beta_{\mathrm{gpu}}$ words per second. On a multicore CPU, let $\beta_{\mathrm{cpu}}$ be the equivalent effective bandwidth, also in words per second. For a matrix of k words, we
[Figure 1 plot: “Best Single-node Double-precision SpMV”. y-axis: GFlop/s (0 to 20); x-axis: matrix (input data set): Protein, Spheres, Cantilever, WindTunnel, Harbor, QCD, Ship. Series: ParCo’09 tuned Nehalem (circles) and STI PowerXCell 8i (diamonds); baseline Intel Nehalem, multithreaded but untuned (8 cores); NVIDIA, SC’09, on GTX285; our code, PPoPP’10 (annotated “1.1 to 1.5x”).]
Figure 1: The best GPU implementation of sparse matrix-vector multiply (SpMV) (“Our code”, on one NVIDIA GTX 285) can be over 2× faster than a highly-tuned multicore CPU implementation (“Tuned Nehalem”, on a dual-socket quad-core system). Implementations: ParCo’09 [16], SC’09 [2], and PPoPP’10 [5]. Note: Figure taken from an upcoming book chapter [15].
will only observe a speedup if the CPU time, $\tau_{\mathrm{cpu}}$, exceeds the GPU time, $\tau_{\mathrm{gpu}}$. With this constraint, we can determine how many iterations q are necessary for the GPU-based solver to beat the CPU-based one:
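\begin{align}
\tau_{\mathrm{cpu}} &\ge \tau_{\mathrm{gpu}} \tag{1} \\
\implies\quad \frac{k \cdot q}{\beta_{\mathrm{cpu}}} &\ge k \cdot \left( \frac{1}{\beta_{\mathrm{reorg}}} + \frac{1}{\beta_{\mathrm{transfer}}} + \frac{q}{\beta_{\mathrm{gpu}}} \right) \tag{2} \\
\implies\quad q &\ge \frac{\dfrac{1}{\beta_{\mathrm{reorg}}} + \dfrac{1}{\beta_{\mathrm{transfer}}}}{\dfrac{1}{\beta_{\mathrm{cpu}}} - \dfrac{1}{\beta_{\mathrm{gpu}}}} \tag{3}
\end{align}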
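The step from (2) to (3) can be spelled out as follows. Here $\beta$ with a subscript denotes the corresponding rate named in the text (reorganization, transfer, GPU, CPU); the symbol choice is a reconstruction. The intermediate line divides (2) by k and collects the terms in q, and the final division preserves the direction of the inequality only under the assumption that the GPU's effective rate exceeds the CPU's.

```latex
\begin{align*}
q \left( \frac{1}{\beta_{\mathrm{cpu}}} - \frac{1}{\beta_{\mathrm{gpu}}} \right)
  &\ge \frac{1}{\beta_{\mathrm{reorg}}} + \frac{1}{\beta_{\mathrm{transfer}}},
\\
\beta_{\mathrm{gpu}} > \beta_{\mathrm{cpu}}
  &\iff \frac{1}{\beta_{\mathrm{cpu}}} - \frac{1}{\beta_{\mathrm{gpu}}} > 0 .
\end{align*}
```

If the GPU's effective rate were not higher than the CPU's, the left-hand factor would be non-positive and no number of iterations would recover the setup cost.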
From Figure 1, we might optimistically take $\beta_{\mathrm{gpu}}$ = (4 bytes/flop) * (19 Gflop/s) = 76 GB/s, and pessimistically take $\beta_{\mathrm{cpu}}$ = (4 bytes/flop) * (6 Gflop/s) = 24 GB/s; both are about half the aggregate peak on the respective platforms. Reasonable estimates of $\beta_{\mathrm{reorg}}$ and $\beta_{\mathrm{transfer}}$, based on measurement (not peak), are 0.5 and 1 GB/s, respectively. The solver must, therefore, perform q ≥ 105 iterations to break even; thus, to realize an actual 2× speedup on the whole solve, we would need q ≥ 840 iterations. While typical iteration counts reported for standard problems number in the few hundreds [6], whether this value of q is large or not is highly problem- and solver-dependent, and we might not know until run-time when the problem (matrix) is known. The developer must make an educated guess and take a chance, raising the question of what she or he should expect the real pay-off from GPU acceleration to be.
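The break-even estimate can be checked with a few lines of Python. This is a sketch of inequality (3) only; the function and parameter names are chosen here for illustration, and the inputs are the figures quoted in the text.

```python
def break_even_iterations(beta_cpu, beta_gpu, beta_reorg, beta_transfer):
    """Smallest (real-valued) q satisfying inequality (3).

    All rates must share one unit (here GB/s); the matrix size k cancels.
    """
    if beta_gpu <= beta_cpu:
        raise ValueError("no break-even point unless beta_gpu > beta_cpu")
    setup = 1.0 / beta_reorg + 1.0 / beta_transfer  # one-time cost per unit of A
    gain = 1.0 / beta_cpu - 1.0 / beta_gpu          # time saved per iteration
    return setup / gain


# Figures quoted in the text: 4 bytes/flop times 19 and 6 Gflop/s, with
# measured reorganization and transfer rates of 0.5 and 1 GB/s.
q_min = break_even_iterations(beta_cpu=4 * 6, beta_gpu=4 * 19,
                              beta_reorg=0.5, beta_transfer=1.0)
print(f"break-even after about {q_min:.0f} iterations")
```

With these inputs the result is about 105 iterations, matching the break-even figure in the text. Passing `float("inf")` for a rate drops the corresponding term from the setup cost.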
Having said that, our analysis may also be pessimistic. One could, for instance, improve the effective transfer term by pipelining the matrix transfer with the SpMV. Or, one might be able to eliminate the transfer term altogether by assembling the matrix on the GPU itself [3]. The main point is that making use of GPU acceleration even in this relatively simple “application” is more complicated than it might at first seem.
3 Direct Sparse Solvers
Another important, related class of sparse matrix solvers comprises direct methods, which explicitly factor the matrix. In contrast to an iterative solver, a direct solver has a fixed number of operations, as well as more complex task-level parallelism, greater storage requirements, and possibly even more irregular memory access behavior than the largely data-parallel and streaming behavior of the iterative case (Section 2).
We have been interested in such sparse direct solvers,
particularly so-called multifrontal methods for Cholesky
factorization, which we tune specifically for structural
analysis problems arising in civil engineering [9]. From
the perspective of GPU acceleration, the most rele-
vant aspect of this class of sparse direct solvers is that
the workload consists of many dense matrix subprob-
lems (factorization, triangular multiple-vector solves,
and rank-k update matrix multiplications). Generally
speaking, we expect a GPU to easily accelerate such sub-
computations.
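The three kinds of dense subproblem can be seen in a single partial factorization of one dense "frontal" matrix. The sketch below is a plain-Python, unblocked illustration under our own assumptions (the function name `partial_cholesky` and the 3x3 example are hypothetical); a production multifrontal code would instead call blocked dense kernels, and the text does not describe its implementation at this level.

```python
import math


def partial_cholesky(front, n_elim):
    """Eliminate the first n_elim variables of a dense symmetric front.

    Returns (L11, L21, S): the Cholesky factor of the leading block, the
    off-diagonal panel, and the Schur-complement update matrix.
    """
    n = len(front)
    a = [row[:] for row in front]  # work on a copy
    for j in range(n_elim):
        # (1) Dense factorization step on the pivot.
        a[j][j] = math.sqrt(a[j][j])
        # (2) Triangular-solve step: scale the column below the pivot.
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]
        # (3) One rank-1 slice of the rank-k update on the trailing block
        #     (lower triangle only).
        for i in range(j + 1, n):
            for c in range(j + 1, i + 1):
                a[i][c] -= a[i][j] * a[c][j]
    l11 = [[a[i][j] if j <= i else 0.0 for j in range(n_elim)]
           for i in range(n_elim)]
    l21 = [[a[i][j] for j in range(n_elim)] for i in range(n_elim, n)]
    # Only the lower triangle was updated; mirror it to return a full S.
    s = [[a[max(i, c)][min(i, c)] for c in range(n_elim, n)]
         for i in range(n_elim, n)]
    return l11, l21, s


front = [[4.0, 2.0, 2.0],
         [2.0, 5.0, 3.0],
         [2.0, 3.0, 6.0]]
l11, l21, s = partial_cholesky(front, n_elim=1)
```

In a multifrontal method the matrix `s` would be passed up to the parent front, so the dimensions of `front` and `n_elim` differ from one subproblem to the next.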
In reality, however, the size of these subproblems
changes as the computation proceeds, and the subprob-
lems themselves may execute asynchronously together,
depending on the input problem. That is, the input
Conf. Parallel Processing (ICPP), Vienna, Austria,
September 2009.
[2] N. Bell and M. Garland. Implementing a sparse matrix-vector multiplication on throughput-oriented processors. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, November 2009.
[3] C. Cecka, A. J. Lew, and E. Darve. Assembly of finite element methods on graphics processors. Int’l. J. Numerical Methods in Engineering, 2009.
[4] A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc. Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures. In Proc. IEEE Int’l. Parallel and Distributed Processing Symp. (IPDPS), Atlanta, GA, USA, April 2010.
[5] J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), Bangalore, India, January 2010.
[6] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, PA, USA, 1997.
[7] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comp. Phys., 73:325–348, 1987.
[8] N. A. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. J. Comp. Phys., 227:8290–8313, 2008.
[9] M. E. Guney. High-performance direct solution of finite-element problems on multi-core processors. PhD thesis, Georgia Institute of Technology, Atlanta, GA, USA, May 2010.
[10] E.-J. Im, K. Yelick, and R. Vuduc. SPARSITY: Optimization framework for sparse matrix kernels. Int’l. J. High Performance Computing Applications (IJHPCA), 18(1):135–158, February 2004.
[11] I. Lashuk, A. Chandramowlishwaran, H. Langston, T.-A. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, L. Ying, D. Zorin, and G. Biros. A massively parallel adaptive fast multipole method on heterogeneous architectures. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, November 2009.
[12] NVIDIA. NVIDIA’s next generation CUDA compute architecture: Fermi™, v1.1. Whitepaper (electronic), September 2009. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[13] D. A. Patterson. The top 10 innovations in the new NVIDIA Fermi architecture, and the top 3 next challenges. http://www.nvidia.com/content/PDF/fermi_white_papers/D.Patterson_Top10InnovationsInNVIDIAFermi.pdf, September 2009.
[14] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proc. SciDAC, J. Physics: Conf. Ser., volume 16, pages 521–530, 2005.
[15] S. Williams, N. Bell, J. Choi, M. Garland, L. Oliker, and R. Vuduc. Sparse matrix vector multiplication on multicore and accelerator systems. In J. Dongarra, D. A. Bader, and J. Kurzak, editors, Scientific Computing with Multicore Processors and Accelerators. CRC Press, 2010.
[16] S. Williams, R. Vuduc, L. Oliker, J. Shalf, K. Yelick, and J. Demmel. Optimizing sparse matrix-vector multiply on emerging multicore platforms. Parallel Computing (ParCo), 35(3):178–194, March 2009. Extends conference version: http://dx.doi.org/10.1145/1362622.1362674.
[17] L. Ying, G. Biros, D. Zorin, and H. Langston. A new parallel kernel-independent fast multipole method. In Proc. ACM/IEEE Conf. Supercomputing (SC), Phoenix, AZ, USA, November 2003.
[18] L. Ying, D. Zorin, and G. Biros. A kernel-independent adaptive fast multipole method in two and three dimensions. J. Comp. Phys., 196:591–626, May 2004.