
What GPU Computing Means for High-End Systems

by Richard Vuduc, Kent Czechowski
IEEE Micro (2011)


Available from ieeexplore.ieee.org

RICHARD VUDUC AND KENT CZECHOWSKI
Georgia Institute of Technology
Between 2018 and 2020, the
first exascale supercomputers will
come online.1 If realized, these systems
will perform a staggering 10^18 or more
floating-point operations per second
(1 exaflop/second, or 1 EF/s), which is
two to three orders of magnitude more
than what’s available today. Such sys-
tems promise to advance computer-
based simulation of diverse phenomena
in climate, energy, materials, combus-
tion, geoscience, and biomedicine,
among numerous other areas.2
What will the architecture of such a
machine look like? GPU computing dom-
inates the conversation.3 The reason is
simple: GPUs today deliver more peak
performance and bandwidth than
conventional CPU designs within a com-
parable power footprint. As such, a GPU-
based supercomputer should be smaller
and more energy-efficient than its non-
GPU counterpart, and therefore cheaper
to buy with a lower energy bill over its
lifetime. Three of today’s five fastest
machines use GPUs (http://top500.org),
as do half of the top 10 green supercom-
puters as measured by performance per
Watt (http://green500.org). Most high-
performance computing (HPC) analysts
would bet that GPU-like designs are per-
haps the most viable path toward exa-
scale computing.
But is this a sure bet? Here, we sug-
gest that exactly the opposite might be
true: relative to an ideal GPU-like system
at exascale, a system with slower
processors—but, critically, better-
balanced ones—might yield an overall
system with higher performance while
consuming less energy. The analysis
behind this claim is not actually about
the specifics of CPUs versus GPUs.4-7
Rather, it draws on a fundamental
principle of algorithmic performance
engineering—namely, the principle of
system balance.8-10
Balance and intensity
The classical principle of balance says
simply that we should strive to design
systems in which compute time equals
I/O time.9 Compute time might be the
time to perform flops, and I/O time can
refer to memory or network traffic
time. When the system is balanced, we can hide I/O
time, making the computation compute
bound rather than I/O bound.
Balance gives us an analytical frame-
work for relating architectures and algo-
rithms. Consider the hypothetical
multiprocessor with p cores shown in
Figure 1a. It has Z words of local storage
(that is, registers, cache, and scratchpad
memory). Each core can perform up to
C_0 operations per unit time, and the
peak off-processor bandwidth is b
words per unit time. Given a computa-
tion, we will apply the balance principle
to derive precise analytical relationships
among these parameters.
To do so, we also must characterize
a computation by its intensity, which is
the number of operations (for example,
flops) it performs divided by the number
of words it must transfer between the
fast and external memories. Intensity
has units of operations per word and is
an intrinsic property of the computation.
For example, dense-matrix multiply usu-
ally has a high intensity because it has a
lot of data reuse per operation; sorting,
by contrast, has streaming behavior
and a relatively lower intensity. The pre-
cise value of intensity depends on the
size of the fast memory Z. More fast
memory usually means fewer required
transfers, so intensity will tend to
grow with Z. Thus, we denote intensity
by I(Z ).
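
To make the dependence of intensity on Z concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not part of the original analysis: the function names are hypothetical, the matrix multiply is assumed to be tiled so that three square tiles fit in the Z words of fast memory, and elementwise vector addition stands in for a streaming computation with no reuse.

```python
import math


def matmul_intensity(n, Z):
    """Approximate operations per word for a tiled n x n matrix multiply.

    Assumption: three t x t tiles (one each of A, B, and C) fit in Z words,
    and n is treated as a multiple of t.
    """
    t = max(1, min(n, math.isqrt(Z // 3)))
    flops = 2 * n**3
    # (n/t)^3 tile products, each loading one tile of A and one of B,
    # plus reading and writing every element of C once.
    words = 2 * n**3 / t + 2 * n**2
    return flops / words


def stream_intensity(n, Z):
    """Operations per word for adding two length-n vectors.

    There is no data reuse, so the fast-memory size Z has no effect.
    """
    flops = n
    words = 3 * n  # read two inputs, write one output
    return flops / words


if __name__ == "__main__":
    for Z in (3 * 64 * 64, 3 * 256 * 256):
        print(Z, matmul_intensity(4096, Z), stream_intensity(4096, Z))
```

Under these assumptions the matrix-multiply intensity grows roughly like the square root of Z, while the streaming intensity stays fixed at 1/3 no matter how much fast memory is available.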
If we apply the balance principle to a
given computation and this processor sys-
tem, we will obtain the constraint8-10
$$\frac{p \cdot C_0}{b} = I(Z) \qquad (1)$$
where p · C_0/b is the processor’s balance
and I(Z ) is the intensity. When Equation
1 is true, the processor delivers opera-
tions and data at the rate required by
the computation. Furthermore, it gives
an explicit relationship among the archi-
tectural parameters.
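
The balance constraint can be checked numerically. The following Python sketch is hypothetical: the function names and the sample machine parameters are invented for illustration, and the bound min(p · C_0, b · I) is the standard roofline-style estimate rather than a formula stated in the text above.

```python
import math


def machine_balance(p, C0, b):
    """Processor balance in operations per word: p * C0 / b."""
    return p * C0 / b


def attainable_rate(p, C0, b, intensity):
    """Roofline-style bound on operations per unit time.

    The rate is capped either by peak compute (p * C0) or by how fast
    data can be supplied (b * intensity), whichever is smaller.
    """
    return min(p * C0, b * intensity)


def classify(p, C0, b, intensity):
    """Compare a computation's intensity against the machine balance."""
    balance = machine_balance(p, C0, b)
    if math.isclose(intensity, balance):
        return "balanced"
    return "compute bound" if intensity > balance else "I/O bound"


if __name__ == "__main__":
    # Made-up machine: 16 cores, 2e9 ops/s per core, 4e9 words/s off-chip.
    p, C0, b = 16, 2e9, 4e9
    for intensity in (4, 8, 32):
        print(intensity, classify(p, C0, b, intensity),
              attainable_rate(p, C0, b, intensity))
```

With these made-up numbers the machine balance is 8 operations per word, so a computation with intensity 4 is I/O bound at half of peak, one with intensity 8 satisfies Equation 1, and one with intensity 32 is compute bound at the full peak rate.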
A nice way to visualize the impact of
balance and intensity on performance is
through a roofline diagram.11 Figure 1b
is an example. Each line represents a
typical processor from a particular plat-
form class (GPU, CPU, or mobile); the
y-axis shows the maximum achievable
Prolegomena
Published by the IEEE Computer Society. 0272-1732/11/$26.00 © 2011 IEEE
