Sign up & Download
Sign in

Cache oblivious stencil computations

by Matteo Frigo, Volker Strumpen
Proceedings of the 19th annual international conference on Supercomputing ICS 05 (2005)

Abstract

We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n -dimensional spaces. On an "ideal cache" of size Z , our algorithm saves a factor of Θ( Z 1/ n cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.

Cite this document (BETA)

Available from portal.acm.org
Page 1
hidden

Cache oblivious stencil computations

Cache Oblivious Stencil Computations
Matteo Frigo and Volker Strumpen∗
IBM Austin Research Laboratory
11501 Burnet Road, Austin, TX 78758
May 25, 2005
Abstract
We present a cache oblivious algorithm for stencil com-
putations, which arise for example in finite-difference
methods. Our algorithm applies to arbitrary stencils in
n-dimensional spaces. On an “ideal cache” of size Z,
our algorithm saves a factor of Θ(Z1/n) cache misses
compared to a naive algorithm, and it exploits temporal
locality optimally throughout the entire memory hier-
archy.
1 Introduction
A stencil defines the computation of an element in
an n-dimensional spatial grid at time t as a function
of neighboring grid elements at time t − 1, . . . , t − k.
This computational pattern arises in many contexts,
including explicit finite-difference methods [5]. The
n-dimensional grid plus the time dimension span an
(n+1)-dimensional spacetime .1 Each spacetime point,
except possibly for initial and boundary values, is com-
puted by means of a computational kernel . In prac-
tical implementations of stencils, there is often no need
to store the entire spacetime; storing a bounded num-
ber of time steps per space point is sufficient. For ex-
ample, consider a 3-point stencil in 1-dimensional space
(2-dimensional spacetime): Because the computation of
a point at time t depends only upon three points at time
t−1, it is sufficient to store two time steps only. For this
important case of stencil computations with kernels that
require a bounded amount of storage per space point,
we present a cache-oblivious algorithm that exploits a
memory hierarchy optimally.
A stencil computation is a traversal of spacetime in
? This work was supported in part by the Defense Ad-
vanced Research Projects Agency (DARPA) under contract No.
NBCH30390004.
Copyright 2005 by the Association for Computing Machinery, Inc. Permission to
make digital or hard copies of part of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit
or commercial advantage and that copies bear this notice and the full citation on
the first page or intial screen of the document. Copyrights for components of this
work owned by others than ACM must be honored. Abstracting with credit is
permitted. To copy otherwise, to republish, to post on servers, or to redistribute
to lists, requires prior specific permission and/or a fee. Request permissions from
Publications Dept., ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.
an order that respects the data dependencies imposed
by the stencil. The simplest stencil computation ap-
plies the kernel to all spacetime points at time t before
computing any point at time t+1. On a memory hierar-
chy, if the size of the storage required for the spacetime
points of one time step exceeds the cache size Z, this
simple computation incurs a number of cache misses
proportional to p, where p is the number of spacetime
points computed. In contrast, when traversing a suffi-
ciently large rectangular region of (n + 1)-dimensional
spacetime spanning a sufficiently large time interval, our
algorithm incurs at most O(p/Z1/n) cache misses on a
machine with an ideal cache [2] of size Z. This num-
ber of cache misses matches the lower bound of Hong
and Kung [3] within a constant factor. Unlike blocked
algorithms, our algorithm is cache oblivious: it does
not contain the cache size as a parameter [2]. There-
fore, the algorithm makes optimal use of each level in
a multi-level memory hierarchy. In addition, our algo-
rithm applies to arbitrary stencils and arbitrary space
dimension n > 0.
Cache oblivious algorithms for special cases of stencil
computations have been proposed before. Bilardi and
Preparata [1] discuss cache oblivious algorithms for the
related problem of simulating large parallel machines on
smaller machines in a spacetime-efficient manner. Their
algorithms apply to 1-dimensional and 2-dimensional
spaces and do not generalize to higher dimensions. In
fact, the authors declare the 3-dimensional case, and im-
plicitly higher dimensional spaces, to be an open prob-
lem. Prokop [4] gives a cache oblivious stencil algo-
rithm for a 3-point stencil in 1-dimensional space, and
proves that the algorithm is optimal. His algorithm is
restricted to square spacetime regions, and it does not
extend to higher dimensions. We are unaware of any
previous solution of the general n-dimensional case.
We introduce a simplified cache-oblivious stencil al-
gorithm for 1-dimensional grids and a 3-point stencil in
1We emphasize that we denote the dimensionality of space
as n and the dimensionality of spacetime as n + 1. When using
the term space, we implicitly exclude the time dimension. When
we include the time dimension, we refer to spacetime.
361
Page 2
hidden
void walk1(int t0, int t1, int x0, int x˙0, int x1, int x˙1)
{
int ∆t = t1 - t0;
if (∆t == 1) {
/* base case */
int x;
for (x = x0; x < x1; ++x)
kernel(t0, x);
} else if (∆t > 1) {
if (2 * (x1 - x0) + (x˙1 - x˙0) * ∆t >= 4 * ∆t) {
/* space cut */
int xm = (2 * (x0 + x1) + (2 + x˙0 + x˙1) * ∆t) / 4;
walk1(t0, t1, x0, x˙0, xm, -1);
walk1(t0, t1, xm, -1, x1, x˙1);
} else {
/* time cut */
int s = ∆t / 2;
walk1(t0, t0 + s, x0, x˙0, x1, x˙1);
walk1(t0 + s, t1, x0 + x˙0 * s, x˙0, x1 + x˙1 * s, x˙1);
}
}
}
Figure 1: Procedure walk1 for traversing a 2-dimensional spacetime spanned by a 1-dimensional grid and time for a 3-point
stencil.
Section 2. Then, we present our algorithm for arbitrary
stencils and n-dimensional grids in Section 3, and prove
bounds on the number of cache misses in Section 4.
2 One-dimensional Stencil Algorithm
Procedure walk1 in Fig. 1 traverses rectangular space-
time (t, x), where 0 ≤ t < T and 0 ≤ x < N . For sim-
pler illustration, we restrict this procedure to observe
the dependencies of a 3-point stencil, i.e. the procedure
visits point (t+1, x) after visiting points (t, x−1), (t, x),
and (t, x + 1). Although we are ultimately interested
in traversing rectangular spacetime regions, procedure
walk1 operates on more general trapezoidal regions such
as the one shown in Fig. 2. For integers t0, t1, x0, x˙0, x1,
and x˙1, we define the trapezoid T (t0, t1, x0, x˙0, x1, x˙1)
to be the set of integer pairs (t, x) such that t0 ≤ t < t1
and x0 + x˙0(t − t0) ≤ x < x1 + x˙1(t − t0). (We use
the Newtonian notation x˙ = dx/dt.) The height of the
trapezoid is ∆t = t1 − t0, and we define the width
to be the average of the lengths of the two parallel
sides, i.e. w = (x1 − x0) + (x˙1 − x˙0)∆t/2. The cen-
ter of the trapezoid is point (t, x), where t = (t0+ t1)/2
and x = (x0 + x1)/2 + (x˙0 + x˙1)∆t/4 (i.e., the average
of the four corners). The volume of the trapezoid is
the number of points in the trapezoid: Vol(T ) = |T |.
We only consider well-defined trapezoids, for which
x0 x1
t0
t1
x
t
w
∆t
Figure 2: Illustration of the trapezoid T (t0, t1, x0, x˙0, x1, x˙1)
for x˙0 = 1 and x˙1 = −1. The trapezoid includes all points
in the shaded region, except for those on the top and right
edges.
these three conditions hold: t1 ≥ t0, x1 ≥ x0, and
x1 +∆t · x˙1 ≥ x0 +∆t · x˙0.
The special case T (t0, t1, x0, 0, x1, 0) denotes a rect-
angular region. In this section, we restrict the slopes
x˙0 and x˙1 of the edges to assume values 1, −1, or 0,
delaying the general case until Section 3.
Fig. 1 shows procedure walk1 as a recursive
C function whose parameters denote the trapezoid
T (t0, t1, x0, x˙0, x1, x˙1). The procedure visits all points
of the trapezoid in an order that respects the stencil de-
pendencies. Procedure walk1 decomposes the trapezoid
362

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

17 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
24% Student (Master)
 
24% Ph.D. Student
 
12% Assistant Professor
by Country
 
18% Belgium
 
12% China
 
12% France