GPU acceleration of an unmodified parallel finite element Navier-Stokes solver

by Dominik Göddeke, Sven H. M. Buijssen, Hilmar Wobker, Stefan Turek
2009 International Conference on High Performance Computing & Simulation (2009)

Abstract

We have previously suggested a minimally invasive approach to include hardware accelerators into an existing large-scale parallel finite element PDE solver toolkit, and implemented it in our software FEAST. Our concept has the important advantage that applications built on top of FEAST benefit from the acceleration immediately, without changes to application code. In this paper we explore the limitations of our approach by accelerating a Navier-Stokes solver. This nonlinear saddle point problem is much more involved than our previous tests, and does not exhibit an equally favourable acceleration potential: not all computational work is concentrated inside the linear solver. Nonetheless, we are able to achieve speedups of more than a factor of two on a small GPU-enhanced cluster. We conclude with a discussion of how our concept can be altered to further improve acceleration.

Dominik Göddeke, Sven H. M. Buijssen, Hilmar Wobker and Stefan Turek
Angewandte Mathematik und Numerik, TU Dortmund, Germany
{dominik.goeddeke, sven.buijssen, hilmar.wobker, stefan.turek}@math.tu-dortmund.de
KEYWORDS: Large Scale Scientific Computing, Parallelization of Simulation, Fine-Grain Parallelism and Architectures
1. INTRODUCTION
Computational science and numerical simulation are in the midst of a revolution, caused by the fundamental paradigm shift of the underlying hardware towards parallelism and heterogeneity. Due to power and heat considerations, chip manufacturers now scale the number of cores per chip rather than clock frequencies. At the same time, memory bandwidth and latency continue to improve at a much slower rate than (peak) compute performance, further inhibited by pin limits. This well-known memory wall problem is worsened by multicore architectures: the available bandwidth typically scales with the number of sockets per compute node and not with the number of cores per chip. Graphics processing units (GPUs), on the other hand, provide a much higher bandwidth than commodity CPU designs, and their architecture and programming model are representative of future manycore architectures.
1.1. Hardware-Oriented Numerics
The memory wall problem is particularly critical in the numerical simulation of physical phenomena described by partial differential equations (PDEs), such as the Navier–Stokes equations governing fluid flow. In the finite element method (FEM), and similarly for finite differences and finite volumes, the discretisation of the PDEs leads to large, sparse systems of equations. Linear algebra operations on these systems, such as matrix-vector multiplication, exhibit an arithmetic intensity (ratio of floating point operations to memory accesses) of 1:1 or less, while peak processor performance is only attained for ratios of 10:1 or higher. In addition, the discrete problems are typically much too large to be solved on a single computer, and parallel solution schemes are necessary. We are convinced that in this situation, significant performance gains can only be achieved by ‘hardware-oriented numerics’. This concept comprises much more than a highly tuned implementation involving optimal data structures and maximising data reuse to exploit (cache) memory hierarchies. Here, we only illustrate the broad ideas, and refer to previous work for more details [22, 23].
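The arithmetic intensity argument above can be made concrete with a small counting experiment. The following is a minimal sketch, not code from FEAST: the function names `poisson_1d_csr` and `csr_matvec_counted`, the choice of a 1D tridiagonal test matrix in CSR storage, and the counting model (every array read or write counts as one memory access, with no cache reuse) are all assumptions made for illustration.

```python
import numpy as np

def poisson_1d_csr(n):
    """Build the n x n matrix tridiag(-1, 2, -1) in CSR form (illustrative test matrix)."""
    data, indices, indptr = [], [], [0]
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))
    return np.array(data), np.array(indices), np.array(indptr)

def csr_matvec_counted(data, indices, indptr, x):
    """Compute y = A x for a CSR matrix and count flops and memory accesses.

    Assumed counting model: each array element read or written is one
    access; caches and registers are ignored.
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    flops = 0
    accesses = 0
    for i in range(n):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
            flops += 2      # one multiplication, one addition
            accesses += 3   # data[k], indices[k], x[indices[k]]
        y[i] = acc
        accesses += 2       # one row pointer read, one write to y[i]
    return y, flops, accesses

if __name__ == "__main__":
    n = 1000
    data, indices, indptr = poisson_1d_csr(n)
    y, flops, accesses = csr_matvec_counted(data, indices, indptr, np.ones(n))
    print("flops per memory access:", flops / accesses)
```

Under this simplified model the tridiagonal example performs roughly 0.55 floating point operations per memory access, which falls in the "1:1 or less" regime described above. A real machine's caches change the absolute numbers, but not the conclusion that such a kernel is limited by memory bandwidth rather than by peak compute rate.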
For ill-conditioned problems whose condition number depends on the mesh width, multigrid methods are obligatory due to their asymptotic optimality: even for a simple Poisson problem, a multigrid solver with the worst possible choice of smoother, data structure and numbering scheme for the unknowns executes faster than a Krylov subspace solver with a very powerful elementary preconditioner like ILU [15]. On the other hand, as long as the factorisation overhead can be amortised over several right-hand sides, the (serial) direct solver in the UMFPACK library [5] outperforms multigrid for up to 20,000–60,000 unknowns, depending on the hardware. Serial smoothers with ‘optimal’ numerical properties are strongly recursive, preventing a scalable implementation in parallel. Instead, methods are employed that are potentially less efficient in terms of numerics (i.e., convergence rates), but that are much better suited for the communication characteristics of the parallel computer. In general, hardware trends necessitate research into novel numerical methodology that in turn exploits the available hardware in a better way.
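The trade-off between recursive and parallel-friendly methods can be illustrated with two textbook iterations for the 1D Poisson problem. This is a minimal sketch under stated assumptions: Jacobi and lexicographic Gauss-Seidel are chosen here as stand-ins and are not claimed to be the smoothers used in FEAST, and the helper `sweeps_until`, the grid size and the reduction target are hypothetical choices for the demonstration.

```python
import numpy as np

def jacobi_sweep(u, f, h):
    """One Jacobi sweep for -u'' = f with zero Dirichlet boundary values.

    Every interior update reads only the old iterate, so all updates are
    independent of each other and could be executed in parallel.
    """
    new = u.copy()
    new[1:-1] = 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return new

def gauss_seidel_sweep(u, f, h):
    """One lexicographic Gauss-Seidel sweep for the same problem.

    The update of new[i] uses the already updated new[i - 1], which makes
    the loop a sequential recurrence over the grid points.
    """
    new = u.copy()
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (new[i - 1] + new[i + 1] + h * h * f[i])
    return new

def sweeps_until(sweep, n=32, reduction=1e-3, max_sweeps=100_000):
    """Count sweeps needed to reduce the max-norm error by `reduction`.

    With f = 0 the exact solution is zero, so the iterate itself is the error.
    """
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    f = np.zeros(n + 1)
    u = np.sin(np.pi * x)
    u[0] = u[-1] = 0.0  # enforce the boundary values exactly
    start = np.max(np.abs(u))
    for k in range(1, max_sweeps + 1):
        u = sweep(u, f, h)
        if np.max(np.abs(u)) <= reduction * start:
            return k
    return max_sweeps

if __name__ == "__main__":
    print("Jacobi sweeps:      ", sweeps_until(jacobi_sweep))
    print("Gauss-Seidel sweeps:", sweeps_until(gauss_seidel_sweep))
```

In this setting the sequential Gauss-Seidel iteration needs noticeably fewer sweeps than Jacobi, but its inner loop cannot be parallelised without changing the ordering of the unknowns, whereas the Jacobi update is a single data-parallel operation. This is one simple instance of accepting a numerically weaker method because it maps better onto parallel hardware.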
