Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

by Aparna Chandramowlishwaran, Samuel Williams, Leonid Oliker, Ilya Lashuk, George Biros, Richard Vuduc
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010

Abstract

This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intel's quad-core Nehalem, 9.4× on AMD's quad-core Barcelona, and 37.6× on Sun's Victoria Falls (dual-sockets on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA's most advanced GPU architecture.

Available from ieeexplore.ieee.org

Optimizing and Tuning the Fast Multipole Method
for State-of-the-Art Multicore Architectures
Aparna Chandramowlishwaran*†, Samuel Williams*,
Leonid Oliker*, Ilya Lashuk†, George Biros†, Richard Vuduc†
*CRD, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
†College of Computing, Georgia Institute of Technology, Atlanta, GA
1 Introduction
This paper presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) [5] on state-of-the-art multicore processor systems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems. This study focuses on single-node performance since it is a critical building-block in scalable multi-node distributed memory codes and, moreover, is less well-understood.
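To make the complexity claim concrete, recall that the FMM approximates sums of pairwise interactions that cost O(N^2) operations when evaluated directly for N particles. The following is a minimal Python sketch of that direct baseline only, not of the FMM itself. It assumes a 1/r (Laplace-type) interaction kernel, and the function name direct_potentials is hypothetical; neither is taken from the paper.

```python
import math

def direct_potentials(points, charges):
    """Brute-force evaluation of phi_i = sum_{j != i} q_j / |x_i - x_j|.

    The 1/r kernel is an illustrative assumption. The double loop over all
    pairs is what makes this O(N^2); the FMM replaces far-field pairs with
    approximations to reach linear time at a controlled accuracy.
    """
    n = len(points)
    phi = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            r = math.dist(points[i], points[j])
            acc += charges[j] / r
        phi[i] = acc
    return phi

# Three particles in 3-D with made-up positions and charges.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [1.0, 1.0, 2.0]
phi = direct_potentials(points, charges)
```

For the first particle the two contributions are 1/1 and 2/2, so phi[0] is 2.0. Doubling the number of particles quadruples the work of this baseline, which is the cost the FMM is designed to avoid.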
Approach: Specifically, we consider implementations of the kernel-independent FMM (KIFMM) algorithm [16], which simplifies the integration of FMM methods in practical applications (Section 2). The KIFMM itself is a complex computation, consisting of six distinct phases, all of which we parallelize and tune for leading multicore platforms (Section 4). We develop both single- and double-precision implementations, and consider numerous performance enhancements, including: low-level instruction selection, SIMD vectorization and scheduling, numerical approximation, data structure transformations, OpenMP-based parallelization, and tuning of algorithmic parameters. Our implementations are analyzed on a diverse collection of dual-socket multicore systems, including those based on the Intel Nehalem, AMD Barcelona, Sun Victoria Falls, and NVIDIA GPU processors (Section 5).
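As an illustration of what "numerical approximation" can mean in a particle kernel, one widely used trick is to replace an exact reciprocal square root with a cheap estimate refined by a few Newton-Raphson steps. The sketch below is an assumption for illustration only: this section does not say which approximations the authors apply, and the function names rsqrt_estimate and rsqrt_refined are hypothetical.

```python
import math
import struct

def rsqrt_estimate(x):
    """Crude single-precision estimate of 1/sqrt(x) for x > 0.

    Uses the well-known bit-level trick on the IEEE-754 float32 encoding;
    the estimate is only accurate to within a few percent.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = 0x5F3759DF - (bits >> 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def rsqrt_refined(x, steps=1):
    """Refine the estimate with Newton-Raphson: y <- y * (1.5 - 0.5 * x * y * y)."""
    y = rsqrt_estimate(x)
    for _ in range(steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

x = 2.0
exact = 1.0 / math.sqrt(x)
# Relative errors of the raw estimate and after one and two refinement steps.
errors = [abs(rsqrt_refined(x, s) - exact) / exact for s in range(3)]
```

Each Newton-Raphson step roughly squares the relative error, so the number of steps trades accuracy against arithmetic cost, which is one way the single- versus double-precision distinction can affect kernel tuning.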
Key findings and contributions: Our main contribution is the first in-depth study of multicore optimizations and tuning for KIFMM, which includes cross-platform evaluations of performance, scalability, and power. We show that optimization and OpenMP parallelization can improve double-precision performance by 25× on Intel’s Nehalem, 9.4× on AMD’s Barcelona, and 37.6× on Sun’s Victoria Falls. Moreover, we compare our single-precision results against the literature’s best GPU-accelerated implementation [9]. Surprisingly, we find that the most advanced multicore architecture (Nehalem) reaches parity in performance and power efficiency with NVIDIA’s most advanced GPU architecture. Our results lay solid foundations for future ultra-scalable KIFMM implementations on current and emerging high-end systems.
2 Fast Multipole Method
This section provides an overview of the fast multipole method (FMM), summarizing the key components that are relevant to this study. For more in-depth algorithmic details, see Greengard et al. [5, 16].
978-1-4244-6443-2/10/$26.00 ©2010 IEEE
