Applying the concurrent collections programming model to asynchronous parallel dense linear algebra

by Aparna Chandramowlishwaran, Kathleen Knobe, Richard Vuduc
ACM SIGPLAN Symp Principles and Practice of Parallel Programming PPoPP (2010)

Abstract

This poster is a case study on the application of a novel programming model, called Concurrent Collections (CnC), to the implementation of an asynchronous-parallel algorithm for computing the Cholesky factorization of dense matrices. In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. We demonstrate the performance potential of CnC in this poster, by showing that our Cholesky implementation nearly matches or exceeds competing vendor-tuned codes and alternative programming models. We conclude that the CnC model is well-suited for expressing asynchronous-parallel algorithms on emerging multicore systems.

Aparna Chandramowlishwaran
Georgia Institute of Technology
aparna@gatech.edu
Kathleen Knobe
Intel Corporation
kath.knobe@intel.com
Richard Vuduc
Georgia Institute of Technology
richie@cc.gatech.edu
Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming
General Terms: Algorithms, Performance
1. Introduction
Motivated by multicore and future manycore architectures that re-
quire fine-grained thread- or task-level parallelism, researchers in
the dense linear algebra community have been developing novel
asynchronous-parallel algorithms [2]. In contrast to prior classical
bulk-synchronous variants of such algorithms, these asynchronous
variants (a) are naturally suited to cores with relatively smaller
cache or local-store memories, and (b) reduce the degree of syn-
chronization, whose cost may reasonably be expected to increase
with increasing core counts. The current trend in research in this
area is the pursuit of higher-level programming models to better
support asynchronous algorithms, primarily using pragma and spe-
cialized library-based approaches [5–7].
In this poster, we study the feasibility of a novel general-
purpose parallel programming model, called Concurrent Collec-
tions (CnC) [1], for dense linear algebra. We show that asyn-
chronous variants of Cholesky factorization can be expressed in
CnC, and when coupled with a well-tuned BLAS, closely match
or sometimes exceed the performance and scalability of the Intel
Math Kernel Library (MKL) implementation. Our achieved per-
formance also compares favorably to PLASMA, a state-of-the-art
domain-specific library-based approach.
Copyright is held by the author/owner(s).
PPoPP’10, January 9–14, 2010, Bangalore, India.
ACM 978-1-60558-708-0/10/01.
2. Overview of CnC
CnC [1, 3, 4] is a novel programming model that separates the
specification of a computation from the expression of its
parallelism. The goal is to separate the task of a domain expert,
who is responsible for designing the computation, from the tasks of
a parallelization and tuning expert (or a compiler), who identifies
the parallelism, performs scheduling and distribution, and manages
communication and synchronization. This section provides an
overview of CnC concepts using a simple linear algebra example.
Computation specification. Consider the dense outer product
computation, $Z \leftarrow x \cdot y^T$, where $x$ and $y$ are two
column vectors and $Z$ is a matrix of appropriate dimension.
Algorithmically, we compute $Z_{i,j} \leftarrow x_i \cdot y_j$ for
all pairs $(x_i, y_j)$.
The domain expert specifies the computation in a form that can
be represented by a graph, as shown in Figure 1. This graph has 3
kinds of nodes: computational steps, data items, and control tags.
Directed edges show producer-consumer relationships among these
nodes. The graph is specified by the programmer using a formal
textual representation.
Figure 1. CnC graphical representation of the outer product
operation, $Z \leftarrow x \cdot y^T$.
A step is the basic unit of execution, which for the outer product
is pairwise element multiplication. We consider this very fine
granularity for example only, as in practice one might wish to
choose a larger grain, such as a block. The oval in the figure is a
step collection, which statically represents the set (collection) of
dynamic instances of these multiplications.
Data is represented using item collections. In this example,
x, y, and Z are the three item collections, denoted by boxes.
These items serve as the basic unit of storage, communication, and
synchronization.
Each instance of a step or item has a unique application-specific
identifier, or tag. For the outer product, it is natural to use
element indices as tags. We denote the tag for $x$ by
$\langle i \rangle$, $y$ by $\langle j \rangle$, and $Z$ by
$\langle i, j \rangle$. Tag collections (also called tag spaces)
specify exactly which instances of a step will execute. A step
collection is associated with exactly one tag collection; a step
instance executes only if a matching tag instance exists. For the
outer product, we show the tag space by a triangle and denote it by
$\langle i, j \rangle$. For instance, only if the tag collection has
$\langle 3, 10 \rangle$ does the corresponding pairwise multiply for
$Z_{3,10} \leftarrow x_3 \cdot y_{10}$ execute. We say that a tag
collection prescribes a step collection, and show that visually with
a dashed undirected edge. Importantly, tags indicate whether a step
will execute, but nothing about when it executes. This distinction
shows in part how CnC separates scheduling decisions from the
computation’s specification.
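The way a tag collection prescribes step instances can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Intel CnC API: the names `x_items`, `y_items`, `z_items`, `tags`, and `multiply_step` are hypothetical, and item and tag collections are modeled as plain dictionaries and sets.

```python
# Toy model of the outer product graph; not the Intel CnC API.
# Item collections are dicts keyed by tag; the tag collection is a set.
x_items = {1: 2.0, 2: 3.0, 3: 5.0}   # tag <i> -> x_i
y_items = {1: 10.0, 2: 20.0}         # tag <j> -> y_j
z_items = {}                         # tag <i, j> -> Z_{i,j}

# Only these step instances are prescribed; <3, 2> is deliberately absent.
tags = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)}

def multiply_step(tag):
    """One step instance: get x_i and y_j, put Z_{i,j}."""
    i, j = tag
    z_items[tag] = x_items[i] * y_items[j]

# Tags say whether a step runs, not when: any iteration order is valid.
for tag in tags:
    multiply_step(tag)

print(sorted(z_items.items()))
```

Because no tag `(3, 2)` exists, no product is computed for that pair, even though both input items are available.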
Though not shown here, a step may produce tags. In this way,
a step may control what other steps execute. This facility is part
of what makes CnC a more flexible and general model than, say, a
pure streaming language.
Lastly, the figure contains “squiggly” lines, which indicate that an
item or tag comes from or goes to the environment, that is, the
external code that invokes this computation. For the outer product,
the environment provides the data items and control tags.
Semantics and execution. If a step executes and produces an
item or a tag, that item or tag becomes available. If a tag prescribing
a step becomes available, then the step is prescribed. If all items
for a step are available, the step becomes inputs-available. If a
step is both inputs-available and prescribed, then it is enabled and
may execute. The program terminates when no step is currently
executing and no unexecuted step is enabled. A valid program
termination occurs when a program terminates and all prescribed
steps have been executed.
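These rules can be illustrated with a small sequential interpreter. It is a sketch only, assuming a hypothetical `run` function and a made-up two-step graph (`double` and `inc`); a real CnC runtime executes enabled steps in parallel and is not structured this way.

```python
# Toy interpreter for the execution rules above; not a real CnC runtime.
def run(steps, initial_items, initial_tags):
    """steps maps a step name to (inputs_fn, body_fn).

    inputs_fn(tag) returns the item keys the step instance reads.
    body_fn(tag, items) returns (new_items, new_tags), where new_tags
    is a list of (step name, tag) pairs.
    Returns (items, executed, valid_termination).
    """
    items = dict(initial_items)        # items that are available
    prescribed = set(initial_tags)     # (step name, tag) pairs
    executed = set()
    progress = True
    while progress:
        progress = False
        for inst in list(prescribed - executed):
            name, tag = inst
            inputs_fn, body_fn = steps[name]
            # inputs-available: every item the instance reads exists
            if all(key in items for key in inputs_fn(tag)):
                # prescribed and inputs-available, hence enabled
                new_items, new_tags = body_fn(tag, items)
                items.update(new_items)
                prescribed.update(new_tags)
                executed.add(inst)
                progress = True
    # Valid termination: every prescribed step has been executed.
    return items, executed, prescribed == executed

steps = {
    # "double" puts an item and also produces a tag that prescribes "inc".
    "double": (lambda t: [("in", t)],
               lambda t, it: ({("mid", t): 2 * it[("in", t)]}, [("inc", t)])),
    "inc": (lambda t: [("mid", t)],
            lambda t, it: ({("out", t): it[("mid", t)] + 1}, [])),
}

items, executed, valid = run(steps, {("in", 0): 5, ("in", 1): 7},
                             [("double", 0), ("double", 1)])
print(items[("out", 0)], items[("out", 1)], valid)

# A prescribed step whose input is never produced: the program still
# terminates, but the termination is not valid.
_, _, stuck_valid = run(steps, {}, [("inc", 7)])
print(stuck_valid)
```

The second call shows the difference between termination and valid termination: the `inc` instance is prescribed but never becomes inputs-available, so it is never enabled.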
The CnC model permits many run-time system designs, in-
cluding those for distributed memory systems using MPI, as well
as shared memory versions [1]. We specifically evaluate the Intel
Linux CnC (v0.3) [3], which is based on Intel’s Threading Build-
ing Blocks.
3. Tiled Cholesky in CnC
A Cholesky factorization of a symmetric positive definite matrix
$B$ is the product $L \cdot L^T$, where $L$ is a (lower) triangular matrix.
We specifically consider the tiled Cholesky algorithm of Buttari,
et al. [2]. This algorithm is based on decomposing B into blocks (or
tiles), and computes L in an asynchronous-parallel manner suitable
for multicore hierarchical memory platforms.
The tiled algorithm consists of three steps: the conventional se-
quential Cholesky, triangular solve, and the symmetric rank-k up-
date. These steps can be overlapped with one another after ini-
tial factorization of a single block, resulting in an asynchronous-
parallel approach. There is also abundant data parallelism within
each of these steps.
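The structure of the three steps can be sketched in plain Python as follows. This is a sequential illustration under stated assumptions, not the authors' CnC code: the function names `potrf`, `trsm`, and `update` are borrowed loosely from BLAS/LAPACK naming, tiles are small nested lists, and a single `update` routine is applied to both diagonal and off-diagonal tiles, whereas a tuned implementation would call vendor BLAS routines instead.

```python
import math

def potrf(a):
    """Unblocked Cholesky of one tile: lower-triangular l with l * l^T = a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        l[j][j] = math.sqrt(a[j][j] - sum(l[j][p] ** 2 for p in range(j)))
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][p] * l[j][p] for p in range(j))
            l[i][j] = s / l[j][j]
    return l

def trsm(l, a):
    """Triangular solve: x with x * l^T = a, where l is lower triangular."""
    n = len(l)
    x = [[0.0] * n for _ in a]
    for r, row in enumerate(a):
        for c in range(n):
            s = row[c] - sum(x[r][p] * l[c][p] for p in range(c))
            x[r][c] = s / l[c][c]
    return x

def update(c, a, b):
    """Rank-k update of one tile: c - a * b^T."""
    k = len(a[0])
    return [[c[i][j] - sum(a[i][p] * b[j][p] for p in range(k))
             for j in range(len(b))] for i in range(len(a))]

def tiled_cholesky(t):
    """Factor the lower tiles of the nt x nt tile grid t in place."""
    nt = len(t)
    for k in range(nt):
        t[k][k] = potrf(t[k][k])                  # sequential Cholesky
        for i in range(k + 1, nt):
            t[i][k] = trsm(t[k][k], t[i][k])      # triangular solve
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):         # trailing update
                t[i][j] = update(t[i][j], t[i][k], t[j][k])
    return t

# Build B = L * L^T from a known L, split it into 2 x 2 tiles, and factor.
L_true = [[2.0, 0.0, 0.0, 0.0], [1.0, 3.0, 0.0, 0.0],
          [4.0, 1.0, 2.0, 0.0], [2.0, 5.0, 1.0, 3.0]]
B = [[sum(L_true[i][p] * L_true[j][p] for p in range(4)) for j in range(4)]
     for i in range(4)]
b = 2
tiles = [[[row[cj * b:(cj + 1) * b] for row in B[ri * b:(ri + 1) * b]]
          for cj in range(2)] for ri in range(2)]
tiled_cholesky(tiles)
print(tiles[0][0], tiles[1][0], tiles[1][1])
```

In this sketch the loops run in a fixed order. In an asynchronous-parallel version, each call would instead become a task that may start as soon as the tiles it reads are ready, which is what allows the steps to overlap. Only the lower tiles are overwritten with the factor; the upper tile `tiles[0][1]` is left untouched.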
The asynchronous behavior maps naturally to the CnC con-
structs seen in Section 2. CnC translates the textual representation
indicating the graph into C++ code containing stubs for the pro-
grammer to fill in. That is, at this point, all the programmer does
is input the appropriate tags and data items along with the serial
logic of the step. For the serial step implementation, we call tuned
sequential vendor BLAS routines. This allows us to couple CnC
with an optimized serial library to obtain an efficient parallel asyn-
chronous implementation with minimal programming effort.
We evaluate this CnC implementation on a dual-socket, quad-core
2.8 GHz Intel Xeon X5560 (Nehalem) system. Figure 2 gives the best
parallel performance achieved (in GFlop/s) for various matrix
sizes. We use a theoretical flop count of $n^3/3$ when reporting
performance.
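As a small illustration of how a rate follows from this flop count, the conversion can be written as below. The function name `cholesky_gflops` and the timing in the example are hypothetical and are not measurements from this study.

```python
def cholesky_gflops(n, seconds):
    """Rate in GFlop/s using the n^3/3 flop count."""
    return (n ** 3 / 3) / seconds / 1e9

# Hypothetical timing, for illustration only: n = 6,000 factored in 2 s.
rate = cholesky_gflops(6000, 2.0)
print(round(rate, 1))
```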
Our performance baseline is sequential MKL, which runs at about
10 GFlop/s. We also compare with several other implementations of
the Cholesky algorithm. The OpenMP-parallel approach is
$2.8\times$ faster than the baseline; recursive Cilk++ (v1.0.3) with
sequential MKL, and ScaLAPACK (v1.8.0) and MPI (MPICH2 v1.0.8 with
the Nemesis device), are an additional 11% faster. Compared to
these implementations, our CnC code delivers very high performance,
with a speedup of nearly $7.3\times$ on 8 cores. We also achieve
more than 85% of the theoretical peak performance for the largest
matrix size ($n = 10{,}000$), and match the performance of the tuned
PLASMA (v2.0.0) and Intel MKL (v10.2) implementations.
Figure 2. Performance summary of double-precision Cholesky
factorization on a dual-socket, quad-core Intel Nehalem system.
4. Conclusions and Future Work
The CnC model complements existing approaches for express-
ing and scheduling asynchronous-parallel computations, providing
clean abstractions that enable a variety of control and data flow con-
structs to be expressed in a way that is both natural and amenable
to effective parallelization. We can match or exceed a highly-tuned
vendor library for Cholesky factorization, suggesting the promise
of CnC as a platform for implementing more asynchronous dense
linear algebra kernels, and also for entirely different computation
domains. We are currently working on a generalized symmetric
dense eigensolver in CnC. Our preliminary results indicate a
$2.6\times$ speedup on the Nehalem system when compared to Intel’s
MKL implementation, and the performance is within 10% of the
theoretically optimal schedule.
References
[1] Z. Budimlić, A. Chandramowlishwaran, K. Knobe, G. Lowney,
V. Sarkar, and L. Treggiari. Multi-core implementations of the
Concurrent Collections programming model. In Proc. Workshop on
Compilers for Parallel Computing (CPC), January 2009.
[2] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel
tiled linear algebra algorithms for multicore architectures. Technical
Report UT-CS-07-600 (LAPACK Working Note 191), University of
Tennessee Knoxville, September 2007.
[3] Intel® Concurrent Collections for C++.
http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/, 2009.
[4] K. Knobe. Ease of use with Concurrent Collections (CnC). In
Proc. USENIX Workshop on Hot Topics in Parallelism (HotPar), March
2009.
[5] H. Ltaief, J. Kurzak, and J. Dongarra. Scheduling two-sided
transformations using algorithms-by-tiles on multicore architectures.
Technical Report UT-CS-09-637 (LAPACK Working Note 214),
University of Tennessee Knoxville, February 2009.
[6] J. M. Perez, R. M. Badia, and J. Labarta. A dependency-aware
task-based programming environment for multicore architectures. In
Proc. IEEE Int’l. Conf. Cluster Computing (CLUSTER), pages 142–151,
September 2008.
[7] E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn.
SuperMatrix out-of-order scheduling of matrix operations for SMP and
multi-core architectures. In Proc. ACM Symp. Parallelism in Algorithms
and Architectures (SPAA), pages 116–125, June 2007.
