Idempotent work stealing
- ISSN: 03621340
- ISBN: 9781605583976
- DOI: 10.1145/1594835.1504186
Abstract
Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads. The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: i) the application already explicitly checks that no work is repeated or ii) the application can tolerate repeated work. In this paper, we introduce idempotent work tealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once-instead of exactly once. On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations. We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work stealing algorithms, the THE Cilk and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.
Idempotent work stealing
Maged M. Michael
IBM Thomas J. Watson Research Center
magedm@us.ibm.com
Martin T. Vechev
IBM Thomas J. Watson Research Center
mtvechev@us.ibm.com
Vijay A. Saraswat
IBM Thomas J. Watson Research Center
vsaraswa@us.ibm.com
Abstract
Load balancing is a technique which allows efficient parallelization
of irregular workloads, and a key component of many applications
and parallelizing runtimes. Work-stealing is a popular technique for
implementing load balancing, where each parallel thread maintains
its own work set of items and occasionally steals items from the
sets of other threads.
The conventional semantics of work stealing guarantee that
each inserted task is eventually extracted exactly once. However,
correctness of a wide class of applications allows for relaxed se-
mantics, because either: i) the application already explicitly checks
that no work is repeated or ii) the application can tolerate repeated
work.
In this paper, we introduce idempotent work stealing,and
present several new algorithms that exploit the relaxed semantics
to deliver better performance. The semantics of the new algorithms
guarantee that each inserted task is eventually extracted at least
once–instead of exactly once.
On mainstream processors, algorithms for conventional work
stealing require special atomic instructions or store-load memory
ordering fence instructions in the owner’s critical path operations.
In general, these instructions are substantially slower than regular
memory access instructions. By exploiting the relaxed semantics,
our algorithms avoid these instructions in the owner’s operations.
We evaluated our algorithms using common graph problems and
micro-benchmarks and compared them to well-known conventional
work stealing algorithms, the THE Cilk and Chase-Lev algorithms.
We found that our best algorithm (with LIFO extraction) outper-
forms existing algorithms in nearly all cases, and often by signifi-
cant margins.
Categories and Subject Descriptors: D.1.3 [Programming Tech-
niques]: Concurrent Programming; D.3.3 [Programming Lan-
guages]: Language Constructs and Features—concurrent program-
ming structures; D.4.1 [Operating Systems]: Process Management—
concurrency, scheduling, synchronization, threads.
General Terms: Algorithms, Management, Performance.
1. Introduction
Statically parallelizing applications with irregular workloads is a
very challenging task. The key problem in trying to come up with
a scalable static algorithmic solution is that the amount of available
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
PPoPP’09, February 14–18, 2009, Raleigh, North Carolina, USA.
Copyright c© 2009 ACM 978-1-60558-397-6/09/02. . . $5.00
parallelism can change dramatically from one invocation of the al-
gorithm to another. One answer to this challenge is the well-known
dynamic technique of load balancing. Load balancing works by dy-
namically distributing the work to each process. It is a key tech-
nique used in many runtimes for parallel languages such as Cilk [4]
and X10 [5]. It is also a core component of parallel garbage col-
lectors [7], now a part of most modern virtual machines. Increased
proliferation of load balancing techniques and their central place
in most parallelizing systems dictate the need for high-performing
load balancing algorithms.
Work-stealing is a technique that implements load balancing.
Effectively, each thread maintains its own set of tasks. The owner
thread stores and takes items from that set. Typically, when there
are no more tasks in the set (the owner thread has nothing more
to do), to keep busy, the thread can steal work items from other
threads. Hence, in this scheme, only the owner thread can add tasks
to its set, but all threads (including the owner) can take items from
the owner’s set.
This working set of items (called a work stealing queue from
now on) supports three main operations: put and take which are
used only by the queue’s owner to insert and extract tasks, and
steal which is used by other threads to steal work.
Current algorithms for work stealing queues comply with the
following semantics: each inserted task is eventually extracted–by
the owner thread or other threads–exactly once. However, these se-
mantics are too restrictive for a wide range of applications deal-
ing with irregular computation patterns. Sample domains include:
parallel garbage collection, fixed point computations in program
analysis, constraint solvers (e.g. SAT solvers), state space search
exploration in model checking as well as integer and mixed pro-
gramming solvers.
The key observation is that the correctness invariants of these
applications allow for a relaxation of the traditional work stealing
semantics. The fundamental reason is that in these problems: i) the
application already ensures that no work is repeated, for example
by checking whether a task is completed, or ii) the application
tolerates repeatable work.
Informally, the relaxed semantics state that each inserted task
should be eventually extracted at least once–instead of exactly once
as it is with the conventional semantics. We exploit this invariant
relaxation and introduce idempotent work stealing.Wepresent
several new algorithms that exploit the relaxed semantics to deliver
better performance. Note that even with these relaxed semantics,
subtle issues need to be handled in order to ensure correct and
efficient operation. For example, the algorithms must guarantee that
no tasks are lost and all extracted tasks contain valid and consistent
information while at the same time avoiding the use of expensive
synchronization instructions in the owner’s operations: put and
take.
45
work stealing queues require store-load memory ordering fence in-
structions in the critical path of the owner’s operations [1, 6, 8, 10,
11]. A store-load fence prevents loads from being executed before
the completion of stores to independent locations where the stores
appear earlier in program order. In general, special atomic instruc-
tions and store-load fence instructions are substantially slower than
regular instructions. Our new algorithms are designed to optimize
the owner’s operations by avoiding the high overheads of these
instruction in the owner’s operations. That is, in our algorithms,
unlike existing algorithms, owner operations avoid using special
atomic instructions and expensive store-load fence instructions.
We have evaluated ours and existing state-of-the-art algorithms
with both microbenchmarks and representative non-trivial graph
applications whose correctness invariants allow the usage of re-
laxed work stealing semantics. In particular, we performed experi-
mental evaluation on several fundamental graph problems such as
transitive closure and spanning tree computation for various graph
types and sizes. The results indicate performance gains of up to 5x
on microbenchmarks and up to 3x on graph applications. On graph
applications, gains of 40% are common.
The contributions of this paper are the following:
•
Introducing the concept of idempotent work stealing, a useful
relaxation of the conventional semantics, applicable to a wide-
class of applications.
•
New high-performance work-stealing algorithms that adhere to
these new relaxed semantics while avoiding expensive synchro-
nization in the critical path of the owner’s operations.
•
Experimental evaluation of our new and existing state-of-the-
art algorithms. The results indicate that our algorithms often
significantly outperform existing state-of-the-art algorithms.
The rest of the paper is organized as follows. In Section 2, we
discuss related work and atomic and fence instructions. Section 3
describes the new algorithms in detail. Section 4 presents our exper-
imental performance results. We conclude the paper with Section 5.
2. Background
2.1 Atomic and Fence Instructions
To build efficient and correct concurrent algorithms, implementa-
tions often rely on the use of special atomic and memory fence
instructions.
Atomic Instructions: Current mainstream processor architec-
tures support either Compare-and-Swap (CAS) or the pair Load-
Linked and Store-Conditional (LL/SC).
CAS was introduced on the IBM System 370 [12]. It is sup-
ported on Intel and Sun SPARC processor architectures. In its sim-
plest form, it takes three arguments: a memory location, an ex-
pected value, and a new value. If the memory location holds the
expected value, the new value is written to it, atomically. A Boolean
return value indicates whether the write occurred. If it returns true,
it is said to succeed. Otherwise, it is said to fail.
LL and SC are supported on the PowerPC architecture. LL
takes one argument: a memory location, and returns its contents.
SC takes two arguments: a memory location and a new value.
Only if the memory location has not been written since the current
thread last read it using LL, the new value is written to the memory
location, atomically. A Boolean return value indicates whether the
write occurred. Similar to CAS, SC is said to succeed or fail,
if it returns true or false, respectively. For architectural reasons,
implementations of LL/SC, do not allow the nesting or interleaving
of LL/SC pairs, and infrequently often allow SC to fail spuriously,
even if the target location was never written since the last LL. These
spurious failures happen, for example, if the thread was preempted
or a different location in the same cache line was written.
For generality, we present the algorithms in this paper using
CAS. As discussed in Section 3, if LL/SC is supported rather than
CAS, simpler implementations are possible.
Fence Instructions: Mainstream processor architectures allow
some independent memory accesses to be executed out of program
order, for the sake of hiding memory access latency and hence im-
proving performance in the general case where reordering memory
accesses has no effect on correctness. These architectures provide
fence instructions that allow programmers to enforce order among
memory accesses–that otherwise could be reordered–if such an or-
dering is required for correctness. This situation typically occurs in
the implementations of concurrent algorithms.
While processor architectures vary in their relaxation of mem-
ory access ordering, all mainstream processors require fence in-
structions for preventing loads being executed before the comple-
tion of stores to independent locations where the stores appear ear-
lier in program order (i.e., enforce store-load ordering). For archi-
tectural and historical reasons, fences that enforce store-load order
are typically quite expensive and take tens of processor cycles.
2.2 Related Work
There have been several published algorithms for work-stealing, all
adhering to the strong semantics. In a paper by Frigo et. al. [8], the
authors present the THE work stealing algorithm implemented in
the Cilk language runtime [4]. That algorithm is based on Dijk-
stra’s mutual exclusion protocol and uses locks in the steal oper-
ation and in the corner case when the queue is empty it uses locks
in take. Another algorithm presented by Arora et. al. [1] presents
a non-blocking double-ended work queue but in the worst-case re-
quires unbounded memory even if the number of waiting tasks at
any one time is bounded. The Chase-Lev algorithm [6] rectifies this
situation while preserving the performance of the Arora et. al. algo-
rithm.
The correctness of all of these algorithms depends on enforcing
the order of a write before a read in the critical path of the owner’s
take operation. For example, in the Chase-Lev algorithm [6, Fig-
ure 3], the write in line 23 must be ordered before the read in line
24; in the Cilk THE algorithm [8, Figure 4], the write in line 5 must
be ordered before the read in line 6. Similarly, for the algorithms by
Arora et. al. [1], Hendler and Shavit [11], and Hendler et. al. [10].
3. Algorithms
3.1 Overview
In this section we describe in detail our algorithms for idempotent
work stealing. The main motivation behind these algorithms is to
exploit the relaxed semantics to deliver better performance. In par-
ticular, the relaxed semantics enable us to build algorithms which
speed up the common path consisting of the owner’s operations:
put and take. We present three algorithms, each with a different
choice for how the items are extracted. In all of our algorithms, the
owner inserts new tasks at the tail of the queue. In the first algorithm
(idempotent LIFO) tasks are always extracted from the tail, while in
the second algorithm (idempotent FIFO) tasks are always extracted
from the head. In the third algorithm (idempotent double-ended),
the owner extracts from the tail while thieves extract from the head
of the queue. We use the term queue loosely to mean a structure
with items stored in the order in which they were inserted.
46
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime


