Efficient Work Stealing for Fine Grained Parallelism

by Karl-Filip Faxén
2010 39th International Conference on Parallel Processing (ICPP 2010)

Abstract

This paper deals with improving the performance of fine grain task parallelism. It is often either cumbersome or impossible to increase the grain size of such programs. Increasing core counts exacerbates the problem; a program that appears coarse-grained on eight cores may well look a lot more fine-grained on sixty-four. In this paper we present the direct task stack, a novel work stealing algorithm with unusually low overheads, both for creating tasks and for stealing. We compare the performance of our scheduler to Cilk++, the icc implementation of OpenMP 3.0, and the Intel TBB library on an eight core, dual socket Opteron machine. We also analyze why our techniques achieve consistent speedups over the other systems, ranging from 2-3x on many fine grained workloads to over 50x in extreme cases, and show quantitatively how each of the techniques we use contributes to the improved performance.

Available from ieeexplore.ieee.org

Karl-Filip Faxén
Swedish Institute of Computer Science, Stockholm, Sweden
kff@sics.se

I. INTRODUCTION

Nested task parallelism is gaining popularity as a programming model for multi(core)processors. Parallelism is expressed by spawning tasks that the implementation is allowed, but not mandated, to execute in parallel. Existing implementations [13], [21], [1] exhibit significant overheads for fine grain computations, forcing application programmers to implement manual cut-offs (avoid spawning a task if the expected run time is small) or manage parallelism in other ways. For instance, in a recent paper on the Intel TBB [8], the authors recommend that tasks with less than 100k cycles worth of work should be handled using a special continuation passing style API (which requires a great deal of code restructuring) to reduce overhead.
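A manual cut-off of the kind described above can be sketched in plain C. This is only an illustration under stated assumptions: the names fib_with_cutoff, fib_serial, spawn_count and the cutoff parameter are hypothetical, no task library is used, and the counter merely marks the points where a task-parallel runtime would spawn a task.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter: how many tasks a task-parallel runtime would
   have been asked to create. No real tasks are created here. */
uint64_t spawn_count = 0;

/* Plain recursive Fibonacci, used below the cut-off. */
static uint64_t fib_serial(int n)
{
    return n < 2 ? (uint64_t)n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Manual cut-off: subproblems smaller than `cutoff` run serially,
   so no task is created for them. */
uint64_t fib_with_cutoff(int n, int cutoff)
{
    assert(n >= 0);
    if (n < 2 || n < cutoff)
        return fib_serial(n);

    spawn_count++;                      /* a runtime would spawn fib(n-2) here */
    uint64_t b = fib_with_cutoff(n - 2, cutoff);
    uint64_t a = fib_with_cutoff(n - 1, cutoff);
    return a + b;
}
```

In this sketch, fib_with_cutoff(10, 2) records 88 would-be spawns, while fib_with_cutoff(10, 8) records 4. Choosing the threshold is the part that falls on the programmer.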
Not only does this put an extra burden on the programmer (predicting execution times can be difficult or, for some programs, impossible), but it also precludes exploiting fine grain parallelism.

This paper presents the direct task stack (Section III-A), a novel work stealing algorithm which virtually eliminates the overhead of task creation for tasks that are never stolen and achieves an overhead for stealing that is at most half of that of state of the art systems such as Cilk++, TBB and the icc implementation of OpenMP 3.0. We compare our work stealing scheduler Wool to these three systems and analyze the performance difference in terms of the task granularity (the average useful work per task) and load balancing granularity (the average useful work per steal) of the computations (Section IV-D).

[Figure 1 plots: left panel fib(42), right panel stress 4096 (3, 128K); series Wool, Cilk, TBB, OpenMP; axis ticks 1 to 8.]
Figure 1. Absolute speedup of fib(42) with no cutoff and relative speedup of stress (4096, 3, 128K) on Wool, Cilk++, TBB and OpenMP (icc)

    TASK_1( int, fib, int, n )
    {
      if( n<2 )
        return n;
      else {
        int a, b;
        SPAWN( fib, n-2 );
        a = CALL( fib, n-1 );
        b = JOIN( fib );
        return a+b;
      }
    }

Figure 2. A simple Fibonacci function in Wool

The effects of these two kinds of fine grainedness are illustrated in Figure 1. On the left, fib (with no cutoff) is an example of very small task granularity; it spawns a task for every 13 cycles worth of work. The graph shows how the task management overheads of current implementations outweigh the gains of parallel execution. Wool, on the other hand, achieves speedup from two processors. On the right, stress is a program that repeatedly spawns a balanced tree of tasks with a simple loop at the leaves; execution is serialized between the trees. Because the trees, including the leaves, are relatively small (less than 70k cycles), this program stresses the load balancing performance of the implementation.
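The two granularity measures defined above are simple ratios, which the following C sketch spells out. The run_stats_t structure and its field names are assumptions made for illustration, not part of Wool or of the paper's measurement tooling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters that a profiler might collect for one run. */
typedef struct {
    uint64_t useful_cycles;  /* cycles spent in application code */
    uint64_t tasks;          /* number of tasks spawned */
    uint64_t steals;         /* number of successful steals */
} run_stats_t;

/* Task granularity: average useful work per task, in cycles.
   Returns 0.0 by convention when no task was spawned. */
double task_granularity(const run_stats_t *s)
{
    assert(s != NULL);
    return s->tasks ? (double)s->useful_cycles / (double)s->tasks : 0.0;
}

/* Load balancing granularity: average useful work per steal, in cycles.
   Returns 0.0 by convention when nothing was stolen. */
double load_balancing_granularity(const run_stats_t *s)
{
    assert(s != NULL);
    return s->steals ? (double)s->useful_cycles / (double)s->steals : 0.0;
}
```

For example, a made-up run with 1300 useful cycles, 100 tasks and 4 steals has a task granularity of 13 cycles and a load balancing granularity of 325 cycles. A program can be fine grained in one measure and not in the other, which is why the two are reported separately.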
For some systems, the load balancing overhead is large enough that adding processors makes performance worse.

Figure 2 shows the Wool version of the Fibonacci function (whose performance is shown in Figure 1). SPAWN creates a task and adds it to the local work pool of the spawning processor, which then does an ordinary recursive CALL, after which it JOINs with the spawned computation. The join does not terminate until the joined task has terminated; this is the main form of synchronization in task parallel models such as …

2010 39th International Conference on Parallel Processing. 0190-3918/10 $26.00 © 2010 IEEE. DOI 10.1109/ICPP.2010.39
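The SPAWN, CALL and JOIN sequence described above can be modelled with a single worker and an explicit stack standing in for the local work pool. This is a simplified sketch and not the direct task stack algorithm or Wool's implementation: there are no thieves and no synchronization, and the names task_t, pool, pool_top, spawn_fib, join_fib and fib_model are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_CAPACITY 64

/* Hypothetical task descriptor: only the argument of a pending fib call. */
typedef struct { int arg; } task_t;

static task_t pool[POOL_CAPACITY];  /* local work pool of the one worker */
static int pool_top = 0;            /* number of spawned, unjoined tasks */

uint64_t fib_model(int n);

/* SPAWN: record the task in the local pool instead of running it now. */
static void spawn_fib(int n)
{
    assert(pool_top < POOL_CAPACITY);
    pool[pool_top++].arg = n;
}

/* JOIN: take back the most recently spawned task. With no thief in this
   model the task is always still in the pool, so it runs inline. */
static uint64_t join_fib(void)
{
    assert(pool_top > 0);
    task_t t = pool[--pool_top];
    return fib_model(t.arg);
}

uint64_t fib_model(int n)
{
    if (n < 2)
        return (uint64_t)n;
    spawn_fib(n - 2);                 /* SPAWN( fib, n-2 ) */
    uint64_t a = fib_model(n - 1);    /* CALL( fib, n-1 )  */
    uint64_t b = join_fib();          /* JOIN( fib )       */
    return a + b;
}
```

Spawns and joins nest like calls and returns, so the pool behaves as a stack: the task found at a join is the one spawned by the same invocation. In a real work stealing scheduler another processor could remove the task in between, in which case the join would have to wait for the thief to finish.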
