Towards a Holistic Approach to Auto-Parallelization
Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping

Georgios Tournavitis, Zheng Wang, Björn Franke, Michael F.P. O'Boyle
Institute for Computing Systems Architecture (ICSA), School of Informatics, University of Edinburgh, Scotland, United Kingdom
gtournav@inf.ed.ac.uk, jason.wangz@ed.ac.uk, {bfranke,mob}@inf.ed.ac.uk

Abstract

Compiler-based auto-parallelization is a much studied area, yet has still not found wide-spread application. This is largely due to the poor exploitation of application parallelism, subsequently resulting in performance levels far below those which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection we overcome the limitations of static analysis, enabling us to identify more application parallelism and only rely on the user for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures. We have evaluated our parallelization strategy against the NAS and SPEC OMP benchmarks and two different multi-core platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers, but comes close to and sometimes exceeds the performance of manually parallelized codes.
On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup for the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning based parallelization for complex multi-core platforms.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - Compilers; D.1.3 [Programming Techniques]: Concurrent Programming - Parallel Programming

General Terms Experimentation, Languages, Measurement, Performance

Keywords Auto-Parallelization, Profile-Driven Parallelism Detection, Machine-Learning Based Parallelism Mapping, OpenMP

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'09, June 15-20, 2009, Dublin, Ireland. Copyright © 2009 ACM 978-1-60558-392-1/09/06...$5.00

1. Introduction

Multi-core computing systems are widely seen as the most viable means of delivering performance with increasing transistor densities (1). However, this potential cannot be realized unless the application has been well parallelized. Unfortunately, efficient parallelization of a sequential program is a challenging and error-prone task. It is generally agreed that manual code parallelization by expert programmers results in the most streamlined parallel implementation, but at the same time this is the most costly and time-consuming approach. Parallelizing compiler technology, on the other hand, has the potential to greatly reduce cost and time-to-market while ensuring formal correctness of the resulting parallel code.
Automatic parallelism extraction is certainly not a new research area (2). Progress was achieved in the 1980s and 1990s on restricted DOALL and DOACROSS loops (3, 4, 5). In fact, this research has resulted in a whole range of parallelizing research compilers, e.g. Polaris (6), SUIF-1 (7) and, more recently, Open64 (8). Complementary to the on-going work in auto-parallelization, many high-level parallel programming languages, such as Cilk-5 (9), OpenMP, StreamIt (10), UPC (11) and X10 (12), and programming models, such as Galois (14), STAPL (15) and HTA (16), have been proposed. Interactive parallelization tools (17, 18, 19, 20) provide a way to actively involve the programmer in the detection and mapping of application parallelism, but still demand great effort from the user. While these approaches make parallelism expression easier than in the past, the effort involved in discovering and mapping parallelism is still far greater than that of writing an equivalent sequential program.

This paper argues that the lack of success in auto-parallelization has occurred for two reasons. First, traditional static parallelism detection techniques are not effective in finding parallelism due to lack of information in the static source code. Second, no existing integrated approach has successfully brought together automatic parallelism discovery and portable mapping. Given that the number and type of processors of a parallel system is likely to change from one generation to the next, finding the right mapping for an application may have to be repeated many times throughout an application's lifetime, which makes automatic approaches attractive.

Approach. Our approach integrates profile-driven parallelism detection and machine-learning based mapping in a single framework. We use profiling data to extract actual control and data dependences and enhance the corresponding static analyses with dynamic information.
Subsequently, we apply a previously trained
machine-learning based prediction mechanism to each parallel loop candidate and decide if and how the parallel mapping should be performed. Finally, we generate parallel code using standard OpenMP annotations. Our approach is semi-automated, i.e. we only expect the user to finally approve those loops where parallelization is likely to be beneficial, but correctness cannot be proven conclusively.

    for (i = 0; i < nodes; i++) {
      Anext = Aindex[i];
      Alast = Aindex[i + 1];
      sum0 = A[Anext][0][0] * v[i][0] +
             A[Anext][0][1] * v[i][1] +
             A[Anext][0][2] * v[i][2];
      sum1 = ...
      Anext++;
      while (Anext < Alast) {
        col = Acol[Anext];
        sum0 += A[Anext][0][0] * v[col][0] +
                A[Anext][0][1] * v[col][1] +
                A[Anext][0][2] * v[col][2];
        sum1 += ...
        w[col][0] += A[Anext][0][0] * v[i][0] +
                     A[Anext][1][0] * v[i][1] +
                     A[Anext][2][0] * v[i][2];
        w[col][1] += ...
        Anext++;
      }
      w[i][0] += sum0;
      w[i][1] += ...
    }

Figure 1. Static analysis is challenged by sparse array reduction operations and the inner while loop in the SPEC equake benchmark.

Results. We have evaluated our parallelization strategy against the NAS and SPEC OMP benchmarks and two different multi-core platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers, but comes close to and sometimes exceeds the performance of manually parallelized codes. We show that profiling-driven analyses can detect more parallel loops than static techniques. A surprising result is that all loops classified as parallel by our technique are correctly identified as such, despite the fact that only a single, small data input is considered for parallelism detection.
Furthermore, we show that parallelism detection in isolation is not sufficient to achieve high performance, and neither are conventional mapping heuristics. Our machine-learning based mapping approach provides the adaptivity across platforms that is required for a genuinely portable parallelization strategy. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform, and a significant speedup for the Cell platform, demonstrating the potential of profile-guided machine-learning based auto-parallelization for complex multi-core platforms.

Overview. The remainder of this paper is structured as follows. We motivate our work based on simple examples in section 2. This is followed by a presentation of our parallelization framework in section 3. Our experimental methodology and results are discussed in sections 4 and 5, respectively. We establish a wider context of related work in section 6 before we summarize and conclude in section 7.

    #pragma omp for reduction(+:sum) private(d)
    for (j = 1; j <= lastcol - firstcol - 1; j++) {
      d = x[j] - r[j];
      sum = sum + d * d;
    }

Figure 2. Despite its simplicity, mapping of this parallel loop taken from the NAS cg benchmark is non-trivial and the best-performing scheme varies across platforms.

2. Motivation

Parallelism Detection. Figure 1 shows a short excerpt of the smvp function from the SPEC equake seismic wave propagation benchmark. This function implements a general-purpose sparse matrix-vector product and takes up more than 60% of the total execution time of the equake application. Conservative static analysis fails to parallelize both loops because of the sparse matrix operations with indirect array indices and the inner while loop. Profiling-based dependence analysis, in contrast, provides us with the additional information that no actual data dependence inhibits parallelization for a given sample input.
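As a rough illustration of what such a profiling run checks, the following C sketch instruments a simplified indirect update loop and reports whether any memory cell is touched by more than one iteration with at least one write. It is a minimal sketch under stated assumptions, not the implementation described in this paper: the `dep_profile` structure, all function names and the fixed `MAX_CELLS` bound are hypothetical, and the loop `w[idx[i]] += v[i]` is only a stand-in for the indirect accesses in Figure 1. Unlike the approach discussed here, the sketch does not recognise reductions; it flags every shared, written cell as a dependence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; none are taken from the paper. */
#define MAX_CELLS 64   /* assumed upper bound on profiled cells */
#define NO_ITER  (-1)  /* marks a cell that has not been touched */

typedef struct {
    int  owner[MAX_CELLS];   /* first iteration that touched the cell */
    bool shared[MAX_CELLS];  /* touched by a second, different iteration */
    bool written[MAX_CELLS]; /* written at least once */
} dep_profile;

static void profile_init(dep_profile *p)
{
    for (int c = 0; c < MAX_CELLS; c++) {
        p->owner[c] = NO_ITER;
        p->shared[c] = false;
        p->written[c] = false;
    }
}

/* Record one access to `cell` made by loop iteration `iter`. */
static void profile_access(dep_profile *p, int cell, int iter, bool is_write)
{
    assert(cell >= 0 && cell < MAX_CELLS);
    if (p->owner[cell] == NO_ITER)
        p->owner[cell] = iter;
    else if (p->owner[cell] != iter)
        p->shared[cell] = true;
    if (is_write)
        p->written[cell] = true;
}

/* A cell shared between iterations and written at least once implies a
   cross-iteration flow, anti or output dependence for this input. */
static bool profile_has_carried_dependence(const dep_profile *p)
{
    for (int c = 0; c < MAX_CELLS; c++)
        if (p->shared[c] && p->written[c])
            return true;
    return false;
}

/* Run the instrumented loop once on a sample input. Returns true if no
   cross-iteration dependence on w was observed for this input only. */
bool scatter_is_parallel_candidate(const int *idx, const double *v,
                                   double *w, int n)
{
    dep_profile prof;
    profile_init(&prof);
    for (int i = 0; i < n; i++) {
        int cell = idx[i];
        profile_access(&prof, cell, i, false); /* read of w[cell]  */
        profile_access(&prof, cell, i, true);  /* write of w[cell] */
        w[cell] += v[i];
    }
    return !profile_has_carried_dependence(&prof);
}
```

Calling `scatter_is_parallel_candidate` with an index array whose entries are all distinct reports the loop as a candidate, whereas a repeated index reports a dependence. In both cases the answer holds only for the profiled input, which is why user approval remains necessary.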
While we still cannot prove the absence of data dependences for every possible input, we can classify both loops as candidates for parallelization (reduction) and, if they are profitably parallelizable, present them to the user for approval. In this example, the user would provide the additional knowledge (and guarantee) that every col index in the inner loop is unique and, hence, accesses to w[col][0] and w[col][1], respectively, do not result in cross-iteration dependences.

This example demonstrates that static analysis is overly conservative. Profiling-based analysis, on the other hand, can provide accurate dependence information for a specific input. When the two are combined, we can select candidates for parallelization based on empirical evidence and, hence, can eventually extract more application parallelism than purely static approaches.

Mapping. Figure 2 shows a parallel reduction loop originating from the parallel NAS conjugate-gradient cg benchmark. Despite the simplicity of the code, mapping decisions are non-trivial. For example, parallel execution of this loop is not profitable for the Cell BE platform due to high communication costs between processing elements. In fact, parallel execution results in a massive slowdown over the sequential version for the Cell for any number of threads. On the Intel Xeon platform, however, parallelization can be profitable, but this depends strongly on the specific OpenMP scheduling policy. The best scheme (STATIC) results in a speedup of 2.3 over the sequential code and performs 115 times better than the worst scheme (DYNAMIC), which slows the program down to 2% of its original, sequential performance.

This example illustrates that selecting the correct mapping scheme has a significant impact on performance. However, the best mapping scheme varies not only from program to program, but also from architecture to architecture. Therefore, we need an automatic and portable solution for parallelism mapping.

3. Parallelization Framework

In this section we provide an overview and technical details of our parallelization framework.

As shown in figure 3, a sequential C program is initially extended with plain OpenMP annotations for parallel loops and reductions as a result of our profiling-based dependence analysis. In