Affine parallelization of loops with run-time dependent bounds from binaries

Abstract

An automatic parallelizer is a tool that converts serial code to parallel code. Such a tool is important because most hardware today is parallel, and manually rewriting the vast repository of serial code is tedious and error prone. We build an automatic parallelizer for binary code, i.e., a tool that converts a serial binary to a parallel binary. This is important because (i) most serial legacy code has no source code available, and (ii) it is compatible with all compilers and languages. Binary automatic parallelization techniques have been developed in the past, and researchers have presented results on small kernels from PolyBench. These techniques are a good start; however, they fall far short of parallelizing larger codes from the SPEC2006 and OMP2001 benchmark suites, which are representative of real-world codes. The main limitation of past techniques is the assumption that loop bounds are statically known when calculating loop dependencies. In larger codes, however, loop bounds are only known at run time; hence loop dependencies calculated statically are overly conservative, making binary parallelization ineffective. In this paper we present a novel algorithm that significantly enhances past techniques by guessing the most likely loop bounds using only the memory expressions present in the loop. It then inserts run-time checks to verify whether these guesses were correct; if they were, the parallel version of the loop executes, otherwise the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001, and unlike previous methods the speedups obtained from the binary are as good as those obtained from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. On the 8 affine benchmarks in these suites, the best existing binary parallelization method achieves an average speedup of 1.74X, whereas our method achieves a speedup of 3.38X. This is close to the speedup from source code of 3.15X. © 2014 Springer-Verlag.
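The run-time versioning scheme the abstract describes can be pictured with a small sketch: the parallelizer guesses, from the memory expressions in the loop, which address ranges the loop touches, emits a cheap check that those ranges are independent, and dispatches to a parallel or serial copy of the loop depending on the outcome. The function name, the overlap test, and the use of an OpenMP pragma below are illustrative assumptions, not the paper's actual generated code.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of run-time guarded loop versioning (illustrative only):
     * guess that the write region a[0..n) and the read region b[0..n)
     * do not overlap, check that guess at run time, and run the parallel
     * copy of the loop only if the guess holds. */
    void scale_add(double *a, const double *b, size_t n)
    {
        /* Cheap run-time check of the guessed (non-)dependence:
         * the two accessed address ranges must be disjoint. */
        int independent = ((uintptr_t)(a + n) <= (uintptr_t)b) ||
                          ((uintptr_t)(b + n) <= (uintptr_t)a);

        if (independent) {
            /* Guess held: iterations are independent, execute in parallel. */
            #pragma omp parallel for
            for (size_t i = 0; i < n; i++)
                a[i] = 2.0 * b[i];
        } else {
            /* Guess failed: fall back to the original serial loop. */
            for (size_t i = 0; i < n; i++)
                a[i] = 2.0 * b[i];
        }
    }

When the guess fails, the only cost paid is the check itself plus the original serial execution, which is why such speculative versioning can be applied even when dependencies cannot be resolved statically.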

Citation (APA)

Kotha, A., Anand, K., Creech, T., ElWazeer, K., Smithson, M., & Barua, R. (2014). Affine parallelization of loops with run-time dependent bounds from binaries. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8410 LNCS, pp. 554–574). Springer Verlag. https://doi.org/10.1007/978-3-642-54833-8_29
