This paper introduces PARRAY (Parallelizing ARRAYs), a programming interface that supports system-level succinct programming for heterogeneous parallel systems such as GPU clusters. Current practice in software development requires combining several low-level libraries such as Pthreads, OpenMP, CUDA and MPI, and achieving productivity and portability across different numbers and models of GPUs is hard. PARRAY extends mainstream C programming with novel array types that have the following features: (1) the dimensions of an array type are nested in a tree structure, conceptually reflecting the memory hierarchy; (2) the definition of an array type may contain references to other array types, allowing sophisticated array types to be created for parallelization; (3) threads also form arrays, which allows programming in a Single-Program-Multiple-Codeblock (SPMC) style that unifies various sophisticated communication patterns. This leads to shorter, more portable and maintainable parallel code, while the programmer retains control over the performance-related features needed for deep manual optimization. Although the source-to-source code generator only faithfully generates low-level library calls according to the type information, higher-level programming and automatic performance optimization remain possible by building libraries of subprograms on top of PARRAY. A case study on cluster FFT illustrates a simple 30-line code that outperforms Intel Cluster MKL by a factor of two on the Tianhe-1A system with 7168 Fermi GPUs and 14336 CPUs.
Citation
Chen, Y., Cui, X., & Mei, H. (2013). PARRAY: A unifying array representation for heterogeneous parallelism. In Lecture Notes in Earth System Sciences (Vol. 0, pp. 91–113). Springer International Publishing. https://doi.org/10.1007/978-3-642-16405-7_5