A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

Hamidreza Khaleghzadeh; Ravi Reddy Manumachu; Alexey Lastovetsky

Journal ArticleOPEN ACCESS

A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

IEEE Access (2020) 8 7861-7876

DOI: 10.1109/ACCESS.2019.2959905

11Citations

17Readers

Abstract

Modern HPC platforms are highly heterogeneous with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc., resulting in complexities such as resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory thereby necessitating support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of h identical nodes where each node has c heterogeneous processors. This algorithm takes as input c disycrete speed functions of cardinality m corresponding to the c heterogeneous processors. It does not make any assumptions about the shapes of these functions. Unlike load balancing algorithms, optimal solutions found by the algorithm may not load-balance an application in terms of execution time. The proposed algorithm has low time complexity of O(m2 × h + m3 × c3) unlike the state-of-the-art algorithm solving the same problem with the complexity of O(m3 × c3 × h3). We also propose an extension of the algorithm for clusters of h non-identical nodes where each node has c heterogeneous processors. We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor and a simulated homogeneous cluster of such nodes.

Author supplied keywords

Cite

CITATION STYLE

APA

Khaleghzadeh, H., Manumachu, R. R., & Lastovetsky, A. (2020). A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes. IEEE Access, 8, 7861–7876. https://doi.org/10.1109/ACCESS.2019.2959905

A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

Abstract

Author supplied keywords

Cite

Register to see more suggestions