Hybrid-grained dynamic load balanced GEMM on NUMA architectures

Xing Su; Fei Lei

Journal Article, Open Access


Electronics (Switzerland) (2018) 7(12)

DOI: 10.3390/electronics7120359


Abstract

The Basic Linear Algebra Subprograms (BLAS) are a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in BLAS. On multi-core and many-core processors, the GEMM workload is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm’s AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
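The idea of letting fast threads steal work from slow ones can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gemm_work_stealing`, the parameters `n_threads` and `block_rows`, and the per-thread queue layout are all assumptions made for illustration. Each thread starts with an equal, contiguous share of row blocks (the usual static partition) and, once its own queue is empty, takes blocks from the back of another thread's queue. Pure Python threads do not run this arithmetic in parallel, so the sketch shows only the scheduling logic, not a performance gain.

```python
import threading
from collections import deque


def gemm_work_stealing(A, B, n_threads=4, block_rows=2):
    """Compute C = A * B using per-thread queues of row blocks with stealing.

    Illustrative sketch only; names and layout are hypothetical.
    Returns (C, done_by), where done_by[t] counts blocks finished by thread t.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]

    # Fine-grained tasks: half-open row ranges [start, stop).
    blocks = [(r, min(r + block_rows, m)) for r in range(0, m, block_rows)]

    # Coarse-grained static partition: one contiguous chunk of blocks per thread.
    per_thread = -(-len(blocks) // n_threads)  # ceiling division
    queues = [deque(blocks[t * per_thread:(t + 1) * per_thread])
              for t in range(n_threads)]
    locks = [threading.Lock() for _ in range(n_threads)]
    done_by = [0] * n_threads

    def take(tid):
        # Prefer the front of the thread's own queue.
        with locks[tid]:
            if queues[tid]:
                return queues[tid].popleft()
        # Own queue is empty: steal from the back of another thread's queue.
        for offset in range(1, n_threads):
            victim = (tid + offset) % n_threads
            with locks[victim]:
                if queues[victim]:
                    return queues[victim].pop()
        return None  # no work left anywhere

    def worker(tid):
        while True:
            task = take(tid)
            if task is None:
                return
            start, stop = task
            for i in range(start, stop):
                for j in range(n):
                    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
            done_by[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, done_by
```

Calling `gemm_work_stealing(A, B)` on two nested-list matrices returns the product together with the per-thread block counts; an uneven `done_by` indicates that some threads picked up work originally assigned to others. Tasks are never re-queued, so a thread can safely exit as soon as it finds every queue empty.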

Citation (APA)

Su, X., & Lei, F. (2018). Hybrid-grained dynamic load balanced GEMM on NUMA architectures. Electronics (Switzerland), 7(12). https://doi.org/10.3390/electronics7120359
