Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically

Jie Zhao; Cédric Bastoul; Yanzhi Yi; Jiahui Hu; Wang Nie; Renwei Zhang; Zhen Geng; Chong Li; Thibaut Tachon; Zhiliang Gan

Conference Proceedings, Open Access

Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT (2022) 451-466

DOI: 10.1145/3559009.3569656

Citations: 1

Readers: 6

Abstract

Lacking a good orchestration of loop transformations, existing optimizing compilers for deploying neural networks on GPU either parallelize reductions ineffectively or miss fusion opportunities with other operators. Neural network models thus exhibit sub-optimal performance on GPU. We present a practical approach called Panamera for the effective parallelization of reductions in neural networks on GPU. Panamera first leverages loop coalescing to flatten the loop dimensions of reductions, converting all reduction operators into canonical forms eligible for the polyhedral model. Next, Panamera uses polyhedral transformations to reduce the data movement caused by unfused reductions and to perform multi-block hardware binding, which many compilers do not consider. Finally, Panamera embeds a highly optimized routine implemented using GPU atomic instructions, further improving the performance of neural network models while guaranteeing the correctness of parallel reductions. The experimental results demonstrate the effectiveness of our approach. For single operators, our code obtains mean speedups of 33.7×, 3.5×, 5.4× and 9.6× over cuDNN, CUB, TVM and Ansor, respectively. For sub-graphs, our approach outperforms cuDNN, TVM and Ansor by 9.5×, 2.6× and 2.7×, respectively. For end-to-end workloads, a tensor compiler integrated with our approach outperforms them by 122.5%, 19.3% and 15.2%.
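The overall shape of the approach (flatten the reduction loops, split the flat iteration space across several blocks, then combine per-block partial results with an atomic update) can be pictured with a small CPU-side sketch. This is only an illustration under stated assumptions, not Panamera's implementation: the function name `coalesced_atomic_sum`, the `num_blocks` parameter, and the use of Python threads with a lock as a stand-in for GPU thread blocks and atomic add instructions are all hypothetical, and the polyhedral transformations and operator fusion are not modeled.

```python
import threading


def coalesced_atomic_sum(matrix, num_blocks=4):
    """Sum all elements of a 2-D list with a flattened, block-parallel reduction."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    # Loop coalescing: the (i, j) loop nest becomes one flat loop of rows * cols iterations.
    total_iters = rows * cols

    result = [0]              # shared accumulator for the single output element
    lock = threading.Lock()   # assumption: stands in for a GPU atomic add

    def block_worker(block_id):
        # Each "block" owns one contiguous chunk of the flattened iteration space.
        chunk = (total_iters + num_blocks - 1) // num_blocks
        start = block_id * chunk
        stop = min(start + chunk, total_iters)
        partial = 0
        for k in range(start, stop):
            i, j = divmod(k, cols)   # recover the original 2-D indices
            partial += matrix[i][j]
        # One atomic-style update per block merges the partial result.
        with lock:
            result[0] += partial

    threads = [
        threading.Thread(target=block_worker, args=(b,)) for b in range(num_blocks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

For example, `coalesced_atomic_sum([[1, 2, 3], [4, 5, 6]])` returns 21 regardless of `num_blocks`. Because every block contributes through a single protected update, the order in which blocks finish does not affect the integer result, which is the correctness property the atomic routine is meant to preserve.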

Cite (APA)

Zhao, J., Bastoul, C., Yi, Y., Hu, J., Nie, W., Zhang, R., … Gan, Z. (2022). Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT (pp. 451–466). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3559009.3569656
