TurboDL: Improving the CNN Training on GPU with Fine-Grained Multi-Streaming Scheduling

Abstract

Graphics Processing Units (GPUs) have evolved into powerful co-processors for training convolutional neural networks (CNNs). Modern GPUs offer features such as concurrent kernel execution and Hyper-Q technology, but orchestrating concurrency for CNN training is challenging because it can introduce synchronization overhead and poor resource utilization. Unlike previous research, which mainly focuses on single-layer or coarse-grained optimization, we introduce a critical-path-based, asynchronous parallelization mechanism and propose an optimization technique for CNN training that jointly considers the global network architecture and GPU resource usage. The proposed methods effectively overlap synchronization with computation across streams, thereby accelerating CNN training. We have integrated our methods into Caffe. The experimental results show that Caffe integrated with our methods achieves a 1.30X speedup on average over Caffe+cuDNN, with even higher speedups for deeper, wider, and more complicated networks.
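For readers unfamiliar with the underlying mechanism the abstract refers to, the sketch below illustrates how CUDA streams and events allow independent work to run concurrently while dependencies are enforced on the device rather than by blocking the host. This is only a minimal illustration of the general multi-stream/event pattern, not the paper's TurboDL scheduler; the kernel, buffer names, and sizes are hypothetical.

```cuda
// Minimal sketch of overlapping work across CUDA streams with event-based,
// asynchronous dependencies (illustrative only; not the TurboDL scheduler).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyLayer(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // stand-in for one layer's computation
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t done0;
    cudaEventCreate(&done0);

    // Independent "layers" issued to separate streams can execute concurrently.
    dummyLayer<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
    cudaEventRecord(done0, s0);                 // mark completion of the work in s0

    dummyLayer<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

    // Fine-grained dependency: s1 waits on s0's event on the device, without a
    // host-side cudaDeviceSynchronize() that would serialize both streams.
    cudaStreamWaitEvent(s1, done0, 0);
    dummyLayer<<<(n + 255) / 256, 256, 0, s1>>>(b, n);  // would consume s0's result in a real pipeline

    cudaDeviceSynchronize();
    printf("done\n");

    cudaEventDestroy(done0);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```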

Citation (APA)

Jin, H., Wu, W., Shi, X., He, L., & Zhou, B. B. (2021). TurboDL: Improving the CNN Training on GPU with Fine-Grained Multi-Streaming Scheduling. IEEE Transactions on Computers, 70(4), 552–565. https://doi.org/10.1109/TC.2020.2990321
