Multi-stage gradient compression: Overcoming the communication bottleneck in distributed deep learning

Abstract

Due to the huge size of deep learning models and the limited bandwidth of networks, communication cost has become a major bottleneck in distributed training. Gradient compression is an effective way to relieve the pressure on bandwidth and increase the scalability of distributed training. In this paper, we propose a novel gradient compression technique, Multi-Stage Gradient Compression (MGC), with Sparsity Automatic Adjustment and Gradient Recession. These techniques divide the whole training process into three stages, each fitted with a different compression strategy. To handle compression error and preserve accuracy, we accumulate the quantization error and sparsified gradients locally with momentum correction. Our experiments show that MGC achieves compression ratios of up to 3800x without incurring accuracy loss. We compress the gradient size of ResNet-50 from 97 MB to 0.03 MB and that of AlexNet from 233 MB to 0.06 MB, and even obtain better accuracy than the baseline on GoogLeNet. Experiments also demonstrate the strong scalability of MGC.
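
The abstract describes sparsification with locally accumulated error and momentum correction. The following is a minimal sketch of that general idea (top-k sparsification with error feedback), not the paper's actual MGC algorithm; the class name, parameter values, and sparsity setting are illustrative assumptions.

```python
import numpy as np

class TopKCompressor:
    """Illustrative top-k gradient sparsifier with local error accumulation
    and momentum correction. A sketch of the general technique, not the
    paper's MGC implementation."""

    def __init__(self, shape, sparsity=0.999, momentum=0.9):
        self.sparsity = sparsity          # fraction of gradient entries dropped
        self.momentum = momentum
        self.velocity = np.zeros(shape)   # momentum-corrected accumulator
        self.residual = np.zeros(shape)   # locally accumulated (untransmitted) error

    def compress(self, grad):
        # Momentum correction: apply momentum locally before sparsifying.
        self.velocity = self.momentum * self.velocity + grad
        self.residual += self.velocity

        # Keep only the k largest-magnitude entries; the rest remain in the
        # local residual and are carried over to later iterations.
        flat = self.residual.ravel()
        k = max(1, int(flat.size * (1.0 - self.sparsity)))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        values = flat[idx].copy()

        # Clear the transmitted entries from the local accumulators.
        flat[idx] = 0.0
        self.velocity.ravel()[idx] = 0.0
        return idx, values

# Example: compress a simulated 1M-parameter gradient to ~0.1% density.
compressor = TopKCompressor(shape=(1_000_000,), sparsity=0.999)
indices, values = compressor.compress(np.random.randn(1_000_000))
print(f"sent {values.size} of 1,000,000 gradient values")
```

In a distributed setting, only the (index, value) pairs would be exchanged between workers, which is what yields the large reduction in communicated gradient size reported in the abstract.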

Citation (APA)

Lu, Q., Liu, W., Han, J., & Guo, J. (2018). Multi-stage gradient compression: Overcoming the communication bottleneck in distributed deep learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11301 LNCS, pp. 107–119). Springer Verlag. https://doi.org/10.1007/978-3-030-04167-0_10
