Multi-stage gradient compression: Overcoming the communication bottleneck in distributed deep learning

Abstract

Due to the huge size of deep learning models and the limited bandwidth of networks, communication cost has become a major bottleneck in distributed training. Gradient compression is an effective way to relieve the pressure on bandwidth and increase the scalability of distributed training. In this paper, we propose a novel gradient compression technique, Multi-Stage Gradient Compression (MGC), with Sparsity Automatic Adjustment and Gradient Recession. These techniques divide the whole training process into three stages, each fitted with a different compression strategy. To handle compression error and preserve accuracy, we accumulate the quantization error and sparsified gradients locally with momentum correction. Our experiments show that MGC achieves compression ratios of up to 3800x without incurring accuracy loss. We compress the gradient size of ResNet-50 from 97 MB to 0.03 MB and that of AlexNet from 233 MB to 0.06 MB, and even obtain better accuracy than the baseline on GoogLeNet. Experiments also demonstrate the strong scalability of MGC.
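
The abstract describes sparsification with locally accumulated error and momentum correction. The following is a minimal sketch of that general idea (top-k sparsification with error feedback), not the paper's actual MGC algorithm; the class name, parameter values, and sparsity setting are illustrative assumptions.

```python
import numpy as np

class TopKCompressor:
    """Illustrative top-k gradient sparsifier with local error accumulation
    and momentum correction. A sketch of the general technique, not the
    paper's MGC implementation."""

    def __init__(self, shape, sparsity=0.999, momentum=0.9):
        self.sparsity = sparsity          # fraction of gradient entries dropped
        self.momentum = momentum
        self.velocity = np.zeros(shape)   # momentum-corrected accumulator
        self.residual = np.zeros(shape)   # locally accumulated (untransmitted) error

    def compress(self, grad):
        # Momentum correction: apply momentum locally before sparsifying.
        self.velocity = self.momentum * self.velocity + grad
        self.residual += self.velocity

        # Keep only the k largest-magnitude entries; the rest remain in the
        # local residual and are carried over to later iterations.
        flat = self.residual.ravel()
        k = max(1, int(flat.size * (1.0 - self.sparsity)))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        values = flat[idx].copy()

        # Clear the transmitted entries from the local accumulators.
        flat[idx] = 0.0
        self.velocity.ravel()[idx] = 0.0
        return idx, values

# Example: compress a simulated 1M-parameter gradient to ~0.1% density.
compressor = TopKCompressor(shape=(1_000_000,), sparsity=0.999)
indices, values = compressor.compress(np.random.randn(1_000_000))
print(f"sent {values.size} of 1,000,000 gradient values")
```

In a distributed setting, only the (index, value) pairs would be exchanged between workers, which is what yields the large reduction in communicated gradient size reported in the abstract.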

Citation (APA)

Lu, Q., Liu, W., Han, J., & Guo, J. (2018). Multi-stage gradient compression: Overcoming the communication bottleneck in distributed deep learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11301 LNCS, pp. 107–119). Springer Verlag. https://doi.org/10.1007/978-3-030-04167-0_10
