AdaComp: Adaptive residual gradient compression for data-parallel distributed training

136 Citations
100 Readers (Mendeley users who have this article in their library)

Abstract

Highly distributed training of Deep Neural Networks (DNNs) on future compute platforms (offering 100s of TeraOps/s of computational capacity) is expected to be severely communication constrained. To overcome this limitation, new gradient compression techniques are needed that are computationally friendly, applicable to a wide variety of layers seen in Deep Neural Networks, and adaptable to variations in network architectures as well as their hyper-parameters. In this paper we introduce a novel technique, the Adaptive Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity. We show excellent results on a wide spectrum of state-of-the-art Deep Learning models in multiple domains (vision, speech, language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers (SGD with momentum, Adam) and network parameters (number of learners, minibatch size, etc.). Exploiting both sparsity and quantization, we demonstrate end-to-end compression rates of ∼200× for fully-connected and recurrent layers, and ∼40× for convolutional layers, without any noticeable degradation in model accuracies.
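The sketch below illustrates the localized residual-selection idea described in the abstract: each learner accumulates gradient residues, partitions them into local bins, and sends only the entries whose "self-adjusted" score exceeds the bin's largest residue magnitude, combining sparsity with a simple quantization of the transmitted values. The bin size, the exact send criterion, and the scale used for quantization here are illustrative assumptions, not the paper's precise specification.

```python
# Minimal NumPy sketch of AdaComp-style localized residual selection.
# Bin size, the send criterion, and the quantization scale are assumptions
# made for illustration; they may differ from the published algorithm.
import numpy as np

def adacomp_compress(grad, residue, bin_size=256):
    """Return (indices, values, new_residue) for one compression step.

    grad    : current minibatch gradient (1-D array)
    residue : accumulated residue from previous steps (same shape)
    """
    # 1. Accumulate the new gradient into the residue.
    residue = residue + grad
    # 2. "Self-adjusted" score: add the latest gradient again so parameters
    #    that are still changing are favoured for transmission.
    score = residue + grad

    n = grad.size
    send_mask = np.zeros(n, dtype=bool)
    # 3. Localized selection: within each bin, send entries whose score
    #    reaches the bin's largest accumulated-residue magnitude.
    for start in range(0, n, bin_size):
        end = min(start + bin_size, n)
        local_max = np.abs(residue[start:end]).max()
        send_mask[start:end] = np.abs(score[start:end]) >= local_max

    indices = np.nonzero(send_mask)[0]
    # 4. Quantize sent values to sign * (mean magnitude of sent residues);
    #    one simple way to pair sparsity with coarse quantization.
    scale = np.abs(residue[indices]).mean() if indices.size else 0.0
    values = np.sign(residue[indices]) * scale

    # 5. Clear the residue only for transmitted entries; unsent residues
    #    carry over and build up until they become locally significant.
    new_residue = residue.copy()
    new_residue[indices] = 0.0
    return indices, values, new_residue
```

In a data-parallel setting, each learner would exchange its (indices, values) pairs with the others (e.g., via an all-gather) and apply the aggregated sparse update; because the threshold is computed per bin rather than globally, the effective compression rate adapts automatically to local gradient activity.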

Citation (APA)

Chen, C. Y., Choi, J., Brand, D., Agrawal, A., Zhang, W., & Gopalakrishnan, K. (2018). AdaComp: Adaptive residual gradient compression for data-parallel distributed training. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (pp. 2827–2835). AAAI Press. https://doi.org/10.1609/aaai.v32i1.11728
