Combining global sparse gradients with local gradients in distributed neural network training

Alham Fikri Aji; Kenneth Heafield; Nikolay Bogoychev

Conference Proceedings, Open Access

EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (2019) 3626-3631

DOI: 10.18653/v1/d19-1373

Abstract

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to exchange only the largest gradients. However, doing so damages the gradient and degrades the model's performance. Transformer models degrade dramatically, while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node's locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
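The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the combination rule, so a plain element-wise sum is assumed here. "Largest" is taken to mean largest in magnitude, and the exchange between nodes is simulated by a loop rather than real network communication. All function names are hypothetical.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def global_sparse_gradient(local_grads, k):
    """Sum the sparsified gradients of all nodes (simulated exchange)."""
    total = [0.0] * len(local_grads[0])
    for grad in local_grads:
        for i, value in enumerate(top_k_sparsify(grad, k)):
            total[i] += value
    return total


def combine_with_local(global_sparse, local_grad):
    """Assumed combination rule: element-wise sum of the compressed global
    gradient and this node's uncompressed local gradient."""
    return [g + l for g, l in zip(global_sparse, local_grad)]


# Two simulated nodes, each exchanging only its single largest entry.
node_grads = [[0.5, -2.0, 0.1], [1.5, 0.2, -0.3]]
shared = global_sparse_gradient(node_grads, k=1)
update_on_node_0 = combine_with_local(shared, node_grads[0])
```

In this sketch each node applies a slightly different update, because its own small gradient entries, which were dropped from the exchange, are added back locally.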

Cite (APA)

Aji, A. F., Heafield, K., & Bogoychev, N. (2019). Combining global sparse gradients with local gradients in distributed neural network training. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 3626–3631). Association for Computational Linguistics. https://doi.org/10.18653/v1/d19-1373
