Accelerating asynchronous stochastic gradient descent for neural machine translation

Citations: 12 · Mendeley readers: 107

Abstract

In order to extract the best possible performance from asynchronous stochastic gradient descent (SGD), one must increase the mini-batch size and scale the learning rate accordingly. To achieve further speedup, we introduce a technique that delays gradient updates, effectively increasing the mini-batch size. Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous SGD, which degrades model convergence. We introduce local optimizers, which mitigate the stale gradient problem, and together with fine-tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.
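The abstract combines two ideas: delaying gradient updates (accumulating several mini-batches before communicating, which multiplies the effective mini-batch size) and a worker-local optimizer that keeps updating a local parameter copy between communications. The sketch below is a minimal single-process illustration of those two mechanisms on a toy linear model, not the authors' implementation; all names and hyperparameters (K, local_lr, global_lr, batch) are assumptions for illustration only.

```python
# Minimal sketch (assumed, not the paper's code) of delayed gradient updates
# plus a worker-local optimizer, on a toy NumPy linear-regression model.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ w_true + noise
X = rng.normal(size=(1024, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.01 * rng.normal(size=1024)

def grad(w, xb, yb):
    """Gradient of mean squared error for a linear model."""
    return 2.0 * xb.T @ (xb @ w - yb) / len(yb)

K = 4             # hypothetical number of delayed mini-batches per global update
local_lr = 0.05   # hypothetical learning rate of the worker-local optimizer
global_lr = 0.05  # hypothetical learning rate for the accumulated (delayed) update
batch = 32

w_global = np.zeros(8)            # parameters held by the "server"
for step in range(0, 1024 // batch, K):
    w_local = w_global.copy()     # worker pulls the current global parameters
    accumulated = np.zeros_like(w_global)
    for k in range(K):
        lo = (step + k) * batch
        xb, yb = X[lo:lo + batch], y[lo:lo + batch]
        g = grad(w_local, xb, yb)
        accumulated += g
        # Local optimizer: keep improving the local copy between communications.
        w_local -= local_lr * g
    # Delayed update: one communication carries K mini-batches' worth of
    # gradient, i.e. an effective mini-batch of K * batch examples.
    w_global -= global_lr * accumulated / K

print("distance to w_true:", np.linalg.norm(w_global - w_true))
```

In the asynchronous setting of the paper, each GPU worker would run a loop like the inner one above against a shared parameter server; the point of the local optimizer is that parameters still move between pushes, which is what mitigates the staleness introduced by delaying the global update.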

Citation (APA)

Bogoychev, N., Junczys-Dowmunt, M., Heafield, K., & Aji, A. F. (2018). Accelerating asynchronous stochastic gradient descent for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 2991–2996). Association for Computational Linguistics. https://doi.org/10.18653/v1/d18-1332
