Improving deep transformer with depth-scaled initialization and merged attention

Biao Zhang; Ivan Titov; Rico Sennrich

Conference Proceedings (Open Access)

EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (2019) 898-909

DOI: 10.18653/v1/d19-1083

Abstract

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.
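
As a rough illustration of the DS-Init idea described in the abstract, the sketch below shrinks a uniform initialization bound as layer depth grows, so that deeper layers start with lower parameter variance. The abstract does not state the formula: the 1/sqrt(l) scaling of a Glorot-style uniform bound, the hyperparameter `alpha`, and the function name `ds_init` are assumptions made for illustration, not the authors' reference implementation.

```python
import numpy as np


def ds_init(d_in, d_out, layer_depth, alpha=1.0, rng=None):
    """Sample a (d_in, d_out) weight matrix with a depth-scaled uniform bound.

    Assumed form (not given in the abstract):
        W ~ U(-gamma * alpha / sqrt(l), gamma * alpha / sqrt(l)),
        gamma = sqrt(6 / (d_in + d_out)),
    where l is the 1-based depth of the layer that owns the matrix.
    """
    if layer_depth < 1:
        raise ValueError("layer_depth must be >= 1")
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.sqrt(6.0 / (d_in + d_out))         # Glorot-style uniform bound
    bound = gamma * alpha / np.sqrt(layer_depth)  # shrink the bound with depth
    return rng.uniform(-bound, bound, size=(d_in, d_out))


# Hypothetical usage: compare the empirical standard deviation across depths.
demo_rng = np.random.default_rng(0)
for depth in (1, 4, 12):
    w = ds_init(512, 512, depth, rng=demo_rng)
    print(f"layer {depth:2d}: std = {w.std():.5f}")
```

Under this assumed form, a uniform draw on (-b, b) has variance b^2 / 3, so the parameter variance at depth l is 1/l times that of the first layer. This matches the abstract's description of lowering parameter variance at initialization to reduce the output variance of residual connections.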

Citation (APA)

Zhang, B., Titov, I., & Sennrich, R. (2019). Improving deep transformer with depth-scaled initialization and merged attention. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 898–909). Association for Computational Linguistics. https://doi.org/10.18653/v1/d19-1083

