Mixed multi-head self-attention for neural machine translation

12Citations
Citations of this article
79Readers
Mendeley users who have this article in their library.

Abstract

Recently, the Transformer becomes a state-of-the-art architecture in the filed of neural machine translation (NMT). A key point of its high-performance is the multi-head self-attention which is supposed to allow the model to independently attend to information from different representation subspaces. However, there is no explicit mechanism to ensure that different attention heads indeed capture different features, and in practice, redundancy has occurred in multiple heads. In this paper, we argue that using the same global attention in multiple heads limits multi-head self-attention’s capacity for learning distinct features. In order to improve the expressiveness of multi-head self-attention, we propose a novel Mixed Multi-Head Self-Attention (MMA) which models not only global and local attention but also forward and backward attention in different attention heads. This enables the model to learn distinct representations explicitly among multiple heads. In our experiments on both WAT17 English-Japanese as well as IWSLT14 German-English translation task, we show that, without increasing the number of parameters, our models yield consistent and significant improvements (0.9 BLEU scores on average) over the strong Transformer baseline.

Cite

CITATION STYLE

APA

Cui, H., Iida, S., Hung, P. H., Utsuro, T., & Nagata, M. (2019). Mixed multi-head self-attention for neural machine translation. In EMNLP-IJCNLP 2019 - Proceedings of the 3rd Workshop on Neural Generation and Translation (pp. 206–214). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/d19-5622

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free