Attention over heads: A multi-hop attention for neural machine translation

Abstract

In this paper, we propose a multi-hop attention for the Transformer. It refines the attention for an output symbol by integrating that of each head, and consists of two hops. The first hop attention is the scaled dot-product attention, the same attention mechanism used in the original Transformer. The second hop attention is a combination of multi-layer perceptron (MLP) attention and a head gate, which efficiently increases the complexity of the model by adding dependencies between heads. We demonstrate that the proposed multi-hop attention significantly outperforms the baseline Transformer in translation accuracy, by +0.85 BLEU points on the IWSLT-2017 German-to-English task and +2.58 BLEU points on the WMT-2017 German-to-English task. We also find that a multi-hop attention requires fewer parameters than stacking another self-attention layer, and that the proposed model converges significantly faster than the original Transformer.
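
The following is a minimal PyTorch sketch of the two-hop structure as described in the abstract: a standard per-head scaled dot-product attention (hop 1) followed by an MLP-based attention over the heads that produces a head gate (hop 2). The MLP architecture, gate normalization, and module name are assumptions for illustration, not the authors' exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHopAttention(nn.Module):
    """Sketch of multi-hop attention: per-head attention plus a head gate."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Hypothetical second-hop MLP: scores one gate value per head from the
        # concatenated head outputs, adding dependencies between heads.
        self.head_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.Tanh(), nn.Linear(d_model, n_heads)
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        B, Tq, D = query.shape

        def split(x):
            # (B, T, D) -> (B, H, T, d_head)
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        q = split(self.q_proj(query))
        k = split(self.k_proj(key))
        v = split(self.v_proj(value))

        # Hop 1: scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5    # (B, H, Tq, Tk)
        heads = F.softmax(scores, dim=-1) @ v                    # (B, H, Tq, d_head)

        # Hop 2 (assumed form): an MLP over the concatenated head outputs
        # yields a softmax gate per head, which reweights each head.
        concat = heads.transpose(1, 2).reshape(B, Tq, D)         # (B, Tq, D)
        gate = F.softmax(self.head_mlp(concat), dim=-1)          # (B, Tq, H)
        gated = heads * gate.transpose(1, 2).unsqueeze(-1)       # (B, H, Tq, d_head)

        return self.out_proj(gated.transpose(1, 2).reshape(B, Tq, D))

In this sketch the gate couples the heads for each output position, so a head's contribution can be suppressed or emphasized based on what all heads produced; the extra parameters are limited to the small head_mlp, consistent with the abstract's claim that the second hop is cheaper than stacking another self-attention layer.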

Citation (APA)

Iida, S., Kimura, R., Cui, H., Hung, P. H., Utsuro, T., & Nagata, M. (2019). Attention over heads: A multi-hop attention for neural machine translation. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop (pp. 217–222). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p19-2030
