Do transformers need deep long-range memory?

22 citations · 163 Mendeley readers

Abstract

Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL - a Transformer augmented with a long-range memory of past activations - has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which renders its state thousands of times larger than that of its RNN predecessors, yet it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6× fewer long-range memories, and that better performance can be obtained by limiting the range of attention in the lower layers of the network.
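As a rough illustration of the kind of intervention the abstract describes, the sketch below (PyTorch, not the authors' implementation) builds a toy Transformer-XL-style stack in which each layer caches a memory of past activations, and only the top layers are given a long memory length. The names (MemoryAttentionLayer, PartialMemoryTransformer, mem_len, short_len, long_len), the layer count, and all memory lengths are arbitrary placeholders chosen for the example; causal masking and relative positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class MemoryAttentionLayer(nn.Module):
    """Self-attention over [cached memory ; current segment].

    `mem_len` controls how far back this layer can attend: a small value
    restricts the layer to short-range context, a large value gives it a
    long-range memory. (Causal masking and relative positional encodings
    are omitted to keep the sketch short.)
    """

    def __init__(self, d_model: int, mem_len: int):
        super().__init__()
        self.mem_len = mem_len
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, memory):
        # Attend from the current segment to the cached memory plus the segment.
        context = torch.cat([memory, x], dim=1)
        attn_out, _ = self.attn(x, context, context, need_weights=False)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        # Cache the most recent `mem_len` activations for the next segment,
        # detached so no gradients flow across segments (as in Transformer-XL).
        new_memory = context[:, -self.mem_len:].detach()
        return x, new_memory


class PartialMemoryTransformer(nn.Module):
    """Toy stack in which only the top layers keep a long-range memory.

    The layer count (6) and memory lengths below are placeholders, not the
    configuration used in the paper.
    """

    def __init__(self, d_model: int = 64, short_len: int = 64, long_len: int = 1024):
        super().__init__()
        # Lower layers get a short attention window; only the top two layers
        # retain a long-range memory.
        mem_lens = [short_len] * 4 + [long_len] * 2
        self.layers = nn.ModuleList(
            MemoryAttentionLayer(d_model, m) for m in mem_lens
        )

    def forward(self, x, memories):
        new_memories = []
        for layer, mem in zip(self.layers, memories):
            x, mem = layer(x, mem)
            new_memories.append(mem)
        return x, new_memories


if __name__ == "__main__":
    model = PartialMemoryTransformer()
    segment = torch.randn(1, 128, 64)  # (batch, segment length, d_model)
    memories = [torch.zeros(1, layer.mem_len, 64) for layer in model.layers]
    out, memories = model(segment, memories)
    print(out.shape, [m.shape[1] for m in memories])
```

Varying which entries of mem_lens are long mirrors the interventions studied in the paper: most layers see only local context while a handful retain a long-range memory, shrinking the cached state relative to giving every layer a long memory.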

Cite (APA)

Rae, J. W., & Razavi, A. (2020). Do transformers need deep long-range memory? In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 7524–7529). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.acl-main.672
