Absolute Position Embedding Learns Sinusoid-like Waves for Attention Based on Relative Position

Abstract

Attention weights are a clue to interpreting how a Transformer-based model makes an inference. In some attention heads, the attention focuses on the neighbors of each token. This allows the output vector of each token to depend on the surrounding tokens and contributes to making the inference context-dependent. We analyze the mechanism behind this concentration of attention on nearby tokens. We show that the phenomenon emerges as follows: (1) the learned position embedding has sinusoid-like components, (2) such components are transmitted to the query and the key in the self-attention, (3) the attention head shifts the phases of the sinusoid-like components so that the attention concentrates on nearby tokens at specific relative positions. In other words, a certain type of Transformer-based model acquires the sinusoidal positional encoding to some extent on its own through Masked Language Modeling.
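
The three-step mechanism can be illustrated with a toy calculation. The sketch below is not the authors' code; the frequency omega, the phase shift phi, and the two-dimensional features are illustrative assumptions. It builds one sinusoid-like component for the queries and a phase-shifted copy for the keys; their dot product then depends only on the relative position i - j and peaks at a fixed offset, which is the concentration of attention on nearby tokens described in step (3).

import numpy as np

# Toy sketch (not the paper's implementation): one sinusoid-like component
# appears in the query and, phase-shifted by the attention head, in the key.
# The resulting score depends only on the relative position i - j.

omega = 0.3          # assumed angular frequency of the sinusoid-like component
phi = 0.6            # assumed phase shift applied by the head (~2 tokens here)
positions = np.arange(16)

def features(pos, phase=0.0):
    # One sine/cosine pair per position; a real model has many such pairs.
    return np.stack([np.sin(omega * pos + phase),
                     np.cos(omega * pos + phase)], axis=-1)

Q = features(positions)              # query side: unshifted sinusoid
K = features(positions, phase=phi)   # key side: phase-shifted sinusoid

scores = Q @ K.T                     # scores[i, j] == cos(omega*(i - j) - phi)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Away from the sequence boundary, attention peaks at j = i - 2,
# i.e. a fixed relative offset of two tokens to the left of each query.
print(np.argmax(attn, axis=-1) - positions)

With these toy values the score is cos(omega*(i - j) - phi), so the softmax mass piles up around j = i - 2; a trained head would realize the same effect with many frequency components and learned phase shifts.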

Citation (APA)

Yamamoto, Y., & Matsuzaki, T. (2023). Absolute Position Embedding Learns Sinusoid-like Waves for Attention Based on Relative Position. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 15–28). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.2
