The self-attention mechanism incurs a substantial computational cost despite its success in the Transformer. This cost increases linearly with the embedding dimension, owing to the dot-product operation that computes token similarities in vector space. To tackle this problem, we propose a novel efficient self-attention mechanism (Pro-Attention) that computes attention scores through distribution matching in probability space. To this end, we assume that each token has its own probability distribution and regard each component of a token vector as a sample from that distribution. We then estimate the statistics of each token-specific distribution from these samples and obtain token similarities using the Kullback-Leibler divergence. According to our time complexity analysis, the computational cost is markedly reduced because the time complexity is independent of the feature dimension. Our method achieves competitive performance on machine translation and language modeling benchmarks, including the IWSLT'14 De-En, WMT'14 En-De, WMT'14 En-Fr, and WikiText-103 datasets. Moreover, our model maintains this performance while reducing the FLOPs of the self-attention mechanism by up to 87% compared to the baseline Transformer. In particular, the efficiency improvement is most pronounced on large training datasets.
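The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the general idea under explicit assumptions that go beyond the text: each token's d vector components are treated as i.i.d. samples from a token-specific univariate Gaussian, the estimated statistics are the empirical mean and variance, and attention weights come from a softmax over negative pairwise KL divergences. The function name pro_attention_scores and every modelling choice here are illustrative guesses, not the paper's actual formulation (which may use a different distribution family, statistics, or KL direction).

```python
import numpy as np

def pro_attention_scores(X):
    """Sketch of distribution-matching attention scores.

    X: (n_tokens, d) token vectors. Returns an (n, n) row-stochastic
    attention-weight matrix. Assumes a Gaussian per token (illustrative only).
    """
    # Estimate per-token Gaussian statistics from the d components: O(n * d).
    mu = X.mean(axis=1)            # (n,)
    var = X.var(axis=1) + 1e-6     # (n,), epsilon for numerical stability

    # Closed-form KL divergence between univariate Gaussians:
    # KL(N_i || N_j) = 0.5*log(var_j/var_i) + (var_i + (mu_i - mu_j)^2)/(2*var_j) - 0.5
    # Computing all pairs costs O(n^2), independent of the feature dimension d.
    mu_i, mu_j = mu[:, None], mu[None, :]
    var_i, var_j = var[:, None], var[None, :]
    kl = 0.5 * np.log(var_j / var_i) + (var_i + (mu_i - mu_j) ** 2) / (2.0 * var_j) - 0.5

    # Smaller divergence -> more similar tokens -> larger attention weight.
    scores = -kl
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    # Example: attention weights for 4 tokens with 8-dimensional features.
    rng = np.random.default_rng(0)
    A = pro_attention_scores(rng.normal(size=(4, 8)))
    print(A.shape, A.sum(axis=1))  # (4, 4), rows sum to 1
```

Under these assumptions, estimating the statistics costs O(n·d), while the pairwise score computation costs O(n²) with no dependence on d, which is consistent with the abstract's claim that the score computation is independent of the feature dimension; aggregating value vectors with the resulting weights would still be a separate step.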
Bae, J., Cheon, B. D., & Kim, H. Y. (2022). Pro-Attention: Efficient Probability Distribution Matching-Based Attention Through Feature Space Conversion. IEEE Access, 10, 131192–131201. https://doi.org/10.1109/ACCESS.2022.3229055