The self-attention mechanism incurs a substantial computational cost despite its success in the Transformer. This cost increases linearly with the embedding dimension, owing to the dot-product operation that computes token similarities in vector space. To tackle this problem, we propose a novel efficient self-attention mechanism (Pro-Attention) that computes attention scores through distribution matching in probability space. To this end, we assume that each token has its own probability distribution and regard each component of a token vector as a sample from that distribution. We then estimate the statistics of each token-specific distribution from these samples and obtain token similarities using the Kullback-Leibler divergence. According to our time complexity analysis, the computational cost is markedly reduced because the time complexity is independent of the feature dimension. Our method achieves competitive performance on machine translation and language modeling benchmarks, including the IWSLT'14 De-En, WMT'14 En-De, WMT'14 En-Fr, and WikiText-103 datasets. Moreover, our model maintains this performance while reducing the FLOPs of the self-attention mechanism by up to 87% compared to the baseline Transformer. In particular, the efficiency improvement is most pronounced on large training datasets.
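The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the general idea under explicit assumptions that go beyond the text: each token's d vector components are treated as i.i.d. samples from a token-specific univariate Gaussian, the estimated statistics are the empirical mean and variance, and attention weights come from a softmax over negative pairwise KL divergences. The function name pro_attention_scores and every modelling choice here are illustrative guesses, not the paper's actual formulation (which may use a different distribution family, statistics, or KL direction).

```python
import numpy as np

def pro_attention_scores(X):
    """Sketch of distribution-matching attention scores.

    X: (n_tokens, d) token vectors. Returns an (n, n) row-stochastic
    attention-weight matrix. Assumes a Gaussian per token (illustrative only).
    """
    # Estimate per-token Gaussian statistics from the d components: O(n * d).
    mu = X.mean(axis=1)            # (n,)
    var = X.var(axis=1) + 1e-6     # (n,), epsilon for numerical stability

    # Closed-form KL divergence between univariate Gaussians:
    # KL(N_i || N_j) = 0.5*log(var_j/var_i) + (var_i + (mu_i - mu_j)^2)/(2*var_j) - 0.5
    # Computing all pairs costs O(n^2), independent of the feature dimension d.
    mu_i, mu_j = mu[:, None], mu[None, :]
    var_i, var_j = var[:, None], var[None, :]
    kl = 0.5 * np.log(var_j / var_i) + (var_i + (mu_i - mu_j) ** 2) / (2.0 * var_j) - 0.5

    # Smaller divergence -> more similar tokens -> larger attention weight.
    scores = -kl
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    # Example: attention weights for 4 tokens with 8-dimensional features.
    rng = np.random.default_rng(0)
    A = pro_attention_scores(rng.normal(size=(4, 8)))
    print(A.shape, A.sum(axis=1))  # (4, 4), rows sum to 1
```

Under these assumptions, estimating the statistics costs O(n·d), while the pairwise score computation costs O(n²) with no dependence on d, which is consistent with the abstract's claim that the score computation is independent of the feature dimension; aggregating value vectors with the resulting weights would still be a separate step.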
Bae, J., Cheon, B. D., & Kim, H. Y. (2022). Pro-Attention: Efficient Probability Distribution Matching-Based Attention Through Feature Space Conversion. IEEE Access, 10, 131192–131201. https://doi.org/10.1109/ACCESS.2022.3229055