Abstract
Transformer neural networks have demonstrated leading performance in many applications spanning language understanding, image processing, and generative modeling. Despite this impressive performance, processing long sequences with Transformers is expensive due to the quadratic computational complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm-architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight detector with the Transformer model to accurately detect and omit weak connections at runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention-detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6x and 4.5x performance speedups, and orders-of-magnitude energy-efficiency improvements, over GPU and customized hardware, respectively.
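To make the detect-and-omit idea concrete, the sketch below approximates attention scores with cheap low-rank detector projections and keeps only the strongest connections per query before running full attention. This is a minimal PyTorch illustration under stated assumptions, not the paper's actual detector: the function name, the low-rank projection matrices `detector_wq`/`detector_wk`, the `keep_ratio` parameter, and the top-k selection rule are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def detect_and_omit_attention(q, k, v, detector_wq, detector_wk, keep_ratio=0.25):
    """Attention with a lightweight detector that omits weak connections.

    q, k, v:                  (batch, seq_len, d_model) full projections
    detector_wq, detector_wk: (d_model, d_low) low-rank detector projections
    keep_ratio:               illustrative fraction of connections kept per query
    """
    d_model = q.size(-1)
    seq_len = q.size(-2)

    # 1. Cheap estimate of attention scores using the low-rank detector.
    q_low = q @ detector_wq
    k_low = k @ detector_wk
    est_scores = q_low @ k_low.transpose(-1, -2) / d_model ** 0.5

    # 2. Keep only the strongest estimated connections per query; the rest
    #    are treated as weak and omitted.
    n_keep = max(1, int(keep_ratio * seq_len))
    topk_idx = est_scores.topk(n_keep, dim=-1).indices
    keep_mask = torch.zeros_like(est_scores, dtype=torch.bool).scatter_(-1, topk_idx, True)

    # 3. Full attention is evaluated only on the surviving connections.
    scores = q @ k.transpose(-1, -2) / d_model ** 0.5
    scores = scores.masked_fill(~keep_mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Note that this dense sketch only mirrors the math: the omitted scores are masked rather than skipped. The accelerator described in the paper avoids computing the weak connections altogether, which is where the performance and energy savings come from.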
Citation
Qu, Z., Liu, L., Tu, F., Chen, Z., Ding, Y., & Xie, Y. (2022). DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS (pp. 14–26). Association for Computing Machinery. https://doi.org/10.1145/3503222.3507738