Abstract
Transformer neural networks have demonstrated leading performance in many applications spanning language understanding, image processing, and generative modeling. Despite this impressive performance, processing long sequences with Transformers is expensive due to the quadratic computational complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm-architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight detector with the Transformer model to accurately detect and omit weak connections at runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention-detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6x and 4.5x performance speedups, and orders-of-magnitude energy-efficiency improvements, over GPU and customized hardware, respectively.
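To make the detect-and-omit idea concrete, the sketch below approximates attention scores with cheap low-rank detector projections and keeps only the strongest connections per query before running full attention. This is a minimal PyTorch illustration under stated assumptions, not the paper's actual detector: the function name, the low-rank projection matrices `detector_wq`/`detector_wk`, the `keep_ratio` parameter, and the top-k selection rule are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def detect_and_omit_attention(q, k, v, detector_wq, detector_wk, keep_ratio=0.25):
    """Attention with a lightweight detector that omits weak connections.

    q, k, v:                  (batch, seq_len, d_model) full projections
    detector_wq, detector_wk: (d_model, d_low) low-rank detector projections
    keep_ratio:               illustrative fraction of connections kept per query
    """
    d_model = q.size(-1)
    seq_len = q.size(-2)

    # 1. Cheap estimate of attention scores using the low-rank detector.
    q_low = q @ detector_wq
    k_low = k @ detector_wk
    est_scores = q_low @ k_low.transpose(-1, -2) / d_model ** 0.5

    # 2. Keep only the strongest estimated connections per query; the rest
    #    are treated as weak and omitted.
    n_keep = max(1, int(keep_ratio * seq_len))
    topk_idx = est_scores.topk(n_keep, dim=-1).indices
    keep_mask = torch.zeros_like(est_scores, dtype=torch.bool).scatter_(-1, topk_idx, True)

    # 3. Full attention is evaluated only on the surviving connections.
    scores = q @ k.transpose(-1, -2) / d_model ** 0.5
    scores = scores.masked_fill(~keep_mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Note that this dense sketch only mirrors the math: the omitted scores are masked rather than skipped. The accelerator described in the paper avoids computing the weak connections altogether, which is where the performance and energy savings come from.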
Citation
Qu, Z., Liu, L., Tu, F., Chen, Z., Ding, Y., & Xie, Y. (2022). DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS (pp. 14–26). Association for Computing Machinery. https://doi.org/10.1145/3503222.3507738