Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs

Abstract

Transformer models have emerged as a leading approach in natural language processing (NLP) and are increasingly deployed in production environments. Graphics processing units (GPUs) have become a popular choice for transformer deployment and often rely on batch processing to achieve high hardware performance. Nonetheless, current practice for transformer inference suffers from computational and memory redundancy caused by the heavy-tailed distribution of sequence lengths in NLP workloads, resulting in low practical performance. In this article, we propose a unified solution that improves both the computation and memory efficiency of real-world transformer inference on GPUs by eliminating redundant computation and memory footprint across the transformer model. First, a GPU-oriented computation approach processes the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module retains the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach with the word-accumulation approach, the solution organizes the data layout of the self-attention module at block granularity. Since these approaches greatly reduce the required memory size and cause it to fluctuate constantly, we further propose a chunk-based approach that better balances memory footprint against allocation/free efficiency. Our experimental results show that, compared with prevailing frameworks, our unified solution reduces average latency by 28% on the entire transformer model and 63.8% on the self-attention module, and shrinks the memory footprint of intermediate results by 7.8×.
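To make the redundancy concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the padding waste that batched inference incurs under heavy-tailed sequence lengths, and of word-accumulation-style packing that removes it for token-wise modules such as the MLP. The batch, sequence lengths, and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed batch: most sequences are short, one is long.
seq_lens = [7, 12, 3, 98, 5, 9]
hidden = 64
max_len = max(seq_lens)

# Conventional batching zero-pads every sequence to max_len, so the
# MLP multiplies many all-zero rows.
padded = np.zeros((len(seq_lens), max_len, hidden))
for i, n in enumerate(seq_lens):
    padded[i, :n] = rng.standard_normal((n, hidden))

# Word-accumulation-style packing: concatenate only the valid tokens
# into one (total_tokens, hidden) matrix. A token-wise layer gives the
# same per-token result, without touching padding.
packed = np.concatenate([padded[i, :n] for i, n in enumerate(seq_lens)])

W = rng.standard_normal((hidden, hidden))
out_padded = padded.reshape(-1, hidden) @ W  # includes padding rows
out_packed = packed @ W                      # valid tokens only

waste = 1 - packed.shape[0] / (len(seq_lens) * max_len)
print(f"fraction of MLP FLOPs spent on padding: {waste:.0%}")  # ~77% here
```

Self-attention cannot be packed this simply, since it mixes tokens within a sequence; that is where the paper's fine-grained, block-granularity approach applies.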
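The chunk-based memory idea can likewise be sketched in a few lines, again under our own assumptions rather than the paper's implementation: reserving memory in fixed-size chunks lets a constantly fluctuating footprint reuse chunks instead of paying an allocation and free on every change. The class and parameter names are hypothetical.

```python
class ChunkPool:
    """Hypothetical chunk pool; chunk_bytes is an illustrative knob."""

    def __init__(self, chunk_bytes: int = 1 << 20):
        self.chunk_bytes = chunk_bytes
        self.free_chunks: list[bytearray] = []

    def alloc(self, nbytes: int) -> list[bytearray]:
        """Hand out enough chunks to cover nbytes, reusing freed ones."""
        n = -(-nbytes // self.chunk_bytes)  # ceiling division
        chunks = []
        for _ in range(n):
            if self.free_chunks:
                chunks.append(self.free_chunks.pop())      # reuse
            else:
                chunks.append(bytearray(self.chunk_bytes))  # grow pool
        return chunks

    def free(self, chunks: list[bytearray]) -> None:
        """Return chunks to the pool instead of releasing them."""
        self.free_chunks.extend(chunks)


pool = ChunkPool()
buf = pool.alloc(3_000_000)  # 3 chunks for a large intermediate result
pool.free(buf)               # kept in the pool; the next alloc reuses them
assert len(pool.free_chunks) == 3
```

The chunk size trades internal fragmentation against reuse: larger chunks mean fewer allocations but more slack per buffer, mirroring the balance between memory footprint and allocation/free efficiency that the abstract describes.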

Cite (APA)

Du, J., Jiang, J., Zheng, J., Zhang, H., Huang, D., & Lu, Y. (2023). Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization, 20(4). https://doi.org/10.1145/3617689
