Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model


Abstract

Recently, large-scale transformer-based models have proven effective on a wide range of tasks across many domains. Nevertheless, deploying them in industrial production requires tedious and heavy engineering work to reduce inference costs. To fill this gap, we introduce a scalable inference solution, Easy and Efficient Transformer (EET), which includes a series of transformer inference optimizations at the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager that reduces the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library Faster Transformer v4.0, EET achieves an average speedup of 1.40-4.20x on the transformer decoder layer with an A100 GPU.
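
The abstract does not show EET's actual memory-manager interface, but the general idea behind such a manager can be sketched: preallocate one large device buffer and serve aligned sub-allocations from it, so that each decoding step does not pay for repeated cudaMalloc/cudaFree calls. The DevicePool class below is a hypothetical, minimal CUDA/C++ illustration of this caching-allocator pattern, not EET's implementation; all names (DevicePool, alloc, reset) and sizes are assumptions for the sketch.

    // Minimal sketch of a pooling CUDA memory manager (hypothetical, not EET's API):
    // one large device allocation up front, bump-pointer sub-allocations after that.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdio>

    class DevicePool {
    public:
        explicit DevicePool(size_t bytes) : capacity_(bytes), offset_(0) {
            cudaMalloc(&base_, capacity_);          // single allocation up front
        }
        ~DevicePool() { cudaFree(base_); }

        // Bump-pointer allocation, 256-byte aligned for coalesced access.
        void* alloc(size_t bytes) {
            size_t aligned = (bytes + 255) & ~size_t(255);
            if (offset_ + aligned > capacity_) return nullptr;  // pool exhausted
            void* p = static_cast<char*>(base_) + offset_;
            offset_ += aligned;
            return p;
        }

        void reset() { offset_ = 0; }  // reuse the whole pool between requests

    private:
        void*  base_;
        size_t capacity_;
        size_t offset_;
    };

    int main() {
        DevicePool pool(size_t(1) << 30);           // reserve 1 GiB once
        float* k_cache = static_cast<float*>(pool.alloc(size_t(64) << 20));
        float* v_cache = static_cast<float*>(pool.alloc(size_t(64) << 20));
        printf("k=%p v=%p\n", (void*)k_cache, (void*)v_cache);
        pool.reset();                               // next batch reuses the same memory
        return 0;
    }

A caching allocator like this trades a fixed up-front reservation for constant-time allocation during inference; a production manager would additionally handle freeing individual buffers and growing the pool, which this sketch omits.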

Cite

CITATION STYLE

APA

Li, G., Xi, Y., Ding, J., Wang, D., Luo, Z., Zhang, R., … Zhao, Z. (2022). Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model. In NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Industry Papers (pp. 62–68). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.naacl-industry.8
