NIOT: A Novel Inference Optimization of Transformers on Modern CPUs

Zining Zhang; Yao Chen; Bingsheng He; Zhenjie Zhang

Journal Article

IEEE Transactions on Parallel and Distributed Systems (2023) 34(6) 1982-1995

DOI: 10.1109/TPDS.2023.3269530

Abstract

In the machine learning era, model inference efficiency is one of the most important issues for machine learning systems. It is a major challenge to find the optimal configuration in a huge search space as the combinations of kernel fusion, memory tiling, and thread allocation strategies result in highly variable and unpredictable inference performance. The problem is particularly pronounced in models with large parameter matrices such as Transformers. In this article, we aim to develop a general and powerful framework for inference optimization, called NIOT, to achieve desirable efficiency for the prevailing Transformer-like models on CPUs. To take full advantage of modern CPU features such as SIMD and cache hierarchy, NIOT employs various methods to provide promising strategies tailored to the target Transformer model. Our C++ implementation of NIOT shows significant performance improvements over popular well-optimized model-serving runtimes such as PyTorch and ONNXRuntime.

Citation (APA)

Zhang, Z., Chen, Y., He, B., & Zhang, Z. (2023). NIOT: A Novel Inference Optimization of Transformers on Modern CPUs. IEEE Transactions on Parallel and Distributed Systems, 34(6), 1982–1995. https://doi.org/10.1109/TPDS.2023.3269530
