Abstract
Quantization optimizes machine learning inference for resource-constrained environments by reducing the precision of its computations. In the extreme, even single-bit computations can produce acceptable results at dramatically lower cost. But this ultra-low-precision quantization is difficult to exploit because extracting optimal performance requires hand-tuning both high-level scheduling decisions and low-level implementations. As a result, practitioners settle for a few predefined quantized kernels, sacrificing optimality and restricting their ability to adapt to new hardware. This paper presents a new automated approach to implementing quantized inference for machine learning models. We integrate the choice of how to lay out quantized values into the scheduling phase of a machine learning compiler, allowing it to be optimized in concert with tiling and parallelization decisions. After scheduling, we use program synthesis to automatically generate efficient low-level operator implementations for the desired precision and data layout. We scale up synthesis using a novel reduction sketch that exploits the structure of matrix multiplication. On a ResNet18 model, our generated code outperforms an optimized floating-point baseline by up to 3.9×, and a state-of-the-art quantized implementation by up to 16.6×.
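To illustrate the single-bit computation the abstract refers to, here is a minimal sketch (not the paper's generated code, and independent of its compiler) of how a dot product over {-1, +1} values reduces to bitwise XNOR and popcount once the values are packed into integer bitmasks:

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer bitmask (+1 -> 1, -1 -> 0)."""
    mask = 0
    for i, v in enumerate(values):
        if v == 1:
            mask |= 1 << i
    return mask

def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors given as bitmasks.

    Matching bits (XNOR) contribute +1 and mismatches contribute -1,
    so the result is 2 * popcount(XNOR) - n.
    """
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # keep only the low n bits
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack_bits(a), pack_bits(b), len(a)))  # prints 0, same as sum(x*y)
```

Real binarized kernels apply this trick across machine-word lanes inside a matrix-multiply loop nest, which is why layout and scheduling decisions interact so strongly with the bit-level implementation.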
Cowan, M., Moreau, T., Chen, T., Bornholt, J., & Ceze, L. (2020). Automatic generation of high-performance quantized machine learning kernels. In CGO 2020 - Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (pp. 305–316). Association for Computing Machinery, Inc. https://doi.org/10.1145/3368826.3377912