High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications

Abstract

Fueled by the surge of ever-expanding successful applications of deep neural networks and the great computational power they demand, modern processors and accelerators are beginning to offer half-precision floating-point arithmetic and special units (neural engines), such as the NVIDIA Tensor Core on GPUs and the Google Tensor Processing Unit (TPU), to accelerate the training and inference of deep neural networks. It remains unclear how neural engines can be profitably used in applications other than neural networks. In this paper we present an effort to accelerate and stabilize a fundamental matrix factorization, the QR factorization, on neural engines, which may open the door to much wider relevance in science, engineering, and data science. We show that traditional Householder QR algorithms and implementations lack the necessary data locality, parallelism, accuracy, and robustness on neural engines, which are characterized by extreme speed and low precision/range. We demonstrate that neural engines can be effectively used to accelerate matrix computations (a 3.0x-14.6x QR speedup over cuSOLVER, reaching up to 36.6 TFLOPS); however, different algorithms (recursive Gram-Schmidt) are needed to expose more locality and parallelism, even at the cost of additional computation. Moreover, scaling, iterative refinement, and other safeguarding procedures are needed to regain accuracy and avoid overflow. Our experience suggests that, with today's neural engines, matrix factorizations (QR, LU, Cholesky) are best co-designed with their applications (linear solvers, least squares, orthogonalization, SVD, etc.) to achieve high performance with adequate accuracy and reliability.
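As a rough illustration (not the paper's actual implementation), the interplay the abstract describes between low-precision speed and refinement for accuracy can be sketched with classical Gram-Schmidt plus reorthogonalization (CGS2) in NumPy. Here the projection coefficients are deliberately computed in `float16` to mimic a low-precision neural engine, and a second orthogonalization pass plays the role of the refinement step that restores accuracy:

```python
import numpy as np

def cgs2_qr(A):
    """QR via classical Gram-Schmidt with reorthogonalization (CGS2).

    Projection coefficients are computed in float16 to mimic a
    low-precision engine; the second pass refines the result so the
    columns of Q regain orthogonality lost to rounding.
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(np.float64)
        for _ in range(2):  # pass 1: orthogonalize; pass 2: refine
            # coefficients against previous columns, in simulated half precision
            c = (Q[:, :j].astype(np.float16).T
                 @ v.astype(np.float16)).astype(np.float64)
            v -= Q[:, :j] @ c          # correction applied in full precision
            R[:j, j] += c              # accumulate so that Q @ R still equals A
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R
```

A single Gram-Schmidt pass with `float16` coefficients leaves the columns of `Q` orthogonal only to roughly half-precision accuracy; the second pass is what makes the low-precision compute usable, mirroring the abstract's point that safeguarding procedures must accompany the fast, low-precision arithmetic.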

Citation (APA)
Zhang, S., Baharlouei, E., & Wu, P. (2020). High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications. In HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (pp. 17–28). Association for Computing Machinery, Inc. https://doi.org/10.1145/3369583.3392685
