Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs

Abstract

The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they can be targeted for optimization and invoked in a standard way from applications, directly through highly optimized numerical libraries such as (batched) BLAS and LAPACK. While most developments have targeted one-sided factorizations and solvers, many important applications, from big data analytics and information retrieval to low-rank approximations for solvers and preconditioners, require two-sided factorizations, most notably the SVD. To address these needs and the parallelization challenges they pose, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subprograms (BLAS) routines, in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. Our batched BLAS design builds on a device-functions-based methodology and big-tile setting techniques. These optimization techniques yield many software versions that must be tuned, so we adopt an auto-tuning strategy to derive the optimized instances of the routines automatically. We illustrate how our batched BLAS approach progressively optimizes batched SVD bi-diagonalization on GPUs. The progression is demonstrated on an NVIDIA K40c GPU and, ported via AMD's Heterogeneous-Compute Interface for Portability (HIP) C++ runtime API, on an AMD Fiji Nano GPU. We achieve 80% of the theoretically attainable peak performance for the overall algorithm, and significant acceleration of the required Level-2 BLAS GEMV and Level-3 BLAS GEMM routines compared to vendor-optimized libraries on GPUs and multicore CPUs. The optimization techniques presented here apply to other two-sided factorizations as well.
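
To make the device-functions-based methodology concrete, here is a minimal CUDA sketch of a batched GEMV along the lines the abstract describes. It is an illustrative assumption, not the paper's actual implementation: the names gemv_dev and gemv_batched are hypothetical, the per-matrix computation is factored into a __device__ function so it could be fused into larger kernels (for example, a bi-diagonalization panel step), and one thread block processes one matrix of the batch through a pointer-array interface:

    #include <cuda_runtime.h>

    // Hypothetical device function: y = alpha*A*x + beta*y for one small
    // column-major m-by-n matrix. Keeping it a __device__ function allows
    // it to be fused into larger batched kernels, in the spirit of the
    // device-functions methodology mentioned in the abstract.
    __device__ void gemv_dev(int m, int n, double alpha,
                             const double* A, int lda,
                             const double* x, double beta, double* y)
    {
        int row = threadIdx.x;          // one thread per row; assumes m <= blockDim.x
        if (row < m) {
            double sum = 0.0;
            for (int j = 0; j < n; ++j) // dot product of row `row` of A with x
                sum += A[row + j * lda] * x[j];
            y[row] = alpha * sum + beta * y[row];
        }
    }

    // One thread block per matrix in the batch, using the pointer-array
    // argument style common in batched BLAS interfaces.
    __global__ void gemv_batched(int m, int n, double alpha,
                                 const double* const* Aarray, int lda,
                                 const double* const* xarray, double beta,
                                 double* const* yarray)
    {
        int i = blockIdx.x;             // batch index
        gemv_dev(m, n, alpha, Aarray[i], lda, xarray[i], beta, yarray[i]);
    }

Launching with gemv_batched<<<batchCount, m>>>(...) would process the whole batch in a single kernel call, assuming m fits within one thread block. Whether to assign one matrix per block, as above, or to group work into larger "big tiles" is exactly the kind of choice the paper's auto-tuning strategy is meant to resolve.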

Citation (APA)

Dong, T., Haidar, A., Tomov, S., & Dongarra, J. (2018). Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs. Journal of Computational Science, 26, 237–245. https://doi.org/10.1016/j.jocs.2018.01.007
