A distributed CPU-GPU sparse direct solver

Abstract

This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach, which is to find ways to aggregate collections of small BLAS operations into larger ones; to schedule operations to achieve load balance and hide long-latency operations, such as PCIe transfer; and to exploit simultaneously all of a node's available CPU cores and GPUs. © 2014 Springer International Publishing Switzerland.
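The aggregation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the list-of-lists block layout, and the pure-Python `gemm` stand-in for a BLAS call are all assumptions made for illustration. The sketch assumes that a right-looking update multiplies every block of an L panel by every block of a U panel, and compares issuing one small product per block pair against stacking the blocks, issuing a single larger product, and scattering the result back to the block positions.

```python
# Illustrative sketch only: all names and the data layout are hypothetical,
# not taken from the paper or from any solver library.

def gemm(a, b):
    """Dense product of two matrices stored as lists of rows (BLAS stand-in)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def update_blockwise(l_blocks, u_blocks):
    """One small product per (L block, U block) pair: many tiny calls."""
    return {(i, j): gemm(l, u)
            for i, l in enumerate(l_blocks)
            for j, u in enumerate(u_blocks)}

def update_aggregated(l_blocks, u_blocks):
    """Stack the blocks, run one large product, then scatter the result."""
    # Stack L blocks vertically (they share the same number of columns).
    l_panel = [row for blk in l_blocks for row in blk]
    # Concatenate U blocks horizontally (they share the same number of rows).
    u_panel = [[x for blk in u_blocks for x in blk[r]]
               for r in range(len(u_blocks[0]))]
    product = gemm(l_panel, u_panel)  # the single large call

    # Scatter: cut the large result back into per-pair blocks.
    out, row0 = {}, 0
    for i, l in enumerate(l_blocks):
        col0 = 0
        for j, u in enumerate(u_blocks):
            width = len(u[0])
            out[(i, j)] = [product[r][col0:col0 + width]
                           for r in range(row0, row0 + len(l))]
            col0 += width
        row0 += len(l)
    return out
```

Both functions return the same blocks; the aggregated version trades extra copying (the stacking and the scatter) for a single product large enough to keep an accelerator busy, which is the kind of trade-off the abstract describes.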

Citation (APA)

Sao, P., Vuduc, R., & Li, X. S. (2014). A distributed CPU-GPU sparse direct solver. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8632 LNCS, pp. 487–498). Springer Verlag. https://doi.org/10.1007/978-3-319-09873-9_41
