A distributed CPU-GPU sparse direct solver

Piyush Sao; Richard Vuduc; Xiaoye Sherry Li

Conference Proceedings, Open Access


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014), vol. 8632 LNCS, pp. 487-498

DOI: 10.1007/978-3-319-09873-9_41


Abstract

This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach, which is to find ways to aggregate collections of small BLAS operations into larger ones; to schedule operations to achieve load balance and hide long-latency operations, such as PCIe transfer; and to exploit simultaneously all of a node's available CPU cores and GPUs. © 2014 Springer International Publishing Switzerland.
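The aggregation idea can be illustrated with a small sketch. In a right-looking LU factorization, each step updates trailing blocks with products of the form L·U_j, where the same L panel multiplies many small U blocks. The abstract does not show the authors' code, so the following is only a minimal illustration under assumptions: the function names (`schur_update_separate`, `schur_update_aggregated`), the list-of-lists matrix layout, and the column-wise concatenation strategy are all hypothetical, and plain Python stands in for the BLAS/GPU calls a real solver would use.

```python
# Illustrative sketch only: names and layout are assumptions, not the paper's code.

def matmul(a, b):
    """Dense product of row-major list-of-lists matrices a (m x k) and b (k x n)."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(len(a))]

def schur_update_separate(l_panel, u_blocks):
    """Baseline: one small product L @ U_j per block (many small BLAS-like calls)."""
    return [matmul(l_panel, u) for u in u_blocks]

def schur_update_aggregated(l_panel, u_blocks):
    """Concatenate the U blocks column-wise, do one large product, then split it."""
    widths = [len(u[0]) for u in u_blocks]
    rows = len(u_blocks[0])
    # Row p of the wide matrix is row p of every block, joined left to right.
    u_wide = [sum((u[p] for u in u_blocks), []) for p in range(rows)]
    v_wide = matmul(l_panel, u_wide)  # the single large operation
    # Cut the wide result back into per-block pieces.
    out, start = [], 0
    for w in widths:
        out.append([row[start:start + w] for row in v_wide])
        start += w
    return out

l_panel = [[1, 2], [3, 4], [5, 6]]           # 3 x 2 panel shared by all updates
u_blocks = [[[1], [0]],                       # 2 x 1 block
            [[0, 1], [1, 0]],                 # 2 x 2 block
            [[2, 2, 2], [1, 1, 1]]]           # 2 x 3 block
separate = schur_update_separate(l_panel, u_blocks)
aggregated = schur_update_aggregated(l_panel, u_blocks)
```

Both routines produce the same per-block results; the aggregated version simply replaces several small products with one larger one, which is the kind of operation that tends to use a GPU more efficiently. In a real solver the split step would typically be a scatter into the sparse trailing matrix rather than a list slice.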

Citation (APA)

Sao, P., Vuduc, R., & Li, X. S. (2014). A distributed CPU-GPU sparse direct solver. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8632 LNCS, pp. 487–498). Springer Verlag. https://doi.org/10.1007/978-3-319-09873-9_41
