Abstract
This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed-memory, right-looking, unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. Although BLAS calls can account for more than 40% of the overall factorization time, small problem sizes dominate the workload, which makes efficient GPU utilization challenging. This motivates our approach: to aggregate collections of small BLAS operations into larger ones; to schedule operations so as to achieve load balance and hide long-latency operations, such as PCIe transfers; and to exploit all of a node's available CPU cores and GPUs simultaneously. © 2014 Springer International Publishing Switzerland.
Cite
Sao, P., Vuduc, R., & Li, X. S. (2014). A distributed CPU-GPU sparse direct solver. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8632 LNCS, pp. 487–498). Springer Verlag. https://doi.org/10.1007/978-3-319-09873-9_41