Sequential performance versus scalability: Optimizing parallel LU-decomposition

Jens Simon; Jens Michael Wierum

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (1996) 1067 627-632

DOI: 10.1007/3-540-61142-8_606

Abstract

Highly efficient implementations of parallel algorithms need highly efficient sequential kernels. Therefore, libraries like BLAS are successfully used in many numerical applications. In this paper we show the tradeoff between the performance of these kernels and the scalability of parallel applications. It turns out that the fastest routine on a single node does not necessarily lead to the fastest parallel program and that the structure of the kernels has to be adapted to the communication parameters of the machine. As an example application we present an optimized parallel LU-decomposition for dense systems on a distributed memory machine. Here, the size of the submatrices of the blocked algorithm determines the performance of the matrix-matrix multiplication and, with a contrary effect, the scalability behavior.
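The role of the submatrix size can be illustrated with a minimal sketch of a blocked LU factorization. This is not the authors' implementation: it is sequential, written in plain Python for readability, and omits pivoting, so it assumes a matrix for which elimination without row exchanges is safe (for example, a diagonally dominant one). The names `blocked_lu`, `multiply_lu`, and `block_size` are illustrative only.

```python
def blocked_lu(matrix, block_size):
    """Right-looking blocked LU factorization without pivoting.

    Returns one n x n list of lists that holds U on and above the
    diagonal and the multipliers of the unit lower triangular L below it.
    Assumption: no zero pivots occur (e.g. diagonally dominant input).
    """
    a = [[float(x) for x in row] for row in matrix]  # work on a copy
    n = len(a)
    for k in range(0, n, block_size):
        e = min(k + block_size, n)  # current block covers indices k..e-1

        # 1. Factor the column panel A[k:n, k:e] with unblocked elimination.
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]

        # 2. Row panel: U12 = L11^{-1} * A12 by forward substitution.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]

        # 3. Trailing update: A22 -= L21 * U12. This is the matrix-matrix
        #    product that a tuned kernel (e.g. BLAS GEMM) would perform.
        for i in range(e, n):
            for c in range(e, n):
                a[i][c] -= sum(a[i][p] * a[p][c] for p in range(k, e))
    return a


def multiply_lu(lu):
    """Rebuild L * U from the packed result, for checking the factorization."""
    n = len(lu)
    product = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]  # unit diagonal of L
                total += l_ip * lu[p][j]
            product[i][j] = total
    return product
```

In this sketch a larger `block_size` moves more of the arithmetic into step 3, the matrix-matrix product, which is the part the abstract ties to sequential kernel performance. The abstract reports that the same parameter has the opposite effect on scalability in the parallel setting, which this sequential sketch does not model.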

Cite

Simon, J., & Wierum, J. M. (1996). Sequential performance versus scalability: Optimizing parallel LU-decomposition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1067, pp. 627–632). Springer Verlag. https://doi.org/10.1007/3-540-61142-8_606
