Sequential performance versus scalability: Optimizing parallel LU-decomposition


Abstract

Highly efficient implementations of parallel algorithms need highly efficient sequential kernels. Therefore, libraries such as BLAS are used successfully in many numerical applications. In this paper we show the tradeoff between the performance of these kernels and the scalability of parallel applications. It turns out that the fastest routine on a single node does not necessarily lead to the fastest parallel program, and that the structure of the kernels has to be adapted to the communication parameters of the machine. As an example application we present an optimized parallel LU-decomposition for dense systems on a distributed-memory machine. Here, the size of the submatrices in the blocked algorithm determines the performance of the matrix-matrix multiplication and, with the opposite effect, the scalability behavior.
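To make the role of the submatrix size concrete, the following is a minimal sequential sketch of a right-looking blocked LU factorization. It is not the authors' implementation: the function names (`blocked_lu`, `make_matrix`, `max_error`) and the block-size parameter `b` are hypothetical, pivoting is omitted for brevity (so the sketch assumes a diagonally dominant matrix), and no communication is modeled. It only shows where the block size enters: step 3 is the matrix-matrix multiplication whose operand shapes are set by `b`.

```python
def blocked_lu(a, b):
    """Factor the square matrix a in place with block size b (no pivoting).

    On return, a holds U on and above the diagonal and the strictly
    lower part of the unit lower triangular factor L below it.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)  # current block covers rows/columns k .. e-1
        # Step 1: unblocked factorization of the column panel (rows k..n-1).
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: triangular solve for the row panel, U12 = L11^{-1} A12.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: trailing update A22 -= L21 * U12. This is the
        # matrix-matrix multiplication whose inner dimension is b.
        for i in range(e, n):
            for c in range(e, n):
                s = 0.0
                for j in range(k, e):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return a


def make_matrix(n):
    """Build a deterministic, diagonally dominant test matrix (assumption)."""
    return [[1.0 / (1 + abs(i - j)) + (n if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def max_error(original, lu):
    """Return the largest absolute entry of original - L*U."""
    n = len(lu)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]  # unit diagonal of L
                s += l_ik * lu[k][j]
            worst = max(worst, abs(original[i][j] - s))
    return worst


if __name__ == "__main__":
    a = make_matrix(8)
    for b in (1, 2, 4, 8):
        lu = blocked_lu([row[:] for row in a], b)
        print(b, max_error(a, lu) < 1e-9)
```

Every block size yields the same factorization up to rounding; what changes is how the work is split. A larger `b` moves more of the arithmetic into larger multiplications in step 3, which is the part that benefits from a tuned kernel, while in a distributed setting the block size also fixes the granularity at which panels would be exchanged between nodes.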

Simon, J., & Wierum, J. M. (1996). Sequential performance versus scalability: Optimizing parallel LU-decomposition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1067, pp. 627–632). Springer Verlag. https://doi.org/10.1007/3-540-61142-8_606
