Highly efficient implementations of parallel algorithms need highly efficient sequential kernels. Therefore, libraries such as BLAS are used successfully in many numerical applications. In this paper we show the tradeoff between the performance of these kernels and the scalability of parallel applications. It turns out that the fastest routine on a single node does not necessarily lead to the fastest parallel program, and that the structure of the kernels has to be adapted to the communication parameters of the machine. As an example application we present an optimized parallel LU-decomposition for dense systems on a distributed memory machine. Here, the size of the submatrices in the blocked algorithm determines the performance of the matrix-matrix multiplication and, with the opposite effect, the scalability behavior.
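To make the role of the submatrix size concrete, the following is a minimal sequential sketch of a right-looking blocked LU-decomposition. It is not the authors' implementation: it omits pivoting and all communication, it assumes a matrix for which elimination without pivoting is safe (here a diagonally dominant test matrix), and the names `blocked_lu`, `multiply_lu`, `make_matrix`, and the block-size parameter `nb` are hypothetical, chosen only for illustration.

```python
def blocked_lu(a, nb):
    """In-place right-looking blocked LU without pivoting (illustrative sketch).

    a  : square matrix as a list of row lists, overwritten with L and U
    nb : block size (hypothetical tuning parameter)
    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    """
    n = len(a)
    for k in range(0, n, nb):
        e = min(k + nb, n)  # end of the current block column
        # Step 1: unblocked elimination on the panel a[k:n][k:e].
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: triangular solve for the block row, U12 = L11^{-1} * A12.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: trailing update A22 -= L21 * U12 (the matrix-matrix multiply).
        for i in range(e, n):
            for c in range(e, n):
                s = 0.0
                for j in range(k, e):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return a


def multiply_lu(lu):
    """Rebuild L * U from the packed factors, to check the factorization."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out


def make_matrix(n):
    """Diagonally dominant test matrix, so skipping pivoting is safe here."""
    return [[n + 1.0 if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    n = 7
    original = make_matrix(n)
    for nb in (1, 2, 3, n):
        lu = blocked_lu([row[:] for row in original], nb)
        product = multiply_lu(lu)
        err = max(abs(product[i][j] - original[i][j])
                  for i in range(n) for j in range(n))
        print("nb =", nb, "max reconstruction error =", err)
```

Step 3 is the part that a tuned matrix-matrix multiplication kernel would take over in practice. The block size `nb` sets the shape of that multiplication, which is the quantity the abstract identifies as driving single-node performance in one direction and scalability in the other.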
CITATION STYLE
Simon, J., & Wierum, J. M. (1996). Sequential performance versus scalability: Optimizing parallel LU-decomposition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1067, pp. 627–632). Springer Verlag. https://doi.org/10.1007/3-540-61142-8_606