Developing scientific computing applications on stream processors has attracted considerable attention from researchers. In this paper, we implement and optimize dense LU decomposition on a stream processor. Unlike existing parallel algorithms for LU decomposition, the StreamLUD algorithm aims to exploit producer-consumer locality and to overlap off-chip memory access with kernel execution. Simulation results show that, across matrices of different sizes, the optimized StreamLUD achieves a speedup of 2.56 to 3.64 over the LUD of HPL on an Itanium 2 processor. © 2008 Springer-Verlag Berlin Heidelberg.
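For readers unfamiliar with the computation being optimized, the following is a minimal sketch of a generic blocked (right-looking) LU decomposition in plain Python. It is not the StreamLUD algorithm, whose details the abstract does not give. The function name `lu_blocked`, the `block` parameter, and the omission of pivoting are assumptions made for illustration only. The sketch shows the three phases that a blocked LU typically splits into, where the output of one phase is consumed directly by the next; this is the general kind of data reuse that a term such as producer-consumer locality usually refers to.

```python
def lu_blocked(a, block=2):
    """Blocked right-looking LU without pivoting (illustrative assumption).

    Returns a new matrix holding U on and above the diagonal and the
    strictly lower part of unit-lower-triangular L below it.
    Assumes every pivot encountered is nonzero.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]
    for k0 in range(0, n, block):
        k1 = min(k0 + block, n)
        # Phase 1: factor the panel (columns k0..k1-1, rows k0..n-1).
        for k in range(k0, k1):
            for i in range(k + 1, n):
                lu[i][k] /= lu[k][k]
                for j in range(k + 1, k1):
                    lu[i][j] -= lu[i][k] * lu[k][j]
        # Phase 2: forward-substitute to get the block row of U
        # (rows k0..k1-1, columns k1..n-1), consuming L from phase 1.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    lu[i][j] -= lu[i][k] * lu[k][j]
        # Phase 3: update the trailing submatrix with the products of
        # the L panel (phase 1) and the U block row (phase 2).
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += lu[i][k] * lu[k][j]
                lu[i][j] -= s
    return lu


if __name__ == "__main__":
    matrix = [[4, 3, 2], [8, 8, 6], [12, 13, 14]]
    for row in lu_blocked(matrix, block=2):
        print(row)
```

In this sketch the block size controls how much of the matrix each phase touches at once; production implementations such as HPL additionally use partial pivoting, which is left out here to keep the example short.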
CITATION STYLE
Zhang, Y., Tang, T., Li, G., & Yang, X. (2008). Implementation and optimization of dense LU decomposition on the stream processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4967 LNCS, pp. 78–88). https://doi.org/10.1007/978-3-540-68111-3_9