Implementation and optimization of dense LU decomposition on the stream processor

Ying Zhang; Tao Tang; Gen Li; Xuejun Yang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2008), vol. 4967 LNCS, pp. 78-88

DOI: 10.1007/978-3-540-68111-3_9

Abstract

Developing scientific computing applications on the stream processor has attracted considerable attention from researchers. In this paper, we implement and optimize dense LU decomposition on the stream processor. Unlike existing parallel algorithms for LU decomposition, the StreamLUD algorithm aims to exploit producer-consumer locality and to overlap off-chip memory access with kernel execution. Simulation results show that, for matrices of different sizes, the optimized StreamLUD achieves a speedup of 2.56 to 3.64 over the LUD of HPL on an Itanium 2 processor. © 2008 Springer-Verlag Berlin Heidelberg.
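For readers unfamiliar with the underlying factorization, the following is a minimal sketch of textbook dense LU decomposition with partial pivoting in plain Python. It is not the StreamLUD algorithm, whose kernels and stream organization are not described in this record; the function names `lu_decompose` and `reconstruct` are hypothetical and chosen only for illustration.

```python
def lu_decompose(a):
    """Return (perm, lu) for a square matrix given as a list of rows.

    lu stores the unit-lower factor L below the diagonal and U on and
    above it; perm[i] is the index of the original row now in row i.
    Textbook algorithm, assumed here for illustration only.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: largest entry in column k at or below row k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]            # multiplier, stored as L[i][k]
            for j in range(k + 1, n):       # update the trailing submatrix
                lu[i][j] -= lu[i][k] * lu[k][j]
    return perm, lu


def reconstruct(perm, lu):
    """Multiply L by U, which yields the row-permuted input matrix."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]  # unit diagonal of L
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out


if __name__ == "__main__":
    matrix = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
    perm, lu = lu_decompose(matrix)
    print(perm)
    print(reconstruct(perm, lu))
```

The innermost trailing-submatrix update is where nearly all of the arithmetic happens, so it is the part of a dense LU factorization that a parallel or streaming implementation would typically target.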

Cite

Zhang, Y., Tang, T., Li, G., & Yang, X. (2008). Implementation and optimization of dense LU decomposition on the stream processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4967 LNCS, pp. 78–88). https://doi.org/10.1007/978-3-540-68111-3_9
