Toward scalable matrix multiply on multithreaded architectures

Abstract

We show empirically that some of the issues that affected the design of linear algebra libraries for distributed memory architectures will also likely affect such libraries for shared memory architectures with many simultaneous threads of execution, including SMP architectures and future multicore processors. The always-important matrix-matrix multiplication is used to demonstrate that a simple one-dimensional data partitioning is suboptimal in the context of dense linear algebra operations and hinders scalability. In addition, we advocate the publishing of low-level interfaces to supporting operations, such as the copying of data to contiguous memory, so that library developers may further optimize parallel linear algebra implementations. Data collected on a 16-CPU Itanium2 server supports these observations. © Springer-Verlag Berlin Heidelberg 2007.
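One way to see why a one-dimensional partitioning can scale poorly is to count the operand data each thread must read. The sketch below is a simplified illustrative model, not the analysis or the code from the paper: it assumes square n x n matrices, C = A B, and an output matrix C split into an equal grid of blocks with one block per thread. The function name `operand_elements_per_thread` and the chosen sizes are hypothetical.

```python
def operand_elements_per_thread(n, grid_rows, grid_cols):
    """Elements of A and B that one thread reads to compute its block of C.

    Assumed model: C (n x n) is split into a grid_rows x grid_cols grid of
    equal blocks, one per thread, and n is divisible by both grid sizes.
    A thread needs the matching block row of A and block column of B.
    """
    a_elements = (n // grid_rows) * n   # block row of A: (n / grid_rows) x n
    b_elements = n * (n // grid_cols)   # block column of B: n x (n / grid_cols)
    return a_elements + b_elements


n = 4096       # matrix dimension (illustrative)
threads = 16   # matches the thread count of a 16-CPU machine

# One-dimensional partitioning: C is split into 16 column panels,
# so every thread reads all of A plus one panel of B.
one_d = operand_elements_per_thread(n, 1, threads)

# Two-dimensional partitioning: C is split into a 4 x 4 grid of blocks.
two_d = operand_elements_per_thread(n, 4, 4)

print(f"1D: {one_d} elements per thread")
print(f"2D: {two_d} elements per thread")
print(f"ratio 1D/2D: {one_d / two_d:.3f}")
```

Under this model the one-dimensional split makes every thread read all of A regardless of the thread count, while the two-dimensional split shrinks both operand slices as the grid grows. Real implementations also have to account for cache blocking and for the cost of packing data into contiguous memory, which this count ignores.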

Cite

Marker, B., Van Zee, F. G., Goto, K., Quintana-Ortí, G., & Van De Geijn, R. A. (2007). Toward scalable matrix multiply on multithreaded architectures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4641 LNCS, pp. 748–757). Springer Verlag. https://doi.org/10.1007/978-3-540-74466-5_79
