An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.
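To make the recursion described above concrete, the following is a minimal sketch in C of a divide-and-conquer C += A*B. It assumes row-major storage with a leading dimension and a hypothetical BASE cutoff for the loop-based base case, whereas the paper's own code works on a quadtree (block-recursive) representation; the sketch only illustrates how the recursion itself supplies blocking at every level of the memory hierarchy, with no cache- or page-size parameters.

/* Sketch: recursive, auto-blocking C += A*B (illustrative, not the
 * authors' implementation).  Row-major layout with leading dimension
 * ld is assumed; n is taken to be a power of two for simplicity.
 */
#include <stddef.h>

#define BASE 32                      /* hypothetical base-case order */

static void mm_base(double *c, const double *a, const double *b,
                    size_t n, size_t ld)
{
    /* Plain triple loop once the operands fit comfortably in cache. */
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i*ld + j] += a[i*ld + k] * b[k*ld + j];
}

void mm_rec(double *c, const double *a, const double *b,
            size_t n, size_t ld)
{
    if (n <= BASE) { mm_base(c, a, b, n, ld); return; }

    size_t h = n / 2;                /* split every operand into quadrants */
    const double *a11 = a,        *a12 = a + h,
                 *a21 = a + h*ld, *a22 = a + h*ld + h;
    const double *b11 = b,        *b12 = b + h,
                 *b21 = b + h*ld, *b22 = b + h*ld + h;
    double       *c11 = c,        *c12 = c + h,
                 *c21 = c + h*ld, *c22 = c + h*ld + h;

    /* Each quadrant of C accumulates two quadrant products.  The four
     * C quadrants are disjoint, so the top-level calls could run as
     * independent block-oriented processes, as the abstract notes.   */
    mm_rec(c11, a11, b11, h, ld);  mm_rec(c11, a12, b21, h, ld);
    mm_rec(c12, a11, b12, h, ld);  mm_rec(c12, a12, b22, h, ld);
    mm_rec(c21, a21, b11, h, ld);  mm_rec(c21, a22, b21, h, ld);
    mm_rec(c22, a21, b12, h, ld);  mm_rec(c22, a22, b22, h, ld);
}

The value of BASE here is only a placeholder; the abstract's point is that such a compiler-unfolded base case, not hand tuning to a particular cache or page size, is what a production version of the recursion would rely on.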
Frens, J. D., & Wise, D. S. (1997). Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (pp. 206–216). ACM. https://doi.org/10.1145/263764.263789