Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs

Tuowen Zhao; Protonu Basu; Samuel Williams; Mary Hall; Hans Johansen

Conference Proceedings

International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2019)

DOI: 10.1145/3295500.3356210

Abstract

Stencil computations in real-world scientific applications may contain multiple interrelated stencils, have multiple input grids, and use higher order discretizations with high arithmetic intensity and complex expression structures. In combination, these properties place immense demands on the memory hierarchy that limit performance. Blocking techniques like tiling are used to exploit reuse in caches. Additional fine-grain data blocking can also reduce TLB, hardware prefetch, and cache pressure. In this paper, we present a code generation approach designed to further improve tiled stencil performance by exploiting reuse within the block, increasing instruction-level parallelism, and exposing opportunities for the backend compiler to eliminate redundant computation. It also enables efficient vector code generation for CPUs and GPUs. For a wide range of complex stencil computations, we are able to achieve substantial speedups over tiled baselines for the Intel KNL, Intel Skylake-X, and NVIDIA P100 architectures.
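To make the tiled baseline concrete, here is a minimal sketch of loop tiling for a simple 5-point stencil on a square 2D grid. It is not the paper's code generation approach: the function names, the weights (0.5 and 0.125), and the tile size are assumptions chosen for illustration. It shows only how tiling reorders the iteration space so that each tile's working set is visited together. In pure Python the reordering yields no cache benefit, so the point is the loop structure, not performance.

```python
def stencil_naive(src, n):
    """Apply a 5-point stencil to the interior of an n x n row-major grid."""
    dst = list(src)  # boundary values are copied through unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            c = i * n + j
            dst[c] = 0.5 * src[c] + 0.125 * (
                src[c - n] + src[c + n] + src[c - 1] + src[c + 1]
            )
    return dst


def stencil_tiled(src, n, tile=4):
    """Same stencil, but the interior is traversed tile by tile."""
    dst = list(src)
    # Outer loops step over tile origins; inner loops sweep one tile.
    for ii in range(1, n - 1, tile):
        for jj in range(1, n - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, n - 1)):
                    c = i * n + j
                    dst[c] = 0.5 * src[c] + 0.125 * (
                        src[c - n] + src[c + n] + src[c - 1] + src[c + 1]
                    )
    return dst


if __name__ == "__main__":
    n = 10
    grid = [float((i * 7 + j * 3) % 11) for i in range(n) for j in range(n)]
    # Tiling changes only the visiting order, so the results match exactly.
    print(stencil_naive(grid, n) == stencil_tiled(grid, n, tile=4))
```

Because each output point is computed with the same expression in both versions, the tiled traversal is a pure reordering. The techniques the abstract describes (reuse within a block, vector code generation) go beyond this baseline.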

Citation (APA)

Zhao, T., Basu, P., Williams, S., Hall, M., & Johansen, H. (2019). Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3295500.3356210
