High GPU performance can only be achieved if a kernel efficiently uses the multi-layered compute and memory hierarchies. For example, accelerators such as NVIDIA's Tensor Cores require specific mappings of threads to data that must be considered in data movements to and from registers. Current compilers struggle to match the performance of vendor libraries like cuBLAS, which are developed by experts in assembly. This manual low-level coding is time-consuming and makes it hard to unlock the full GPU potential, preventing the experimentation needed to achieve even higher performance. In this paper we introduce Fireiron, a scheduling language aimed at performance experts. Fireiron provides high-level abstractions for expressing GPU optimizations that are unavailable to compilers today and so far must be written in assembly. Our innovation is that both computations and data movements are first-class concepts that can be separately mapped to threads, as required for the efficient use of specialized hardware like Tensor Cores. We evaluate Fireiron on three GPU architectures against expert-written advanced matrix multiplications. First, we show that Fireiron schedules are able to express the strategies of these implementations while requiring about 6× fewer lines of code. Second, we show that the code generated from Fireiron schedules outperforms the fastest implementations (provided by cuBLAS) by more than 2×.
CITATION STYLE
Hagedorn, B., Elliott, A. S., Barthels, H., Bodik, R., & Grover, V. (2020). Fireiron: A data-movement-aware scheduling language for GPUs. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT (pp. 71–82). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3410463.3414632