ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management

Abstract

Emerging GPUs contain multiple Streaming Multiprocessors (SMs), each of which comprises both CUDA Cores and Tensor Cores. While CUDA Cores handle general-purpose computation, Tensor Cores are designed to accelerate matrix multiplication for deep learning applications. However, a GPU kernel typically uses either CUDA Cores or Tensor Cores, leaving the other processing units idle. Although many prior works co-locate kernels to improve GPU utilization, they cannot exploit the intra-SM parallelism between CUDA Cores and Tensor Cores. To this end, we propose ISPA, which exploits this intra-SM parallelism through fine-grained resource management. Specifically, ISPA designs persistent and elastic blocks to resolve the thread-slot and shared-memory contention between co-located kernels, and adopts a register allocation method to manage register contention. These resource management methods apply to both white-box kernels and cuDNN kernels. Experimental results on an NVIDIA 2080 Ti GPU show that ISPA improves system-wide throughput by 15.3% for white-box workloads and by 7.1% for cuDNN-based workloads compared with prior co-location work.
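For readers unfamiliar with the baseline mechanism the abstract builds on, the sketch below illustrates naive CUDA Core / Tensor Core kernel co-location using CUDA streams. It is a minimal illustration, not ISPA's implementation: the kernel names (vector_scale, wmma_tile_gemm), the __launch_bounds__ register cap, the one-persistent-block-per-SM grid, and all problem sizes are illustrative assumptions.

// Minimal sketch (not ISPA): co-locate a CUDA-Core kernel and a Tensor-Core
// (WMMA) kernel on the same GPU via separate streams. Compile with a Tensor
// Core-capable architecture, e.g. nvcc -arch=sm_75 for the 2080 Ti.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

// CUDA-Core kernel: persistent blocks, each block grid-strides over the array.
// __launch_bounds__ caps register usage so blocks of both kernels can co-reside.
__global__ void __launch_bounds__(256, 2)
vector_scale(float* x, float alpha, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        x[i] *= alpha;
    }
}

// Tensor-Core kernel: each warp computes one 16x16 tile of C = A * B
// (half-precision inputs, float accumulation) using the WMMA API.
__global__ void wmma_tile_gemm(const half* A, const half* B, float* C, int N) {
    int warpsPerRow = N / 16;
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (warpId >= warpsPerRow * warpsPerRow) return;
    int tileRow = warpId / warpsPerRow, tileCol = warpId % warpsPerRow;

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    for (int k = 0; k < N; k += 16) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::load_matrix_sync(a_frag, A + tileRow * 16 * N + k, N);
        wmma::load_matrix_sync(b_frag, B + k * N + tileCol * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tileRow * 16 * N + tileCol * 16, c_frag, N,
                            wmma::mem_row_major);
}

int main() {
    const int N = 1024, VEC = 1 << 20;
    half *A, *B; float *C, *x;
    cudaMalloc(&A, N * N * sizeof(half));
    cudaMalloc(&B, N * N * sizeof(half));
    cudaMalloc(&C, N * N * sizeof(float));
    cudaMalloc(&x, VEC * sizeof(float));

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    cudaStream_t cudaCoreStream, tensorCoreStream;
    cudaStreamCreate(&cudaCoreStream);
    cudaStreamCreate(&tensorCoreStream);

    // One persistent CUDA-Core block per SM leaves room on each SM for the
    // WMMA kernel's blocks, so both kernels can run concurrently.
    vector_scale<<<numSMs, 256, 0, cudaCoreStream>>>(x, 2.0f, VEC);
    int warps = (N / 16) * (N / 16);
    wmma_tile_gemm<<<(warps * 32 + 255) / 256, 256, 0, tensorCoreStream>>>(A, B, C, N);

    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(x);
    return 0;
}

Plain stream-based co-location like this leaves resource partitioning to the hardware scheduler; the abstract's point is that thread slots, shared memory, and registers must be managed explicitly so that the CUDA Core and Tensor Core kernels actually co-reside within each SM rather than serializing.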


Citation (APA)

Zhao, H., Cui, W., Chen, Q., & Guo, M. (2023). ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers, 72(5), 1473–1487. https://doi.org/10.1109/TC.2022.3214088
