In recent years, ever-growing application complexity and input dataset sizes have driven the popularity of multi-GPU systems as a desirable computing platform for many application domains. While employing multiple GPUs intuitively exposes substantial parallelism for application acceleration, the delivered performance rarely scales with the number of GPUs. One of the major challenges behind this gap is address translation efficiency. Many prior works focus on CPU or single-GPU execution scenarios, while address translation in multi-GPU systems has received little attention. In this paper, we conduct a comprehensive investigation of address translation efficiency in both "single-application-multi-GPU" and "multi-application-multi-GPU" execution paradigms. Based on our observations, we propose a new TLB hierarchy design, called least-TLB, that is tailored for multi-GPU systems and effectively improves TLB performance with minimal hardware overhead. Experimental results on 9 single-application workloads and 10 multi-application workloads indicate that the proposed least-TLB improves performance, on average, by 23.5% and 16.3%, respectively.
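The abstract does not describe the mechanism itself, so as a rough illustration of what a sharing- and spilling-aware TLB lookup across GPUs might look like, the sketch below models two GPU-side TLBs that can probe each other for shared translations and spill evicted entries to a peer with spare capacity. All names here (SpillAwareTLB, lookup, fill, page_walk) are hypothetical and assumed for illustration; this is not the paper's actual least-TLB design.

```python
# Conceptual sketch of a sharing- and spilling-aware TLB lookup across GPUs.
# Structure and names are illustrative assumptions, not the paper's design.

class SpillAwareTLB:
    def __init__(self, gpu_id, capacity, peers=None):
        self.gpu_id = gpu_id
        self.capacity = capacity
        self.entries = {}          # virtual page -> physical page
        self.peers = peers or []   # peer GPUs' TLBs that may hold shared/spilled entries

    def lookup(self, vpage):
        # 1. Local hit: translation cached in this GPU's TLB.
        if vpage in self.entries:
            return self.entries[vpage], "local-hit"
        # 2. Remote hit: probe peer TLBs for shared or spilled translations.
        for peer in self.peers:
            if vpage in peer.entries:
                return peer.entries[vpage], f"remote-hit@gpu{peer.gpu_id}"
        # 3. Miss: fall back to a (simulated) page-table walk, then fill locally.
        ppage = self.page_walk(vpage)
        self.fill(vpage, ppage)
        return ppage, "miss"

    def fill(self, vpage, ppage):
        # If the local TLB is full, spill the evicted entry to a peer with
        # spare capacity instead of dropping it.
        if len(self.entries) >= self.capacity:
            victim = next(iter(self.entries))      # FIFO-style victim choice
            victim_ppage = self.entries.pop(victim)
            for peer in self.peers:
                if len(peer.entries) < peer.capacity:
                    peer.entries[victim] = victim_ppage
                    break
        self.entries[vpage] = ppage

    @staticmethod
    def page_walk(vpage):
        # Stand-in for the expensive IOMMU/page-table walk.
        return vpage ^ 0xABCD


# Usage: two GPUs; GPU0's evicted translation is spilled to GPU1 and later
# found there as a remote hit instead of triggering another page walk.
gpu0 = SpillAwareTLB(gpu_id=0, capacity=2)
gpu1 = SpillAwareTLB(gpu_id=1, capacity=2)
gpu0.peers, gpu1.peers = [gpu1], [gpu0]

for vp in (0x1, 0x2, 0x3, 0x1):
    print(hex(vp), gpu0.lookup(vp)[1])
```

Under these assumptions, the last access to page 0x1 is served by the peer GPU's TLB rather than a page walk, which is the kind of cross-GPU reuse a sharing/spilling-aware hierarchy aims to exploit.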
CITATION STYLE
Li, B., Yin, J., Zhang, Y., & Tang, X. (2021). Improving address translation in multi-GPUs via sharing and spilling aware TLB design. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO (pp. 1154–1168). IEEE Computer Society. https://doi.org/10.1145/3466752.3480083