Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs

Abstract

In this work, we build a general piece-wise model to analyze the cost of data-parallel (DP) training of convolutional neural networks (CNNs) on clusters of GPUs. This general model is based on i) multi-layer perceptrons (MLPs) that model the NVIDIA cuDNN/cuBLAS library kernels involved in training several state-of-the-art CNNs; and ii) an analytical model of the NVIDIA NCCL Allreduce collective primitive using the Ring algorithm. The CNN training scalability study performed with this model, combined with the Roofline technique, for varying batch sizes, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster size unveils several crucial bottlenecks at both the GPU and the cluster level. To support this analysis, we validate the accuracy of the proposed model against a Python library for distributed deep-learning training.
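
For context, the two analytical ingredients named in the abstract are standard in the literature. The sketch below shows the textbook latency-bandwidth cost model of a ring Allreduce and the Roofline performance bound; the function names, parameters, and the exact latency term are illustrative assumptions and are not the model fitted in the paper.

# Minimal sketch (not the paper's exact model): textbook cost models that
# illustrate the two analytical ingredients mentioned in the abstract.

def ring_allreduce_time(message_bytes, num_gpus, latency_s, link_bandwidth_Bps):
    """Latency-bandwidth estimate of a ring Allreduce over num_gpus nodes.

    Each of the 2*(p-1) steps sends a chunk of size n/p, so the bandwidth
    term is 2*(p-1)/p * n / B, plus a per-step latency cost (assumed form).
    """
    p = num_gpus
    if p == 1:
        return 0.0
    steps = 2 * (p - 1)
    return steps * latency_s + (steps / p) * message_bytes / link_bandwidth_Bps


def roofline_gflops(arithmetic_intensity, peak_gflops, mem_bandwidth_GBps):
    """Roofline bound on attainable performance: min(peak, bandwidth * intensity)."""
    return min(peak_gflops, mem_bandwidth_GBps * arithmetic_intensity)


if __name__ == "__main__":
    # Example: 100 MB of gradients reduced across 8 GPUs over 10 GB/s links.
    t = ring_allreduce_time(100e6, 8, latency_s=5e-6, link_bandwidth_Bps=10e9)
    print(f"estimated Allreduce time: {t * 1e3:.2f} ms")
    # Example: a kernel with 10 FLOP/byte on a 900 GB/s, 15 TFLOP/s GPU.
    print(f"roofline bound: {roofline_gflops(10, 15000, 900):.0f} GFLOP/s")

In this kind of analysis, the Allreduce term captures the communication side of DP training, while the Roofline bound caps the per-GPU kernel throughput; the bottleneck study in the paper varies exactly the parameters these two expressions take as inputs.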

Cite

APA

Barrachina, S., Castelló, A., Catalán, M., Dolz, M. F., & Mestre, J. I. (2023). Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs. Computing, 105(5), 915–934. https://doi.org/10.1007/s00607-021-00997-9
