Abstract
The next generation of computing systems is likely to rely on disaggregated resources that can be dynamically reconfigured and customized to support scientific and engineering workflows requiring different cyberinfrastructure (CI) technologies. These resources include memory, accelerators, and co-processors, among others. This represents a significant shift in High Performance Computing (HPC) from the now-typical model of clusters in which these resources are permanently attached to a single server. While composable hardware frameworks built on disaggregated resources hold promise, we need to understand how to situate workflows on these resources and to evaluate the impact of this approach on workflow performance relative to "traditional" clusters. Toward developing this knowledge framework, we study the applicability and performance of deep learning workloads on GPU-enabled composable and traditional HPC computing platforms. Results from tests performed using the Horovod framework with TensorFlow and PyTorch models on these HPC environments are presented here.
He, Z., Saluja, A., Lawrence, R., Chakravorty, D., Dang, F., Perez, L., & Liu, H. (2023). Performance of Distributed Deep Learning Workloads on a Composable Cyberinfrastructure. In PEARC 2023 - Computing for the common good: Practice and Experience in Advanced Research Computing (pp. 60–67). Association for Computing Machinery, Inc. https://doi.org/10.1145/3569951.3593601