Abstract
The next generation of computing systems is likely to rely on disaggregated resources that can be dynamically reconfigured and customized to support scientific and engineering workflows requiring different cyberinfrastructure (CI) technologies. These resources include memory, accelerators, and co-processors, among others. This represents a significant shift in High Performance Computing (HPC) from the now-typical model of clusters in which these resources are permanently attached to a single server. While composable hardware frameworks built on disaggregated resources hold promise, we need to understand how to situate workflows on these resources and to evaluate the impact of this approach on workflow performance relative to "traditional" clusters. Toward developing this knowledge framework, we study the applicability and performance of deep learning workloads on GPU-enabled composable and traditional HPC computing platforms. Results from tests performed using the Horovod framework with TensorFlow and PyTorch models on these HPC environments are presented here.
He, Z., Saluja, A., Lawrence, R., Chakravorty, D., Dang, F., Perez, L., & Liu, H. (2023). Performance of Distributed Deep Learning Workloads on a Composable Cyberinfrastructure. In PEARC 2023 - Computing for the common good: Practice and Experience in Advanced Research Computing (pp. 60–67). Association for Computing Machinery, Inc. https://doi.org/10.1145/3569951.3593601