Abstract
With the rise of deep learning (DL), machine learning (ML) has become compute and data intensive, typically requiring multi-node multi-GPU clusters. As state-of-the-art models grow to trillions of parameters, their computational complexity and cost also increase rapidly. Since 2012, the cost of deep learning has doubled roughly every quarter, and this trend is likely to continue. ML practitioners face the common challenge of using hardware resources efficiently when training such large models. In this paper, we propose a new profiling tool that cross-correlates relevant system utilization metrics and framework operations. The tool supports profiling DL models at scale, identifies performance bottlenecks, and provides insights with recommendations. We deployed the profiling functionality as an add-on to Amazon SageMaker Debugger, a fully managed service that leverages an on-the-fly analysis system (called rules) to automatically identify complex issues in DL training jobs. By presenting deployment results and customer case studies, we show that the tool enables users to identify and fix issues caused by inefficient hardware resource usage, thereby reducing training time and cost.
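As a rough illustration of the cross-correlation idea the abstract describes (not the authors' implementation), the sketch below aligns timestamped GPU-utilization samples with framework operation intervals and flags operations during which mean utilization fell below a threshold; all names, data, and thresholds here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class OpInterval:
    """A framework operation with its start/end timestamps (seconds)."""
    name: str
    start: float
    end: float

def flag_low_util_ops(samples, ops, threshold=30.0):
    """Cross-correlate system metrics with framework operations.

    samples: list of (timestamp, gpu_util_percent) system-metric samples
    ops: list of OpInterval framework operations
    Returns names of ops whose mean sampled GPU utilization is below
    `threshold` percent, suggesting a bottleneck (e.g. I/O stalls).
    """
    flagged = []
    for op in ops:
        vals = [u for t, u in samples if op.start <= t < op.end]
        if vals and sum(vals) / len(vals) < threshold:
            flagged.append(op.name)
    return flagged

samples = [(0.0, 90.0), (0.5, 85.0), (1.0, 5.0), (1.5, 10.0)]
ops = [OpInterval("conv2d", 0.0, 1.0), OpInterval("dataloader", 1.0, 2.0)]
print(flag_low_util_ops(samples, ops))  # the data-loading phase is flagged
```

A production profiler would also account for overlapping operations, per-device metrics, and clock skew between the metric and framework timelines; this sketch only shows the basic alignment step.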
Citation
Rauschmayr, N., Kama, S., Kim, M., Choi, M., & Kenthapadi, K. (2022). Profiling Deep Learning Workloads at Scale using Amazon SageMaker. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 3801–3809). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539036