Profiling Deep Learning Workloads at Scale using Amazon SageMaker

Abstract

With the rise of deep learning (DL), machine learning (ML) has become compute- and data-intensive, typically requiring multi-node, multi-GPU clusters. As state-of-the-art models grow to the order of trillions of parameters, their computational complexity and cost also increase rapidly. Since 2012, the cost of deep learning has doubled roughly every quarter, and this trend is likely to continue. ML practitioners face common challenges of efficient resource utilization when training such large models. In this paper, we propose a new profiling tool that cross-correlates relevant system utilization metrics and framework operations. The tool supports profiling DL models at scale, identifies performance bottlenecks, and provides insights with recommendations. We deployed the profiling functionality as an add-on to Amazon SageMaker Debugger, a fully managed service that leverages an on-the-fly analysis system (called rules) to automatically identify complex issues in DL training jobs. By presenting deployment results and customer case studies, we show that it enables users to identify and fix issues caused by inefficient hardware resource usage, thereby reducing training time and cost.
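The core idea sketched in the abstract — aligning framework operations with system utilization metrics on a shared timeline, then applying automatic rules to flag inefficiency — can be illustrated with a minimal, self-contained Python sketch. This is an illustration of the general technique only, not the SageMaker Debugger implementation; all names (`Event`, `find_bottlenecks`, the 30% threshold) are hypothetical.

```python
# Sketch of cross-correlating framework operations with GPU utilization
# samples and applying a simple rule: flag operations during which the
# GPU sits largely idle (e.g. a data-loading phase starving the GPU).
from dataclasses import dataclass

@dataclass
class Event:
    name: str     # framework operation, e.g. "DataLoader" or "Conv2d"
    start: float  # start time in seconds
    end: float    # end time in seconds

def gpu_util_during(event, samples):
    """Mean GPU utilization (%) over samples falling inside the event."""
    vals = [u for t, u in samples if event.start <= t < event.end]
    return sum(vals) / len(vals) if vals else 0.0

def find_bottlenecks(events, gpu_samples, threshold=30.0):
    """A crude 'rule': return operations whose mean GPU utilization
    stays below the threshold, suggesting a hardware-usage bottleneck."""
    return [e.name for e in events
            if gpu_util_during(e, gpu_samples) < threshold]

# Toy data: (timestamp, gpu_util%) samples and a framework-event timeline.
samples = [(0.0, 5), (0.5, 8), (1.0, 90), (1.5, 95), (2.0, 92)]
events = [Event("DataLoader", 0.0, 1.0), Event("Conv2d", 1.0, 2.0)]
print(find_bottlenecks(events, samples))  # ['DataLoader']
```

Here the data-loading phase leaves the GPU nearly idle and is flagged, while the compute-heavy operation is not — the kind of insight-with-recommendation the paper's rules automate at scale across multi-node, multi-GPU jobs.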

Citation (APA)
Rauschmayr, N., Kama, S., Kim, M., Choi, M., & Kenthapadi, K. (2022). Profiling Deep Learning Workloads at Scale using Amazon SageMaker. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 3801–3809). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539036
