Abstract
With the rise of deep learning (DL), machine learning (ML) has become compute and data intensive, typically requiring multi-node multi-GPU clusters. As state-of-the-art models grow to trillions of parameters, their computational complexity and cost also increase rapidly. Since 2012, the cost of deep learning has doubled roughly every quarter, and this trend is likely to continue. ML practitioners face the common challenge of using hardware resources efficiently when training such large models. In this paper, we propose a new profiling tool that cross-correlates relevant system utilization metrics and framework operations. The tool supports profiling DL models at scale, identifies performance bottlenecks, and provides insights with recommendations. We deployed the profiling functionality as an add-on to Amazon SageMaker Debugger, a fully managed service that leverages an on-the-fly analysis system (called rules) to automatically identify complex issues in DL training jobs. By presenting deployment results and customer case studies, we show that the tool enables users to identify and fix issues caused by inefficient hardware resource usage, thereby reducing training time and cost.
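As a rough illustration of the cross-correlation idea the abstract describes (not the authors' implementation), the sketch below aligns timestamped GPU-utilization samples with framework operation intervals and flags operations during which mean utilization fell below a threshold; all names, data, and thresholds here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class OpInterval:
    """A framework operation with its start/end timestamps (seconds)."""
    name: str
    start: float
    end: float

def flag_low_util_ops(samples, ops, threshold=30.0):
    """Cross-correlate system metrics with framework operations.

    samples: list of (timestamp, gpu_util_percent) system-metric samples
    ops: list of OpInterval framework operations
    Returns names of ops whose mean sampled GPU utilization is below
    `threshold` percent, suggesting a bottleneck (e.g. I/O stalls).
    """
    flagged = []
    for op in ops:
        vals = [u for t, u in samples if op.start <= t < op.end]
        if vals and sum(vals) / len(vals) < threshold:
            flagged.append(op.name)
    return flagged

samples = [(0.0, 90.0), (0.5, 85.0), (1.0, 5.0), (1.5, 10.0)]
ops = [OpInterval("conv2d", 0.0, 1.0), OpInterval("dataloader", 1.0, 2.0)]
print(flag_low_util_ops(samples, ops))  # the data-loading phase is flagged
```

A production profiler would also account for overlapping operations, per-device metrics, and clock skew between the metric and framework timelines; this sketch only shows the basic alignment step.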
Citation
Rauschmayr, N., Kama, S., Kim, M., Choi, M., & Kenthapadi, K. (2022). Profiling Deep Learning Workloads at Scale using Amazon SageMaker. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 3801–3809). Association for Computing Machinery. https://doi.org/10.1145/3534678.3539036