A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

Mingzhe Xing; Hangyu Mao; Shenglin Yin; Lichen Pan; Zhengchao Zhang; Zhen Xiao; Jieyi Long

Conference ProceedingsOPEN ACCESS

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2023) 2776-2788

DOI: 10.1145/3580305.3599241

4Citations

7Readers

Get full text

Abstract

Public cloud GPU clusters are becoming emerging platforms for training distributed deep learning jobs. Under this training paradigm, the job scheduler is a crucial component to improve user experiences, i.e., reducing training fees and job completion time, which can also save power costs for service providers. However, the scheduling problem is known to be NP-hard. Most existing work divides it into two easier sub-tasks, i.e., ordering task and placement task, which are responsible for deciding the scheduling orders of jobs and placement orders of GPU machines, respectively. Due to the superior adaptation ability, learning-based policies can generally perform better than traditional heuristic-based methods. Nevertheless, there are still two main challenges that have not been well-solved. First, most learning-based methods only focus on ordering or placement policy independently, while ignoring their cooperation. Second, the unbalanced machine performances and resource contention impose huge overhead and uncertainty on job duration, but rarely be considered in existing work. To tackle these issues, this paper presents a dual-agent scheduler framework abstracted from the two sub-tasks to jointly learn the ordering and placement policies and make better-informed scheduling decisions. Specifically, we design an ordering agent with a scalable squeeze-and-communicate strategy for better cooperation; for the placement agent, we propose a novel Random Walk Gaussian Process to learn the performance similarities of GPU machines while being aware of the uncertain performance fluctuation. Finally, the dual-agent is jointly optimized with multi-agent reinforcement learning. Extensive experiments conducted on the real-world production cluster trace demonstrate the superiority of our model.

Author supplied keywords

Cite

CITATION STYLE

APA

Xing, M., Mao, H., Yin, S., Pan, L., Zhang, Z., Xiao, Z., & Long, J. (2023). A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2776–2788). Association for Computing Machinery. https://doi.org/10.1145/3580305.3599241

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

Abstract

Author supplied keywords

Cite

Register to see more suggestions