A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

Abstract

Public cloud GPU clusters are emerging as platforms for training distributed deep learning jobs. Under this training paradigm, the job scheduler is a crucial component for improving the user experience, i.e., reducing training fees and job completion time, which can also save power costs for service providers. However, the scheduling problem is known to be NP-hard. Most existing work divides it into two easier sub-tasks, an ordering task and a placement task, which decide the scheduling order of jobs and the placement order of GPU machines, respectively. Owing to their superior adaptability, learning-based policies generally perform better than traditional heuristic-based methods. Nevertheless, two main challenges have not been well solved. First, most learning-based methods focus on either the ordering or the placement policy independently and ignore their cooperation. Second, unbalanced machine performance and resource contention impose large overhead and uncertainty on job duration but are rarely considered in existing work. To tackle these issues, this paper presents a dual-agent scheduler framework, abstracted from the two sub-tasks, that jointly learns the ordering and placement policies to make better-informed scheduling decisions. Specifically, we design an ordering agent with a scalable squeeze-and-communicate strategy for better cooperation; for the placement agent, we propose a novel Random Walk Gaussian Process to learn the performance similarities of GPU machines while being aware of uncertain performance fluctuations. Finally, the dual agents are jointly optimized with multi-agent reinforcement learning. Extensive experiments on a real-world production cluster trace demonstrate the superiority of our model.
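
To make the ordering/placement split concrete, the sketch below shows the kind of traditional heuristic baseline the abstract contrasts with learned policies. It is not the paper's dual-agent method. Every name here (`Job`, `Machine`, `order_jobs`, `place_job`, `schedule`) is hypothetical, the ordering rule (shortest estimated job first) and placement rule (best fit) are assumed stand-ins, and each job is assumed to fit on a single machine, which real distributed jobs need not do.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    gpus: int         # number of GPUs requested (assumed single-machine)
    est_hours: float  # estimated duration, assumed known in advance


@dataclass
class Machine:
    name: str
    free_gpus: int


def order_jobs(jobs: List[Job]) -> List[Job]:
    """Ordering sub-task: shortest estimated job first (heuristic stand-in)."""
    return sorted(jobs, key=lambda j: j.est_hours)


def place_job(job: Job, machines: List[Machine]) -> Optional[str]:
    """Placement sub-task: best fit, i.e. the tightest machine that still fits."""
    candidates = [m for m in machines if m.free_gpus >= job.gpus]
    if not candidates:
        return None  # no machine can host the job right now
    best = min(candidates, key=lambda m: m.free_gpus)
    best.free_gpus -= job.gpus  # reserve the GPUs on the chosen machine
    return best.name


def schedule(jobs: List[Job], machines: List[Machine]) -> Dict[str, Optional[str]]:
    """Run the two sub-tasks in sequence: order the queue, then place each job."""
    return {job.name: place_job(job, machines) for job in order_jobs(jobs)}
```

In this baseline the two rules never exchange information: the ordering step ignores which machines are free, and the placement step ignores what is still queued. That independence is the gap the abstract's joint learning of ordering and placement is meant to close.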

Cite

Xing, M., Mao, H., Yin, S., Pan, L., Zhang, Z., Xiao, Z., & Long, J. (2023). A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2776–2788). Association for Computing Machinery. https://doi.org/10.1145/3580305.3599241
