GAI: A centralized tree-based scheduler for machine learning workload in large shared clusters

Ce Gao; Rui Ren; Hongming Cai

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 11335 LNCS 611-629

DOI: 10.1007/978-3-030-05054-2_46

Abstract

With widespread applications in image recognition, language translation, computer vision, and other areas, deep learning (DL) has been proliferating over the past decade. Practitioners from different business groups in industry train DL models for these applications, with different priorities, on shared cloud computing infrastructure. During model training, one of the key challenges is to minimize the lifecycle of high-priority training jobs. This paper analyzes the distributed training of machine learning (ML) models and identifies a short-board effect in the training process: GPU training requires higher network bandwidth than CPU training. This insight motivates the design of GAI, a centralized scheduler for ML workloads. GAI relies on two techniques: (1) a tree-based structure, which stores the cluster information hierarchically so that scheduling can be applied in multiple layers; and (2) a well-extended priority algorithm, which considers the priorities of model training jobs along multiple dimensions to support resource degradation and preemption. The prototype of GAI is implemented on top of Kubernetes, Kubeflow, and TensorFlow, and is evaluated using a simulator and a real cloud-based cluster. The evaluations show a 28% increase in scheduling throughput and a 21% speedup in training convergence for DL models.
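
As a rough illustration of the two techniques named in the abstract, the following is a minimal sketch and not the GAI implementation. It assumes a three-level hierarchy (cluster, rack, machine), a single integer priority per job, and a tightest-fit descent rule; the names `TreeNode`, `place`, `allocate`, and `schedule` are hypothetical. Resource degradation and preemption, which the abstract mentions, are not modeled here.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TreeNode:
    """One level of a hypothetical cluster hierarchy (cluster, rack, or machine)."""
    name: str
    free_gpus: int = 0  # only meaningful on leaves (machines)
    children: List["TreeNode"] = field(default_factory=list)

    def total_free(self) -> int:
        # Interior nodes report the free GPUs of their whole subtree.
        if not self.children:
            return self.free_gpus
        return sum(child.total_free() for child in self.children)


def place(node: TreeNode, gpus: int) -> Optional[TreeNode]:
    """Descend layer by layer to the smallest subtree that can hold the job."""
    if node.total_free() < gpus:
        return None
    candidates = [c for c in node.children if c.total_free() >= gpus]
    if not candidates:
        # Leaf, or the job has to span several children of this node.
        return node
    # Assumed rule: prefer the tightest-fitting child to keep the job local.
    return place(min(candidates, key=lambda c: c.total_free()), gpus)


def allocate(node: TreeNode, gpus: int) -> None:
    """Take `gpus` GPUs from the leaves under `node`, in child order."""
    if not node.children:
        node.free_gpus -= gpus
        return
    remaining = gpus
    for child in node.children:
        take = min(child.total_free(), remaining)
        if take:
            allocate(child, take)
            remaining -= take


def schedule(root: TreeNode, jobs: List[Tuple[int, str, int]]) -> Dict[str, Optional[str]]:
    """Place jobs given as (priority, name, gpus), highest priority first."""
    heap = [(-priority, name, gpus) for priority, name, gpus in jobs]
    heapq.heapify(heap)
    placements: Dict[str, Optional[str]] = {}
    while heap:
        _, name, gpus = heapq.heappop(heap)
        target = place(root, gpus)
        if target is not None:
            allocate(target, gpus)
        placements[name] = target.name if target is not None else None
    return placements


# Hypothetical cluster: two racks, three machines, 14 GPUs in total.
cluster = TreeNode("cluster", children=[
    TreeNode("rack-a", children=[TreeNode("m1", 4), TreeNode("m2", 2)]),
    TreeNode("rack-b", children=[TreeNode("m3", 8)]),
])
placements = schedule(cluster, [(1, "low", 4), (5, "high", 8), (3, "mid", 6)])
print(placements)
```

In this sketch the highest-priority job is placed first and lands on the single machine that can hold it, the next job is kept within one rack, and the lowest-priority job stays unplaced once the cluster is full. A scheduler with preemption would instead evict or shrink lower-priority jobs when a higher-priority job arrives later.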

Cite

Gao, C., Ren, R., & Cai, H. (2018). GAI: A centralized tree-based scheduler for machine learning workload in large shared clusters. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11335 LNCS, pp. 611–629). Springer Verlag. https://doi.org/10.1007/978-3-030-05054-2_46
