Towards scalable distributed workload manager with monitoring-based weakly consistent resource stealing

Ke Wang; Michael Lang; Xiaobing Zhou; Benjamin McClelland; Kan Qiao; Ioan Raicu

Conference Proceedings

HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (2015) 219-222

DOI: 10.1145/2749246.2749249

18 Citations

23 Readers

Abstract

One way to use the coming exascale machines efficiently is to support a mixture of applications from various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). The need to deliver high performance in resource allocation, scheduling, and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager extended directly from the centralized Slurm production system. Slurm++ employs multiple controllers, each managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and we implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs, and we expect the performance gap to grow as job sizes and system scales increase in future high-end computing systems.
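The abstract names the technique without detailing it, so the following is only a rough illustration of the general idea, not the Slurm++ implementation. It assumes each controller owns a set of free nodes, a monitor keeps a periodically refreshed (and therefore possibly stale) count of free nodes per partition, and a controller that cannot satisfy a job locally tries to take nodes from the partitions that the stale view reports as the least loaded. The names `Controller`, `Monitor`, `release`, and `allocate` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller that owns one partition of compute nodes."""
    name: str
    free: set = field(default_factory=set)

    def release(self, count):
        """Hand over up to `count` free nodes (possibly fewer, or none)."""
        taken = set(sorted(self.free)[:count])
        self.free -= taken
        return taken


class Monitor:
    """Hypothetical monitor holding a possibly stale view of free-node counts."""

    def __init__(self, controllers):
        self.controllers = controllers
        self.snapshot = {}
        self.refresh()

    def refresh(self):
        # Would run periodically in a real system; between calls the view is stale.
        self.snapshot = {c.name: len(c.free) for c in self.controllers}


def allocate(me, monitor, needed):
    """Allocate `needed` nodes for a job, stealing from other partitions if short."""
    got = me.release(needed)
    # Visit the other controllers in order of free nodes per the stale snapshot.
    others = sorted(
        (c for c in monitor.controllers if c is not me),
        key=lambda c: monitor.snapshot.get(c.name, 0),
        reverse=True,
    )
    for victim in others:
        if len(got) == needed:
            break
        # The snapshot may overstate what the victim holds; release() returns
        # only what is actually free, so a stale view costs an extra attempt.
        got |= victim.release(needed - len(got))
    if len(got) < needed:
        # Not enough nodes anywhere: keep what was gathered in the local pool
        # (an assumption of this sketch) and report failure.
        me.free |= got
        return None
    return got
```

In this sketch the snapshot is used only to order the steal attempts, while the actual transfer is decided by the victim at the time of the request. That is one simple way to tolerate weak consistency: an outdated view can waste a request but cannot hand the same node to two jobs.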

Cite

Wang, K., Lang, M., Zhou, X., McClelland, B., Qiao, K., & Raicu, I. (2015). Towards scalable distributed workload manager with monitoring-based weakly consistent resource stealing. In HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (pp. 219–222). Association for Computing Machinery, Inc. https://doi.org/10.1145/2749246.2749249
