Towards scalable distributed workload manager with monitoring-based weakly consistent resource stealing


Abstract

One way to efficiently utilize the coming exascale machines is to support a mixture of applications in various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). Delivering high performance in resource allocation, scheduling, and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager directly extended from the Slurm centralized production system. Slurm++ employs multiple controllers, each managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs, and we expect the performance gap to grow as job sizes and system scales increase in future high-end computing systems.
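The abstract does not spell out how the stealing technique works, so the following is only a minimal sketch of the general idea, not the Slurm++ implementation. It assumes each controller owns a set of free nodes, a monitor keeps a snapshot of free-node counts that may be out of date (the "weakly consistent" part), and a controller that cannot satisfy a job locally tries the controllers that the snapshot reports as richest. All names (`Controller`, `Monitor`, `allocate`, `max_attempts`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller that owns one partition of compute nodes."""
    name: str
    free_nodes: set = field(default_factory=set)

    def release(self, count):
        """Hand over up to `count` free nodes; may return fewer than asked."""
        taken = set(sorted(self.free_nodes)[:count])
        self.free_nodes -= taken
        return taken


class Monitor:
    """Holds a snapshot of free-node counts that may be stale."""

    def __init__(self, controllers):
        self.controllers = controllers
        self.snapshot = {}
        self.refresh()

    def refresh(self):
        # Assumption: a real monitor would refresh periodically in the
        # background; here it is refreshed explicitly for simplicity.
        self.snapshot = {c.name: len(c.free_nodes) for c in self.controllers}


def allocate(me, job_size, monitor, max_attempts=3):
    """Allocate `job_size` nodes for `me`, stealing from peers if needed.

    Returns the allocated node set, or None after rolling back on failure.
    """
    by_name = {c.name: c for c in monitor.controllers}
    local = me.release(job_size)  # use the local partition first
    stolen = {}  # donor name -> nodes taken from that donor
    total = len(local)
    for _ in range(max_attempts):
        if total >= job_size:
            break
        # Rank victims by the (possibly stale) snapshot, richest first.
        victims = sorted(
            (n for n in monitor.snapshot if n != me.name),
            key=lambda n: monitor.snapshot[n],
            reverse=True,
        )
        for name in victims:
            if total >= job_size:
                break
            # The snapshot may be wrong, so the steal can come back short.
            taken = by_name[name].release(job_size - total)
            stolen.setdefault(name, set()).update(taken)
            total += len(taken)
        monitor.refresh()
    if total >= job_size:
        return local.union(*stolen.values())
    # Not enough nodes anywhere: return everything to its owner.
    me.free_nodes |= local
    for name, nodes in stolen.items():
        by_name[name].free_nodes |= nodes
    return None


if __name__ == "__main__":
    c1 = Controller("c1", {"n1", "n2"})
    c2 = Controller("c2", {"n3", "n4", "n5"})
    c3 = Controller("c3", {"n6"})
    mon = Monitor([c1, c2, c3])
    c2.release(3)  # c2 becomes busy after the snapshot, making it stale
    print(allocate(c1, 3, mon))
```

In the usage example the snapshot still reports three free nodes on `c2`, so `c1` tries it first, gets nothing, and falls through to `c3`. Tolerating a stale view in this way trades occasional wasted steal attempts for not having to keep a globally consistent resource state.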

Cite

Wang, K., Lang, M., Zhou, X., McClelland, B., Qiao, K., & Raicu, I. (2015). Towards scalable distributed workload manager with monitoring-based weakly consistent resource stealing. In HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (pp. 219–222). Association for Computing Machinery, Inc. https://doi.org/10.1145/2749246.2749249
