Abstract
One way to use the coming exascale machines efficiently is to support a mixture of applications from various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). The need to deliver high performance in resource allocation, scheduling, and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager extended directly from the centralized Slurm production system. Slurm++ employs multiple controllers, each managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based, weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs; we expect the performance gap to grow as job sizes and system scales increase in future high-end computing systems.
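The abstract does not give the algorithm itself, so the following is only a minimal Python sketch of what monitoring-based, weakly consistent resource stealing between partitioned controllers might look like. The names `Controller`, `Monitor`, `allocate`, and `release_up_to`, the victim-selection rule (pick the peer the last snapshot reports as having the most free nodes), and the retry limit are all illustrative assumptions, not the Slurm++ implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """One controller owning a partition of compute nodes (hypothetical model)."""
    name: str
    free_nodes: set = field(default_factory=set)

    def release_up_to(self, count):
        """Hand over at most `count` free nodes; may return fewer than requested."""
        taken = set(sorted(self.free_nodes)[:count])
        self.free_nodes -= taken
        return taken


class Monitor:
    """Holds a periodically refreshed, possibly stale, view of free-node counts."""

    def __init__(self, controllers):
        self.controllers = controllers
        self.snapshot = {}
        self.refresh()

    def refresh(self):
        self.snapshot = {c.name: len(c.free_nodes) for c in self.controllers}


def allocate(requester, job_size, monitor, max_attempts=3):
    """Allocate `job_size` nodes, stealing from peers when the local partition is short."""
    allocation = requester.release_up_to(job_size)
    peers = {c.name: c for c in monitor.controllers if c is not requester}
    for _ in range(max_attempts):
        if len(allocation) == job_size:
            return allocation
        # Weak consistency: the snapshot may be stale, so the chosen victim
        # may hand over fewer nodes than the snapshot suggested (even none).
        victim = max(peers, key=lambda n: monitor.snapshot.get(n, 0), default=None)
        if victim is None:
            break
        allocation |= peers[victim].release_up_to(job_size - len(allocation))
        monitor.refresh()  # stand-in for the next periodic monitoring update
    if len(allocation) == job_size:
        return allocation
    # Assumption: on failure the gathered nodes stay in the requester's pool.
    requester.free_nodes |= allocation
    return None
```

In this sketch a controller never locks its peers or asks for an up-to-date global view; it acts on the last snapshot and tolerates a failed or partial steal by retrying, which is one plausible reading of "weakly consistent" here.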
Cite
Wang, K., Lang, M., Zhou, X., McClelland, B., Qiao, K., & Raicu, I. (2015). Towards scalable distributed workload manager with monitoring-based weakly consistent resource stealing. In HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (pp. 219–222). Association for Computing Machinery, Inc. https://doi.org/10.1145/2749246.2749249