Elastic parameter server load distribution in deep learning clusters


Abstract

In distributed DNN training, parameter servers (PSs) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (i.e., load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and substantial potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. A carefully designed exploitation-exploration method scales parameter servers in and out and adjusts the parameter distribution among PSs on the fly. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to a 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings.
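
The exploitation-exploration idea in the abstract can be pictured as a periodic rebalancing loop: usually migrate a parameter block from the slowest PS to the currently fastest one (exploitation), and occasionally probe a random destination to keep load estimates fresh (exploration). The sketch below is a minimal, hypothetical Python illustration; the names (rebalance, EXPLORE_PROB, the dict-based bookkeeping) are our own assumptions for exposition, not PSLD's actual interface.

import random

EXPLORE_PROB = 0.1  # chance of a random (exploration) move instead of the greedy one

def rebalance(load_by_server, blocks_by_server):
    """Migrate one parameter block away from the most loaded PS.

    load_by_server:   {server_id: measured per-iteration push/pull latency}
    blocks_by_server: {server_id: list of parameter block ids}
    Assumes at least two servers.
    """
    slowest = max(load_by_server, key=load_by_server.get)
    if random.random() < EXPLORE_PROB:
        # Exploration: try a random destination to refresh its load estimate.
        target = random.choice([s for s in load_by_server if s != slowest])
    else:
        # Exploitation: shift load to the server that is currently fastest.
        target = min(load_by_server, key=load_by_server.get)
    if blocks_by_server[slowest]:
        block = blocks_by_server[slowest].pop()
        blocks_by_server[target].append(block)
    return blocks_by_server

In PSLD itself, such decisions are additionally paired with elastically adding or removing PS instances; the sketch captures only the block-migration step between existing servers.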

Citation (APA)

Chen, Y., Peng, Y., Bao, Y., Wu, C., Zhu, Y., & Guo, C. (2020). Elastic parameter server load distribution in deep learning clusters. In SoCC 2020 - Proceedings of the 2020 ACM Symposium on Cloud Computing (pp. 507–521). Association for Computing Machinery, Inc. https://doi.org/10.1145/3419111.3421307
