HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning


Abstract

In large-scale distributed machine learning (DML) systems, parameter (gradient) synchronization among machines plays an important role in improving DML performance. State-of-the-art DML synchronization algorithms, whether parameter server (PS) based or ring-allreduce based, work in a flat way and suffer when the network size is large. In this work, we propose HiPS, a hierarchical parameter (gradient) synchronization framework for large-scale DML. In HiPS, a server-centric network topology is used to better embrace RDMA/RoCE transport between machines, and the parameters (gradients) are synchronized in a hierarchical and hybrid way. Our evaluation on BCube and Torus networks demonstrates that HiPS better matches server-centric networks. Compared with the flat algorithms (PS-based and ring-based), HiPS reduces synchronization time by 73% and 75%, respectively.
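
To make the "hierarchical and hybrid" idea concrete, below is a minimal Python sketch of two-level gradient synchronization: workers are partitioned into groups, gradients are first reduced within each group (as a ring allreduce would), then aggregated across groups (PS-style), and finally broadcast back. This is an illustrative assumption based on the abstract, not the authors' implementation; the group layout, helper names, and the simplified intra-group reduction are all hypothetical.

```python
import numpy as np

def ring_allreduce(grads):
    """Intra-group synchronization: average gradients within one group.
    A real ring allreduce circulates chunks around a ring; averaging here
    is just the simplest stand-in that yields the same result."""
    return np.mean(grads, axis=0)

def hierarchical_sync(worker_grads, group_size):
    """Two-level synchronization sketch (hypothetical structure):
    1. Partition workers into groups and reduce within each group.
    2. Aggregate the per-group results across groups (PS-style).
    3. Broadcast the global average back to every worker."""
    groups = [worker_grads[i:i + group_size]
              for i in range(0, len(worker_grads), group_size)]
    # Level 1: intra-group reduction (e.g., a ring inside one network level).
    group_results = [ring_allreduce(g) for g in groups]
    # Level 2: inter-group aggregation, weighted by group size.
    weights = [len(g) for g in groups]
    global_grad = np.average(group_results, axis=0, weights=weights)
    # Level 3: broadcast the synchronized gradient to all workers.
    return [global_grad.copy() for _ in worker_grads]

if __name__ == "__main__":
    grads = [np.random.randn(4) for _ in range(8)]  # 8 toy workers
    synced = hierarchical_sync(grads, group_size=4)
    assert np.allclose(synced[0], np.mean(grads, axis=0))
```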

Citation (APA)
Geng, J., Li, D., Cheng, Y., Wang, S., & Li, J. (2018). HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning. In NetAI 2018 - Proceedings of the 2018 Workshop on Network Meets AI and ML, Part of SIGCOMM 2018 (pp. 1–7). Association for Computing Machinery, Inc. https://doi.org/10.1145/3229543.3229544
