Empirical performance analysis of collective communication for distributed deep learning in a many-core CPU environment

Abstract

To accommodate large volumes of training data and complex training models, "distributed" deep learning training has become increasingly common. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this study, we propose a new collective communication method for a Python environment that utilizes Multi-Channel Dynamic Random Access Memory (MCDRAM) in Intel Xeon Phi Knights Landing processors. Major deep learning software platforms, such as TensorFlow and PyTorch, offer Python as their main development language, so we developed an efficient communication library by adapting the Memkind library, a C-based library for utilizing the high-performance MCDRAM memory. For performance evaluation, we tested the collective communication methods popular in distributed deep learning, such as Broadcast, Gather, and AllReduce. We conducted experiments to analyze the effect of high-performance memory and processor location on communication performance. In addition, we analyzed performance in a Docker environment, which is relevant given the recent major trend toward cloud computing. Through extensive experiments on our testbed, we confirmed that our proposed communication method improves performance by up to 487%.
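The paper's library adapts the C-based Memkind library for use from Python; as a rough illustration only (not the authors' code), the following C sketch shows how memkind can place an MPI AllReduce buffer in Knights Landing MCDRAM. The choice of the MEMKIND_HBW_PREFERRED kind, the buffer size, and the build command are assumptions made for this example.

    /*
     * Minimal sketch (assumed, not the authors' library): allocate MPI
     * AllReduce buffers in MCDRAM via the memkind C API, falling back
     * to ordinary DDR if high-bandwidth memory is unavailable.
     *
     * Example build (assumed flags): mpicc allreduce_hbw.c -lmemkind -o allreduce_hbw
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <memkind.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t count = 1 << 20;              /* 1M doubles per rank (illustrative size) */
        const size_t bytes = count * sizeof(double);

        /* MEMKIND_HBW_PREFERRED places the buffer in MCDRAM when possible
           and silently falls back to DDR otherwise. */
        double *sendbuf = memkind_malloc(MEMKIND_HBW_PREFERRED, bytes);
        double *recvbuf = memkind_malloc(MEMKIND_HBW_PREFERRED, bytes);
        if (!sendbuf || !recvbuf) {
            fprintf(stderr, "rank %d: allocation failed\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        for (size_t i = 0; i < count; i++)
            sendbuf[i] = (double)rank;             /* dummy per-rank "gradient" data */

        /* Sum the per-rank buffers, as a gradient AllReduce would. */
        MPI_Allreduce(sendbuf, recvbuf, (int)count, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("recvbuf[0] = %f\n", recvbuf[0]);

        memkind_free(MEMKIND_HBW_PREFERRED, sendbuf);
        memkind_free(MEMKIND_HBW_PREFERRED, recvbuf);
        MPI_Finalize();
        return 0;
    }

The same allocation idea can be exposed to Python (as the paper does for TensorFlow and PyTorch workloads) by wrapping memkind calls behind a buffer-allocation interface; the sketch above only shows the underlying C-level mechanism.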

Citation (APA)
Woo, J., Choi, H., & Lee, J. (2020). Empirical performance analysis of collective communication for distributed deep learning in a many-core CPU environment. Applied Sciences (Switzerland), 10(19). https://doi.org/10.3390/APP10196717
