DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

74 Citations · 69 Readers (Mendeley)

Abstract

Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that must scale to hundreds of compute nodes to be feasible. It is challenging due to the large memory capacity and bandwidth requirements on a single compute node and the high communication volumes across multiple nodes. In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters via an efficient shared-memory implementation, communication reduction using a minimum vertex-cut graph partitioning algorithm, and communication avoidance using a family of delayed-update algorithms. Our results on four common GNN benchmark datasets (Reddit, OGB-Products, OGB-Papers, and Proteins) show up to 3.7× speed-up using a single CPU socket and up to 97× speed-up using 128 CPU sockets, over baseline DGL implementations running on a single CPU socket.
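
The communication-avoidance idea summarized above can be illustrated with a small, framework-agnostic sketch. The names below (aggregate_with_delay, exchange_fn, split_ids, delay) are hypothetical and are not part of DistGNN or DGL; the sketch only shows the general shape of a delayed partial-aggregate update, in which remote contributions for vertex-cut (split) vertices are refreshed only every few iterations instead of every iteration, trading a bounded amount of staleness for reduced communication.

```python
# Minimal sketch of a delayed partial-aggregate update (assumed names, not the
# authors' code). Each rank aggregates over its local partition every
# iteration; partial sums for split (vertex-cut) vertices are exchanged with
# remote partitions only every `delay` iterations.
import numpy as np

def aggregate_with_delay(local_feats, local_adj, split_ids, delay, num_iters,
                         exchange_fn):
    """local_feats: {vertex: feature vector} for this partition.
    local_adj:   {vertex: [local neighbors]} within this partition.
    split_ids:   vertices replicated on other partitions (vertex cut).
    exchange_fn: stand-in for the collective (e.g. MPI/oneCCL) call a real
                 system would use; returns remote partial aggregates."""
    feats = {v: f.copy() for v, f in local_feats.items()}
    stale_remote = {v: np.zeros_like(feats[v]) for v in split_ids}
    for it in range(num_iters):
        # Local neighborhood aggregation: always performed, no communication.
        agg = {v: sum((feats[u] for u in nbrs), np.zeros_like(feats[v]))
               for v, nbrs in local_adj.items()}
        # Communication-avoiding step: refresh remote partial sums only every
        # `delay` iterations; otherwise reuse the stale values.
        if it % delay == 0:
            stale_remote = exchange_fn({v: agg[v] for v in split_ids})
        for v in split_ids:
            agg[v] = agg[v] + stale_remote[v]
        feats = agg  # feed aggregated features into the next iteration
    return feats

# Toy usage: one partition where vertex 2 is a split vertex; the exchange is
# mocked with zeros to keep the example self-contained.
feats = {0: np.ones(4), 1: np.ones(4), 2: np.ones(4)}
adj = {0: [1], 1: [0, 2], 2: [1]}
out = aggregate_with_delay(
    feats, adj, split_ids=[2], delay=2, num_iters=4,
    exchange_fn=lambda p: {v: np.zeros_like(a) for v, a in p.items()})
```

In a real distributed setting the exchange would be an all-to-all or reduce-scatter over partition boundaries; the delay parameter controls how much of that traffic is skipped.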

Cite (APA)

Md, V., Misra, S., Ma, G., Mohanty, R., Georganas, E., Heinecke, A., … Avancha, S. (2021). DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3458817.3480856
