Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approaches, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of the bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
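The balanced memory consumption mentioned above comes from how the two pipelines place stages on workers. The sketch below is a minimal illustration (not the paper's implementation): with D workers, the "down" pipeline places stage s on worker s while the "up" pipeline places stage s on worker D-1-s, so each worker holds one early and one late stage; the variable names, D=4, and the printed format are illustrative assumptions.

```python
# Minimal sketch of Chimera-style bidirectional stage placement (assumptions noted above).
D = 4  # number of workers / pipeline stages; assumed even, as in the paper's setting

down = {worker: worker for worker in range(D)}          # down pipeline: stage index held by each worker
up   = {worker: D - 1 - worker for worker in range(D)}  # up pipeline: mirrored stage placement

for worker in range(D):
    print(f"worker {worker}: down-pipeline stage {down[worker]}, "
          f"up-pipeline stage {up[worker]}")
# worker 0: down-pipeline stage 0, up-pipeline stage 3
# worker 1: down-pipeline stage 1, up-pipeline stage 2
# worker 2: down-pipeline stage 2, up-pipeline stage 1
# worker 3: down-pipeline stage 3, up-pipeline stage 0
```

Because a worker holding an early stage of one pipeline holds a late stage of the other, the activation memory of the two pipelines roughly evens out across workers, which is the balance the abstract refers to.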
Citation
Li, S., & Hoefler, T. (2021). Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3458817.3476145