Chimera: Efficiently training large-scale neural networks with bidirectional pipelines

Abstract

Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of the bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
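
The core idea can be sketched concretely. Chimera runs two pipelines in opposite directions over the same set of workers: the "down" pipeline places stage i on worker i, the "up" pipeline places stage i on worker D-1-i, and each direction carries half of the micro-batches of an iteration. Because the two directions ramp up and drain at complementary times, their idle slots largely fill each other, which is where the bubble reduction comes from. The sketch below is a minimal illustration of this placement and micro-batch split, assuming an even number of stages; the class and function names are hypothetical and this is not the paper's actual scheduler or implementation.

```python
# Illustrative sketch of the bidirectional-pipeline placement behind Chimera.
# All names here are hypothetical; this is not the paper's code.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BidirectionalPlacement:
    """Each worker hosts one stage of the 'down' pipeline (stage i on worker i)
    and one stage of the 'up' pipeline (stage i on worker D-1-i), so the two
    pipelines flow through the workers in opposite directions."""
    num_stages: int                            # D: pipeline depth = number of workers
    down: Dict[int, int] = field(init=False)   # worker -> stage index in the down pipeline
    up: Dict[int, int] = field(init=False)     # worker -> stage index in the up pipeline

    def __post_init__(self) -> None:
        assert self.num_stages % 2 == 0, "bidirectional placement assumes an even stage count"
        self.down = {w: w for w in range(self.num_stages)}
        self.up = {w: self.num_stages - 1 - w for w in range(self.num_stages)}


def split_micro_batches(micro_batches: List[int]) -> Tuple[List[int], List[int]]:
    """Route half of the micro-batches through each direction; the ramp-up and
    drain phases of the two pipelines then overlap and fill each other's bubbles."""
    half = len(micro_batches) // 2
    return micro_batches[:half], micro_batches[half:]


if __name__ == "__main__":
    placement = BidirectionalPlacement(num_stages=4)
    down_mb, up_mb = split_micro_batches(list(range(8)))
    for worker in range(placement.num_stages):
        print(f"worker {worker}: down-pipeline stage {placement.down[worker]}, "
              f"up-pipeline stage {placement.up[worker]}")
    print("micro-batches in the down pipeline:", down_mb)
    print("micro-batches in the up pipeline:  ", up_mb)
```

In such a placement each worker holds the weights of two stages, trading some extra weight memory for the more balanced activation memory consumption mentioned in the abstract.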




Citation (APA)

Li, S., & Hoefler, T. (2021). Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3458817.3476145

