Orchestrating Fault Prediction with Live Migration and Checkpointing

Abstract

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ∼20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ∼29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
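The abstract does not give the model's equations, so the following is only a minimal sketch of how failure prediction can influence checkpoint frequency, not the paper's method. It uses Young's classic first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF). It then makes a simplifying assumption: failures the predictor catches are handled by proactive migration and need no rollback, so only unpredicted failures count toward the effective MTBF. The function names, the `recall` parameter, and all numeric values are hypothetical illustrations and are not taken from the paper or from Summit.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def interval_with_prediction(checkpoint_cost_s: float, mtbf_s: float,
                             recall: float) -> float:
    """Checkpoint interval when a fraction `recall` of failures is predicted.

    Assumption (not from the paper): predicted failures are avoided by
    proactive migration, so only the unpredicted fraction (1 - recall)
    forces a rollback. The effective MTBF grows to mtbf / (1 - recall).
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    effective_mtbf_s = mtbf_s / (1.0 - recall)
    return young_interval(checkpoint_cost_s, effective_mtbf_s)


if __name__ == "__main__":
    cost_s = 50.0      # hypothetical time to write one checkpoint
    mtbf_s = 10_000.0  # hypothetical system mean time between failures
    print(young_interval(cost_s, mtbf_s))
    print(interval_with_prediction(cost_s, mtbf_s, recall=0.75))
```

Under these assumptions, a higher predictor recall lengthens the interval between checkpoints, which is one way fewer checkpoint writes could arise. The paper's adaptive multi-level model may differ substantially from this sketch.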

Cite

Behera, S., Wan, L., Mueller, F., Wolf, M., & Klasky, S. (2020). Orchestrating Fault Prediction with Live Migration and Checkpointing. In HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (pp. 167–171). Association for Computing Machinery, Inc. https://doi.org/10.1145/3369583.3392672
