Proactive fault tolerance in MPI applications via task migration

50Citations
Citations of this article
31Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications. © 2006 Springer-Verlag.

Cite

CITATION STYLE

APA

Chakravorty, S., Mendes, C. L., & Kalé, L. V. (2006). Proactive fault tolerance in MPI applications via task migration. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4297 LNCS, pp. 485–496). https://doi.org/10.1007/11945918_47

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free