HADAB: Enabling fault tolerance in parallel applications running in distributed environments

Vania Boccia; Luisa Carracciuolo; Giuliano Laccetti; Marco Lapegna; Valeria Mele

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7203 LNCS(PART 1) 700-709

DOI: 10.1007/978-3-642-31464-3_71

21 Citations

3 Readers

Abstract

Developing reliable and efficient scientific software for distributed computing environments requires identifying and analyzing the issues involved in designing and deploying algorithms for high-performance computing architectures and in integrating them into distributed contexts. In these environments, resource efficiency and availability can change unexpectedly because of overloading or failure of the computing nodes, the interconnection network, or both. This scenario calls for mechanisms that let the software "survive" such unexpected events while still making effective use of the computing resources. Although researchers have worked on these problems for years, fault tolerance remains an open problem for some classes of applications. Here we focus on the design and deployment of a checkpointing/migration system that enables fault tolerance in parallel applications running in distributed environments. In particular, we describe HADAB, a new hybrid checkpointing strategy, and its deployment in a meaningful case study: the PETSc Conjugate Gradient algorithm implementation. The related testing phase was performed on the University of Naples distributed infrastructure (the S.Co.P.E. infrastructure). © 2012 Springer-Verlag.
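To make the idea of checkpointing an iterative solver concrete, the following is a minimal sketch of application-level checkpoint/restart around a Conjugate Gradient loop. It is not the HADAB mechanism, whose details the abstract does not give, and it does not use PETSc or MPI: it is a serial, pure-Python illustration under those assumptions. The function names (`cg`, `save_checkpoint`, `load_checkpoint`) and the `fail_at` parameter, which simulates a node failure, are hypothetical.

```python
import os
import pickle


def matvec(A, x):
    # Dense matrix-vector product on plain Python lists.
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def save_checkpoint(path, state):
    # Write to a temporary file, then rename it over the target, so a
    # crash in the middle of a write never leaves a truncated checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Return the saved solver state, or None if no checkpoint exists.
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def cg(A, b, ckpt_path, every=2, tol=1e-10, max_iter=1000, fail_at=None):
    # Resume from the last checkpoint if one exists, else start from x = 0.
    state = load_checkpoint(ckpt_path)
    if state is None:
        x = [0.0] * len(b)
        r = list(b)          # residual b - A x for x = 0
        p = list(r)          # initial search direction
        k = 0
    else:
        x, r, p, k = state["x"], state["r"], state["p"], state["k"]
    rs = dot(r, r)
    while k < max_iter and rs ** 0.5 > tol:
        if fail_at is not None and k == fail_at:
            # Hypothetical failure injection, for demonstration only.
            raise RuntimeError("simulated node failure")
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        if k % every == 0:
            # The vectors x, r, p and the counter k are all the state
            # needed to continue the iteration after a restart.
            save_checkpoint(ckpt_path, {"x": x, "r": r, "p": p, "k": k})
    return x, k
```

Calling `cg` a second time with the same `ckpt_path` after a failure resumes from the last saved iteration instead of starting over. The `every` parameter trades checkpoint overhead against the amount of work lost on failure; in a real parallel setting each process would have to save its local portion of the vectors in a coordinated way, which this sketch does not attempt.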

Cite

Boccia, V., Carracciuolo, L., Laccetti, G., Lapegna, M., & Mele, V. (2012). HADAB: Enabling fault tolerance in parallel applications running in distributed environments. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7203 LNCS, pp. 700–709). https://doi.org/10.1007/978-3-642-31464-3_71
