Persistent fault-tolerance for divide-and-conquer applications on the grid

2Citations
Citations of this article
5Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we have presented orphan saving, an efficient mechanism addressing these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, adding the capability to also tolerate the total loss of all processors, and to allow suspending and later resuming an application. Both mechanisms have only negligible overheads in the absence of faults, even with extremely short checkpointing intervals like one minute. In the case of faults, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Also, suspending/resuming an application has only little overhead, making our approach very attractive for writing grid applications. © Springer-Verlag Berlin Heidelberg 2007.

Cite

CITATION STYLE

APA

Wrzesinska, G., Oprescu, A. M., Kielmann, T., & Bal, H. (2007). Persistent fault-tolerance for divide-and-conquer applications on the grid. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4641 LNCS, pp. 425–436). Springer Verlag. https://doi.org/10.1007/978-3-540-74466-5_46

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free