Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we presented orphan saving, an efficient mechanism that addresses these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, which adds the ability to tolerate the total loss of all processors and to suspend and later resume an application. Both mechanisms have only negligible overhead in the absence of faults, even with extremely short checkpointing intervals such as one minute. In the case of faults, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Suspending and resuming an application also incurs little overhead, making our approach very attractive for writing grid applications. © Springer-Verlag Berlin Heidelberg 2007.
CITATION STYLE
Wrzesinska, G., Oprescu, A. M., Kielmann, T., & Bal, H. (2007). Persistent fault-tolerance for divide-and-conquer applications on the grid. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4641 LNCS, pp. 425–436). Springer Verlag. https://doi.org/10.1007/978-3-540-74466-5_46