Fault tolerant wide-area parallel computing

Jon B. Weissman

Conference Proceedings


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2000), Vol. 1800 LNCS, pp. 1214–1225

DOI: 10.1007/3-540-45591-4_168



Abstract

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods. © 2000 Springer-Verlag Berlin Heidelberg.

Citation (APA)

Weissman, J. B. (2000). Fault tolerant wide-area parallel computing. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1800 LNCS, pp. 1214–1225). Springer Verlag. https://doi.org/10.1007/3-540-45591-4_168
