Experimental assessment of the practicality of a fault-tolerant system

Abstract

Fault tolerance has gained renewed importance with the proliferation of high-performance clusters. However, fault-tolerant systems have not yet been widely adopted commercially because they are hard to deploy, use, manage, maintain, or justify. To meet the demand for a fault-tolerant system, we have developed M3, a practical and easily deployable multiple fault-tolerant MPI system for Myrinet. In this paper, we run rigorous tests using real-world applications to validate that M3 can be used in commercial clusters. We also describe improvements made to our system to solve various problems that arose when deploying it on a commercial cluster. This paper models our system's checkpoint overhead and presents the results of a series of tests using computation- and communication-intensive MPI applications used commercially in various fields of science. The experimental results show that our system not only adapts well to various running environments but can also be practically deployed in commercial clusters. © Springer-Verlag Berlin Heidelberg 2007.

Cite

Kim, J. W., Lee, J., & Yeom, H. Y. (2007). Experimental assessment of the practicality of a fault-tolerant system. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4362 LNCS, pp. 878–887). Springer Verlag. https://doi.org/10.1007/978-3-540-69507-3_76
