An intelligent management of fault tolerance in cluster using RADICMPI

Abstract

Independence from special elements, transparency, and scalability are key features required of fault tolerance schemes for modern computer clusters. To meet these requirements, we developed the RADIC architecture (Redundant Array of Distributed Independent Checkpoints). RADIC is based on a fully distributed array of processes that collaborate to form a distributed fault tolerance controller. This controller works without special, central, or stable elements. RADIC performs its fault tolerance activities transparently to the user application, using a message-log rollback-recovery protocol. Using the RADIC concepts, we implemented a prototype, RADICMPI, which provides some standard MPI directives and includes all the functionality of RADIC. We tested RADICMPI in a real environment by injecting failures into cluster nodes and monitoring the behavior of the application. Our tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism. © Springer-Verlag Berlin Heidelberg 2006.
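
As an illustration of the general idea behind a message-log rollback-recovery protocol, the following is a minimal single-process sketch in Python. It is not RADICMPI code and does not use MPI. The names `Protector`, `Worker`, and `run_demo` are hypothetical, and the choice to log each message at the receiver before delivering it is an assumption made for the example, since the abstract does not describe the protocol in that detail. The sketch shows a worker whose checkpoint and message log are held by a separate object standing in for a process on another node, and a recovery step that restores the checkpoint and replays the logged messages.

```python
import copy


class Protector:
    """Hypothetical stand-in for a process on another node that holds recovery data."""

    def __init__(self):
        self.checkpoint = None
        self.log = []

    def store_checkpoint(self, state):
        # Keep an independent copy of the state and drop messages
        # that the new checkpoint already reflects.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def log_message(self, msg):
        self.log.append(msg)


class Worker:
    """Hypothetical application process with deterministic message handling."""

    def __init__(self, protector):
        self.state = {"total": 0, "received": 0}
        self.protector = protector
        protector.store_checkpoint(self.state)  # initial checkpoint

    def receive(self, msg):
        # Assumption for this sketch: log the message before applying it.
        self.protector.log_message(msg)
        self._apply(msg)

    def _apply(self, msg):
        self.state["total"] += msg
        self.state["received"] += 1

    def checkpoint(self):
        self.protector.store_checkpoint(self.state)

    @classmethod
    def recover(cls, protector):
        # Roll back to the last checkpoint, then replay the logged messages.
        worker = cls.__new__(cls)
        worker.protector = protector
        worker.state = copy.deepcopy(protector.checkpoint)
        for msg in protector.log:
            worker._apply(msg)
        return worker


def run_demo():
    protector = Protector()
    worker = Worker(protector)
    for msg in (1, 2, 3):
        worker.receive(msg)
    worker.checkpoint()
    for msg in (4, 5):
        worker.receive(msg)
    state_before_failure = copy.deepcopy(worker.state)
    del worker  # simulate losing the node that ran the worker
    recovered = Worker.recover(protector)
    return state_before_failure, recovered.state
```

Calling `run_demo()` returns the worker's state just before the simulated failure together with the state rebuilt from the checkpoint and log, so the two can be compared. The replay step only reproduces the lost state because `_apply` is deterministic. A real system would also have to handle failure detection and in-flight messages, which this sketch leaves out.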

Cite

Duarte, A., Rexachs, D., & Luque, E. (2006). An intelligent management of fault tolerance in cluster using RADICMPI. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4192 LNCS, pp. 150–157). Springer Verlag. https://doi.org/10.1007/11846802_26
