A dynamic and reliable failure detection and failure recovery services in the grid systems


Abstract

Fault tolerance and resource monitoring are important services in grid computing systems, which comprise heterogeneous, geographically distributed resources. Reliability and performance must be treated as major criteria when executing safety-critical applications in grid systems. Since resource failures can lead to job execution failure, a fault tolerance service is essential for dependability in grid systems. This paper proposes a fault tolerance and resource monitoring service that improves dependability with respect to economic efficiency. The dynamic architecture of the method reduces resource consumption, performance overhead, and network traffic. The proposed fault tolerance service consists of failure detection and failure recovery. A two-layered detection service is proposed to improve failure coverage and reduce the probability of false alarms. An application-level checkpointing technique with an appropriate grain size is proposed as the recovery service to attain a tradeoff between failure detection latency and performance overhead. An analytical approach is used to analyze the reliability and efficiency of the proposed fault tolerance services. © 2012 Springer Science+Business Media B.V.
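
The abstract does not describe how the recovery service is implemented, so the following is only a minimal sketch of application-level checkpointing in general, not the authors' method. The function name `run_with_checkpoints`, the `grain` parameter (number of work items between checkpoints), the `fail_at` switch used to simulate a resource failure, and the JSON checkpoint file are all assumptions made for illustration.

```python
import json
import os


def run_with_checkpoints(items, ckpt_path, grain=3, fail_at=None):
    """Sum `items`, saving progress every `grain` items (hypothetical helper).

    If a checkpoint file already exists, the job resumes from it instead of
    starting over. `fail_at` simulates a resource failure at that index.
    """
    state = {"next": 0, "total": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # resume from the last saved state

    for i in range(state["next"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += items[i]
        state["next"] = i + 1
        if state["next"] % grain == 0:
            # Write to a temporary file, then rename, so a failure during
            # the write cannot leave a truncated checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state["total"]
```

In this sketch a smaller `grain` means less work is repeated after a failure but more time is spent writing checkpoints, which is one simple way to picture the tradeoff the abstract refers to.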

Cite

Arasteh, B., Zadahmadjafarlou, M., & Hosseini, M. J. (2012). A dynamic and reliable failure detection and failure recovery services in the grid systems. In Lecture Notes in Electrical Engineering (Vol. 114 LNEE, pp. 497–509). https://doi.org/10.1007/978-94-007-2792-2_47
