Fault tolerance and resource monitoring are important services in grid computing systems, which comprise heterogeneous and geographically distributed resources. Reliability and performance must be treated as major criteria when executing safety-critical applications in grid systems. Since resource failures can lead to job execution failures, a fault tolerance service is essential to achieve dependability in grid systems. This paper proposes a fault tolerance and resource monitoring service that improves dependability with respect to economic efficiency. The dynamic architecture of the method reduces resource consumption, performance overhead, and network traffic. The proposed fault tolerance service consists of failure detection and failure recovery. A two-layered detection service is proposed to improve failure coverage and reduce the probability of false alarms. An application-level checkpointing technique with an appropriate granularity is proposed as the recovery service to attain a tradeoff between failure detection latency and performance overhead. An analytical approach is used to analyze the reliability and efficiency of the proposed fault tolerance services. © 2012 Springer Science+Business Media B.V.
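As a rough illustration of application-level checkpointing with a tunable granularity, the following minimal sketch may help. It is not the paper's implementation: the function names, the `checkpoint_every` and `fail_at` parameters, the pickle-based state file, and the toy summation job are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(items, checkpoint_every, path, fail_at=None):
    """Sum `items`, saving state every `checkpoint_every` items.

    `fail_at` is a hypothetical knob that injects a simulated resource
    failure just before the item at that index is processed.
    """
    # Recovery: resume from the last saved state if one exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            next_index, total = pickle.load(f)
    else:
        next_index, total = 0, 0

    for i in range(next_index, len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated resource failure")
        total += items[i]
        # Checkpoint granularity: a smaller interval loses less work
        # after a failure but pays more write overhead.
        if (i + 1) % checkpoint_every == 0:
            with open(path, "wb") as f:
                pickle.dump((i + 1, total), f)
    return total


def demo():
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    data = list(range(10))
    try:
        run_with_checkpoints(data, checkpoint_every=3, path=path, fail_at=7)
    except RuntimeError:
        pass  # the job failed; the checkpoint written after six items survives
    # Restart: the job resumes at index 6 instead of starting from 0.
    return run_with_checkpoints(data, checkpoint_every=3, path=path)
```

In this sketch, a larger `checkpoint_every` lowers the write overhead during failure-free runs, while a smaller value shortens the amount of work repeated after a failure.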
CITATION STYLE
Arasteh, B., Zadahmadjafarlou, M., & Hosseini, M. J. (2012). A dynamic and reliable failure detection and failure recovery services in the grid systems. In Lecture Notes in Electrical Engineering (Vol. 114 LNEE, pp. 497–509). https://doi.org/10.1007/978-94-007-2792-2_47