Fault Tolerance In Grid Computing: State of the Art and Open Issues

Ritu Garg; Awadhesh Kumar Singh

Journal Article, Open Access

International Journal of Computer Science & Engineering Survey (2011) 2(1) 88-97

DOI: 10.5121/ijcses.2011.2107

Abstract

Fault tolerance is an important property of large-scale computational grid systems, in which geographically distributed nodes cooperate to execute a task. To achieve a high level of reliability and availability, the grid infrastructure should be foolproof and fault tolerant. Because resource failures can be fatal to job execution, a fault tolerance service is essential to satisfy QoS requirements in grid computing. The most commonly used techniques for providing fault tolerance are job checkpointing and replication. Both techniques reduce the amount of work lost when system availability changes, but both can introduce significant runtime overhead. That overhead depends largely on the length of the checkpointing interval and on the chosen number of replicas, respectively. In complex scientific workflows, where tasks execute in a well-defined order, reliability is a further major challenge because of the unreliable nature of grid resources.
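The dependence of overhead on the checkpointing interval and on the number of replicas can be illustrated with a small sketch. It is not taken from the article: it assumes a standard first-order model in which failures are independent, each checkpoint costs a fixed time, and a failure loses on average half an interval of work. All function and parameter names are hypothetical. Under these assumptions the overhead is minimized at the interval given by Young's approximation, sqrt(2 x checkpoint cost x mean time between failures).

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of time lost per unit of useful work.

    Assumed model (not from the article): one checkpoint of fixed cost
    per interval, plus half an interval of rework per failure.
    """
    checkpoint_share = checkpoint_cost / interval  # time spent saving state
    rework_share = interval / (2.0 * mtbf)         # expected recomputation
    return checkpoint_share + rework_share


def young_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def replication_success(p_fail, replicas):
    """Probability that at least one replica finishes, assuming each
    replica fails independently with probability p_fail."""
    return 1.0 - p_fail ** replicas


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # seconds; illustrative values only
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, overhead_fraction(interval, cost, mtbf))
    for n in (1, 2, 3):
        print(n, replication_success(0.2, n))
```

In this model, checkpointing too often wastes time writing checkpoints and checkpointing too rarely wastes time redoing lost work. Each extra replica raises the chance of completion but multiplies resource consumption.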

Cite

Garg, R., & Kumar Singh, A. (2011). Fault Tolerance In Grid Computing: State of the Art and Open Issues. International Journal of Computer Science & Engineering Survey, 2(1), 88–97. https://doi.org/10.5121/ijcses.2011.2107
