High performance computing systems with various checkpointing schemes

N. Naksinehaboon; M. Pǎun; R. Nassar; B. Leangsuksun; S. Scott

Journal Article, Open Access

International Journal of Computers, Communications and Control (2009) 4(4) 386-400

DOI: 10.15837/ijccc.2009.4.2455

Abstract

Finding the failure rate of a system is a crucial step in the analysis of high performance computing systems. To cope with failures, a fault-tolerant mechanism called the checkpoint/restart technique was introduced. However, this mechanism incurs additional costs. We therefore propose two models, one for the full checkpoint scheme and one for the incremental checkpoint scheme. The models, which are based on the reliability of the system, are used to determine the checkpoint placements. Both proposed models balance the checkpoint overhead against the re-computing time. Because each incremental checkpoint adds extra cost during the recovery period, we also give a method to find the number of incremental checkpoints between two consecutive full checkpoints. Our simulation suggests that in most cases the incremental checkpoint model reduces the waste time more than the full checkpoint model does. The waste times produced by both models range from 2% to 28% of the application completion time, depending on the checkpoint overheads. Copyright © 2006-2009 by CCC Publications.
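
The balance between checkpoint overhead and re-computing time mentioned in the abstract can be illustrated with a small sketch. This is not the reliability-based model proposed in the paper. It is the classical first-order approximation usually attributed to Young, and it assumes exponentially distributed failures, a constant cost per full checkpoint, and an interval much shorter than the mean time between failures. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    Assumption: failures are exponentially distributed with mean `mtbf`,
    and every checkpoint costs the same `checkpoint_cost` (same time unit).
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order waste per unit of useful work for a given interval.

    The first term is the checkpoint overhead; the second is the expected
    re-computing time (on average, half an interval is lost per failure).
    """
    overhead = checkpoint_cost / interval
    recompute = interval / (2.0 * mtbf)
    return overhead + recompute


if __name__ == "__main__":
    cost_h = 0.1   # hypothetical checkpoint cost, in hours
    mtbf_h = 20.0  # hypothetical mean time between failures, in hours
    best = young_interval(cost_h, mtbf_h)
    for t in (best / 2, best, best * 2):
        print(f"interval={t:.2f} h  waste={waste_fraction(t, cost_h, mtbf_h):.3f}")
```

Checkpointing more often than the computed interval raises the overhead term, and checkpointing less often raises the re-computing term, which is the trade-off both of the paper's models address in their own way.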

Citation (APA)

Naksinehaboon, N., Pǎun, M., Nassar, R., Leangsuksun, B., & Scott, S. (2009). High performance computing systems with various checkpointing schemes. International Journal of Computers, Communications and Control, 4(4), 386–400. https://doi.org/10.15837/ijccc.2009.4.2455