Checkpoints have gained relevance as scientific applications grow more complex and use many resources over long periods of time. Fault tolerance strategies must therefore account not only for the impact of the application itself on HPC systems but also for the impact of the checkpoint. A checkpoint saves information about the application and the system to stable storage so that the application can be restored if necessary. Because a checkpoint behaves like an I/O-intensive application, its storage demands can have a large impact on the application. This paper therefore presents an analysis of checkpoint I/O behavior. The number of checkpoints performed in an application is often tied to the maximum overhead the user is willing to introduce. If we know the maximum overhead the user is willing to pay and the overhead that a checkpoint introduces, we can calculate the number of checkpoints to perform. This overhead depends significantly on the I/O operations. The PIOM-PX tool was used to analyze the spatial and temporal I/O patterns of the checkpoint, and a model was designed from this analysis to predict their behavior. This information is used to calculate the number of checkpoints to perform in an application given a maximum overhead predefined by the user. The results help explain what happens when a checkpoint is created in an HPC system, supporting decisions that adapt to the user’s requirements.
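The relationship between the overhead budget and the number of checkpoints can be illustrated with a minimal sketch. It is not the model from the paper: it assumes every checkpoint costs the same fixed time, that this cost is dominated by writing the checkpoint file, and that the budget is a fraction of the fault-free runtime. The function names, parameters, and example values are all hypothetical.

```python
import math


def estimate_checkpoint_cost(checkpoint_size_bytes, write_bandwidth_bytes_per_s):
    """Rough per-checkpoint cost in seconds.

    Assumption: the cost is dominated by writing the checkpoint to stable
    storage at a constant sustained bandwidth.
    """
    if checkpoint_size_bytes < 0 or write_bandwidth_bytes_per_s <= 0:
        raise ValueError("size must be >= 0 and bandwidth must be > 0")
    return checkpoint_size_bytes / write_bandwidth_bytes_per_s


def max_checkpoints(app_runtime_s, max_overhead_fraction, checkpoint_cost_s):
    """Largest number of checkpoints whose total cost fits the overhead budget.

    Assumption: every checkpoint costs the same, and the budget is
    max_overhead_fraction * app_runtime_s.
    """
    if app_runtime_s <= 0 or checkpoint_cost_s <= 0:
        raise ValueError("runtime and checkpoint cost must be > 0")
    if max_overhead_fraction < 0:
        raise ValueError("overhead fraction must be >= 0")
    budget_s = app_runtime_s * max_overhead_fraction
    return math.floor(budget_s / checkpoint_cost_s)


if __name__ == "__main__":
    # Hypothetical values: a 3 GB checkpoint written at 100 MB/s,
    # a one-hour application, and a 25% overhead budget.
    cost_s = estimate_checkpoint_cost(3e9, 1e8)
    print(cost_s, max_checkpoints(3600.0, 0.25, cost_s))
```

In practice the per-checkpoint cost would come from measurement or from a fitted I/O model rather than from a single bandwidth figure, since the abstract notes that the overhead depends on the checkpoint's spatial and temporal I/O patterns.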
CITATION STYLE
León, B., Gomez-Sanchez, P., Franco, D., Rexachs, D., & Luque, E. (2020). Analysis of checkpoint I/O behavior. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12137 LNCS, pp. 191–205). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-50371-0_14