HPC-SFI: System-level fault injection for high performance computing systems

Yanqi Wang; Qi Zhang; Yi Liu; Depei Qian

Conference Proceedings (Open Access)


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 11276 LNCS 103-113

DOI: 10.1007/978-3-030-05677-3_9


Abstract

Resilience and fault tolerance have become key challenges for large-scale parallel systems. To ensure the reliability of high performance computing (HPC) systems, various techniques have been proposed, such as hardware-level fault tolerance, checkpointing, replication, and algorithm-based fault tolerance. Many software systems also monitor and handle system failures, e.g. the management and job-scheduling systems of HPC systems. Evaluating the effectiveness of these systems requires a tool that can inject failures into an HPC system. This paper proposes HPC-SFI, a system-level fault injection tool for HPC systems. HPC-SFI can generate three kinds of system failures: in-node faults, failures in the interconnection network, and failures of the storage/parallel file system. In addition, HPC-SFI can inject system faults in a pseudo-random mode according to pre-defined parameters and probabilities. Preliminary experimental results demonstrate the effectiveness of the tool.
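
The pseudo-random mode mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how HPC-SFI is implemented, so everything below is an assumption made for illustration: the names `FaultEvent` and `plan_faults`, the exponential inter-arrival times, and the per-kind weights are hypothetical and are not HPC-SFI's actual interface. The sketch only plans a reproducible schedule of faults drawn from the three categories named above; it injects nothing.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    """One planned fault (hypothetical structure, not from the paper)."""
    time_s: float  # planned injection time, seconds from the start of the run
    kind: str      # "node", "network", or "storage"
    target: str    # identifier of the component to disturb


def plan_faults(targets, weights, rate_per_s, duration_s, seed):
    """Draw a reproducible pseudo-random fault schedule.

    targets    -- dict mapping fault kind to a list of candidate components
    weights    -- dict mapping fault kind to its relative probability
    rate_per_s -- mean number of faults per second (assumed Poisson arrivals)
    duration_s -- length of the experiment window in seconds
    seed       -- fixing the seed makes the schedule repeatable
    """
    rng = random.Random(seed)  # private generator, independent of global state
    kinds = list(weights)
    probs = [weights[k] for k in kinds]
    events = []
    t = rng.expovariate(rate_per_s)  # time of the first fault
    while t < duration_s:
        kind = rng.choices(kinds, weights=probs, k=1)[0]
        target = rng.choice(targets[kind])
        events.append(FaultEvent(t, kind, target))
        t += rng.expovariate(rate_per_s)  # gap until the next fault
    return events


if __name__ == "__main__":
    # Example component names are placeholders.
    targets = {
        "node": ["node-0", "node-1"],
        "network": ["switch-0"],
        "storage": ["ost-0"],
    }
    weights = {"node": 0.6, "network": 0.3, "storage": 0.1}
    for event in plan_faults(targets, weights, rate_per_s=0.05,
                             duration_s=600.0, seed=1):
        print(f"{event.time_s:8.1f}s  {event.kind:8s} {event.target}")
```

Seeding a private `random.Random` instance is the design choice that matters here: rerunning with the same parameters and seed yields the same schedule, which makes a failed resilience experiment repeatable.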

Cite


Wang, Y., Zhang, Q., Liu, Y., & Qian, D. (2018). HPC-SFI: System-level fault injection for high performance computing systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11276 LNCS, pp. 103–113). Springer Verlag. https://doi.org/10.1007/978-3-030-05677-3_9
