Highly reliable linux HPC clusters: Self-awareness approach

Chokchai Leangsuksun; Tong Liu; Yudan Liu; Stephen L. Scott; Richard Libby; Ibrahim Haddad

Journal Article

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2004) 3358 217-222

DOI: 10.1007/978-3-540-30566-8_27

Abstract

Current solutions for fault tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by transient failures and require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort making inroads on this problem. This paper discusses detailed solutions for enhancing the high availability and serviceability of clusters with HA-OSCAR via multi-head-node failover and a service-level fault tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of HA-OSCAR availability improvement was also conducted with various sensitivity factors. Finally, the paper presents the details of the system layering strategy, dependability modeling, and analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN). © Springer-Verlag Berlin Heidelberg 2004.

Citation (APA)

Leangsuksun, C., Liu, T., Liu, Y., Scott, S. L., Libby, R., & Haddad, I. (2004). Highly reliable linux HPC clusters: Self-awareness approach. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3358, 217–222. https://doi.org/10.1007/978-3-540-30566-8_27