An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis

Abstract

The primary focus in the analysis of massively parallel supercomputers has traditionally been on their performance. However, their complex network topologies, large numbers of processors, and sophisticated system software can make them very unreliable. If the failure of any one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Therefore, fault tolerance is required. The basis of effective mechanisms for fault tolerance is efficient diagnosis. This paper deals with concurrent and hierarchical system-level diagnosis for a particular massively parallel architecture and with a simulation-based method to validate the proposed diagnosis algorithm. We present the diagnosis algorithm and describe a simulation-based method for testing and verifying the fault-tolerance algorithms as early as the design phase of the target machine.

Cite

Altmann, J., Balbach, F., & Hein, A. (1994). An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 852 LNCS, pp. 372–385). Springer Verlag. https://doi.org/10.1007/3-540-58426-9_142
