The analysis of massively parallel supercomputers has traditionally focused on their performance. However, their complex network topologies, large numbers of processors, and sophisticated system software can make them very unreliable. If the failure of any one of its many components could shut down a massively parallel computer, the machine would be useless. Fault tolerance is therefore required. Effective fault tolerance mechanisms depend on efficient diagnosis. This paper deals with concurrent and hierarchical system-level diagnosis for a particular massively parallel architecture. We present the diagnosis algorithm and describe a simulation-based method for testing and verifying the fault tolerance algorithms as early as the design phase of the target machine.
CITATION STYLE
Altmann, J., Balbach, F., & Hein, A. (1994). An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 852 LNCS, pp. 372–385). Springer Verlag. https://doi.org/10.1007/3-540-58426-9_142