The FTMPS project addresses the need for fault tolerance in large systems. A complete fault-tolerance approach has been developed and is being implemented. Built-in hardware error-detection features, combined with software error-detection techniques, provide high coverage of transient as well as permanent failures. The diagnosis software then collects the information needed by the OSS (statistics and visualisation) and for possible reconfiguration. Backward error recovery, based on checkpointing and rollback, is implemented.
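As an illustration of backward error recovery in general, the following minimal sketch saves a checkpoint after each successful step and rolls back to it when an error is detected. It is not the FTMPS implementation: the names `CheckpointedTask` and `run_with_recovery`, the in-memory checkpoint, the use of `RuntimeError` as the detected-error signal, and the retry limit are all assumptions made for this example.

```python
import copy


class CheckpointedTask:
    """Hypothetical task whose state can be saved and restored (illustration only)."""

    def __init__(self, state):
        self.state = state
        # The initial state serves as the first checkpoint.
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save a deep copy of the current state as the recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(task, steps, max_retries=3):
    """Run each step; on a detected error, roll back and retry the step.

    A RuntimeError stands in for an error reported by the detection layer
    (an assumption of this sketch). A step that keeps failing after
    max_retries retries is treated as a permanent failure.
    """
    for step in steps:
        for _attempt in range(max_retries + 1):
            try:
                step(task.state)
                task.checkpoint()  # step succeeded: advance the recovery point
                break
            except RuntimeError:
                task.rollback()  # undo any partial update made by the step
        else:
            raise RuntimeError("step failed after retries; state was rolled back")
    return task.state
```

A step that fails transiently is simply re-executed from the last checkpoint, so its partial updates never reach later steps. A step that fails on every attempt leaves the task in its last consistent state, which is the point at which a real system would need diagnosis and possibly reconfiguration.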
Vounckx, J., Deconinck, G., Lauwereins, R., Viehöver, G., Wagner, R., Madeira, H., … Willeke, H. (1994). The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 797 LNCS, pp. 401–406). Springer Verlag. https://doi.org/10.1007/3-540-57981-8_151