Hardware fault containment in scalable shared-memory multiprocessors

Abstract

Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size. The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine. Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.

Citation (APA)

Teodosiu, D., Baxter, J., Govil, K., Chapin, J., Rosenblum, M., & Horowitz, M. (1997). Hardware fault containment in scalable shared-memory multiprocessors. In Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA (pp. 73–84). IEEE. https://doi.org/10.1145/264107.264141
