Fault-tolerant execution on cots multi-core processors with hardware transactional memory support

5Citations
Citations of this article
7Readers
Mendeley users who have this article in their library.
Get full text

Abstract

The demand for fault-tolerant execution on high performance computer systems increases due to higher fault rates resulting from smaller structure sizes. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial-of-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a software/ hardware hybrid approach, which targets Intel’s current x86 multi-core platforms of the Core and Xeon family. We leverage hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback. Redundant execution of processes and signature-based comparison of their computations provides error detection, and transactional wrapping enables error recovery. Existing applications are enhanced towards fault-tolerant redundant execution by post-link binary instrumentation. Hardware enhancements to further increase the applicability of the approach are proposed and evaluated with SPEC CPU 2006 benchmarks. The resulting performance overhead is 47% on average, assuming the existence of the proposed hardware support.

Cite

CITATION STYLE

APA

Haas, F., Weis, S., Ungerer, T., Pokam, G., & Wu, Y. (2017). Fault-tolerant execution on cots multi-core processors with hardware transactional memory support. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10172 LNCS, pp. 16–30). Springer Verlag. https://doi.org/10.1007/978-3-319-54999-6_2

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free