A case for epidemic fault detection and group membership in HPC storage systems

Shane Snyder; Philip Carns; Jonathan Jenkins; Kevin Harms; Robert Ross; Misbah Mubarak; Christopher Carothers

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8966, pp. 237-248 (2015)

DOI: 10.1007/978-3-319-17248-4_12


Abstract

Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components. As with many engineering problems, trade-offs must be made to account for the competing goals of fault detection efficiency and accuracy. Today’s production HPC services typically rely on distributed consensus algorithms and heartbeat monitoring for group membership. In this work, we investigate epidemic protocols to determine whether they would be a viable alternative. Epidemic protocols have been proposed in previous work for use in peer-to-peer systems, but they have the potential to increase scalability and decrease fault response time for HPC systems as well. We focus our analysis on the Scalable Weakly-consistent Infection-style Process Group Membership (SWIM) protocol. We begin by exploring how the semantics of this protocol differ from those of typical HPC group membership protocols, and we discuss how storage systems might need to adapt as a result. We use existing analytical models to choose appropriate SWIM parameters for an HPC use case. We then develop a new, high-resolution parallel discrete event simulation of the protocol to confirm existing analytical models and explore protocol behavior that cannot be readily observed with analytical models. Our preliminary results indicate that the SWIM protocol is a promising alternative for group membership in HPC storage systems, offering rapid convergence, tolerance to transient network failures, and minimal network load.
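As a rough illustration of the kind of analytical model the abstract refers to, the sketch below computes the expected number of protocol periods before a failed member is first probed. It assumes the simplified uniform-random probing model from the original SWIM analysis. The function names and the parameters `n` (group size) and `q_f` (fraction of non-faulty members) are illustrative assumptions, not values or code taken from this paper.

```python
import math


def ping_probability(n: int, q_f: float = 1.0) -> float:
    """Chance that a given member is probed at least once in one period.

    Assumed model: each of the other n - 1 members picks one probe target
    uniformly at random per protocol period, and only a fraction q_f of
    members are non-faulty and therefore probing.
    """
    return 1.0 - (1.0 - q_f / n) ** (n - 1)


def expected_detection_periods(n: int, q_f: float = 1.0) -> float:
    """Expected protocol periods until a failed member is first probed."""
    return 1.0 / ping_probability(n, q_f)


def limit_detection_periods(q_f: float = 1.0) -> float:
    """Large-group limit of the expectation above: 1 / (1 - e^(-q_f))."""
    return 1.0 / (1.0 - math.exp(-q_f))


if __name__ == "__main__":
    # Hypothetical group sizes, chosen only to show that the expectation
    # is nearly independent of group size.
    for n in (16, 1024, 65536):
        print(n, expected_detection_periods(n))
    print("limit", limit_detection_periods())
```

Under this model the expected detection latency approaches e/(e-1), roughly 1.58 protocol periods, regardless of group size, so the protocol period length is the main parameter governing detection time. Practical SWIM implementations typically replace random selection with round-robin probing to bound the worst case, which this sketch does not model.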

Citation (APA)

Snyder, S., Carns, P., Jenkins, J., Harms, K., Ross, R., Mubarak, M., & Carothers, C. (2015). A case for epidemic fault detection and group membership in HPC storage systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8966, pp. 237–248). Springer Verlag. https://doi.org/10.1007/978-3-319-17248-4_12
