Design and implementation of a scalable membership service for supercomputer resiliency-aware runtime

Yoav Tock; Benjamin Mandler; José Moreira; Terry Jones

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 8097 LNCS 354-366

DOI: 10.1007/978-3-642-40047-6_37

Abstract

As HPC systems and applications get bigger and more complex, we are approaching an era in which resiliency and run-time elasticity concerns become paramount. We offer a building block for an alternative resiliency approach in which computations can make progress while components fail, and in which the set of nodes can change dynamically throughout a computation's lifetime. The core of our solution is a hierarchical, scalable membership service providing eventual consistency semantics. An attribute replication service is used for hierarchy organization and is also exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra-large scales. The resulting middleware is general purpose while exploiting the unique features and architecture of HPC platforms. We have implemented and tested this system on BlueGene/P with Linux and, using worst-case analysis, evaluated the service's scalability as effective for up to 1M nodes. © 2013 Springer-Verlag.

Citation (APA)

Tock, Y., Mandler, B., Moreira, J., & Jones, T. (2013). Design and implementation of a scalable membership service for supercomputer resiliency-aware runtime. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8097 LNCS, pp. 354–366). https://doi.org/10.1007/978-3-642-40047-6_37
