Implementing reliable data structures for MPI services in high component count systems

Abstract

High-performance computing systems continue to grow: currently deployed systems exceed 160,000 cores, and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failures could become unacceptably frequent, limiting the usability of advanced computing infrastructures. In this work, we aim to ease the development of survivable systems and applications by implementing a reliable key/value data store based on a distributed hash table (DHT). Borrowing techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table. © 2009 Springer Berlin Heidelberg.
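To make the idea of a failure-tolerant key/value store concrete, the following is a minimal single-process sketch of the placement rule used by Kademlia-style DHTs: identifiers are compared by XOR distance, and each key is replicated on the k nodes whose identifiers are closest to the key's identifier. It is not the authors' MPI implementation; the names `key_id`, `xor_distance`, `closest_nodes`, and `ToyStore`, the 32-bit identifier width, and the replication factor of 3 are all assumptions chosen for illustration.

```python
import hashlib


def key_id(name: str, bits: int = 32) -> int:
    """Hash a string to a `bits`-wide integer identifier (illustrative width)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()  # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - bits)


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance between two identifiers: bitwise XOR."""
    return a ^ b


def closest_nodes(node_ids, target: int, k: int):
    """Return the k node identifiers closest to `target` by XOR distance."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]


class ToyStore:
    """In-memory stand-in for a replicated DHT (hypothetical, no networking)."""

    def __init__(self, node_ids, k: int = 3):
        self.k = k
        # Map each live node identifier to that node's local key/value table.
        self.nodes = {n: {} for n in node_ids}

    def put(self, key: str, value) -> None:
        # Replicate the pair on the k live nodes closest to the key's id.
        for n in closest_nodes(self.nodes, key_id(key), self.k):
            self.nodes[n][key] = value

    def get(self, key: str):
        # Probe live nodes in order of increasing distance from the key's id.
        for n in closest_nodes(self.nodes, key_id(key), len(self.nodes)):
            if key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)

    def fail(self, node_id: int) -> None:
        # Simulate a partial system failure: the node and its table vanish.
        del self.nodes[node_id]


if __name__ == "__main__":
    store = ToyStore([key_id(f"rank-{r}") for r in range(16)], k=3)
    store.put("checkpoint", b"state")
    # Fail the two replicas nearest the key; one replica should remain.
    for victim in closest_nodes(list(store.nodes), key_id("checkpoint"), 2):
        store.fail(victim)
    print(store.get("checkpoint"))
```

In this sketch a value survives as long as at least one of its k replicas is still live. A real service would also re-replicate data after a failure and route lookups across processes, neither of which is modeled here.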

Cite

Wozniak, J. M., Jacobs, B., Latham, R., Lang, S., Son, S. W., & Ross, R. (2009). Implementing reliable data structures for MPI services in high component count systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5759 LNCS, pp. 321–322). Springer Verlag. https://doi.org/10.1007/978-3-642-03770-2_39
