This paper presents the design and implementation of MPI_Rejoin() for MPICH-GF, a grid-enabled, fault-tolerant MPICH implementation. To provide fault tolerance for MPI applications, a failed process must be able to recover and continue execution. However, current MPI implementations do not support dynamic process management, and the information regarding communication channels cannot be restored. The 'rejoin' operation allows a restored process to rejoin the existing group by updating the corresponding entries of the channel table with its new physical address. We have verified, by running NPB applications, that our implementation correctly reconstructs the MPI communication structure. We also report the cost of the 'rejoin' operation. © Springer-Verlag Berlin Heidelberg 2003.
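The idea behind 'rejoin', updating a channel-table entry so that an unchanged logical rank points at a new physical address, can be illustrated with a minimal C sketch. This is not the MPICH-GF code: the types `channel_entry` and `channel_table`, the functions `table_init`, `table_set`, `table_mark_failed` and `table_rejoin`, and the host/port representation of a physical address are all hypothetical names assumed here for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PROCS 8   /* assumed upper bound on group size (illustrative) */
#define ADDR_LEN  64  /* assumed maximum host-name length (illustrative) */

/* Hypothetical channel-table entry: one logical rank -> one physical endpoint. */
typedef struct {
    char host[ADDR_LEN];
    int  port;
    int  alive;        /* 1 if the peer is believed reachable, 0 otherwise */
} channel_entry;

/* Hypothetical channel table held by each process in the group. */
typedef struct {
    channel_entry entries[MAX_PROCS];
    int           size;  /* number of ranks in the group */
} channel_table;

/* Create an empty table for a group of `size` ranks. */
void table_init(channel_table *t, int size) {
    assert(size > 0 && size <= MAX_PROCS);
    memset(t, 0, sizeof *t);
    t->size = size;
}

/* Record the physical address of `rank`; returns 0 on success, -1 on a bad rank. */
int table_set(channel_table *t, int rank, const char *host, int port) {
    if (rank < 0 || rank >= t->size) return -1;
    snprintf(t->entries[rank].host, ADDR_LEN, "%s", host);
    t->entries[rank].port  = port;
    t->entries[rank].alive = 1;
    return 0;
}

/* Mark a peer as failed; its old address is kept until it rejoins. */
void table_mark_failed(channel_table *t, int rank) {
    if (rank >= 0 && rank < t->size) t->entries[rank].alive = 0;
}

/* Sketch of the update step of a rejoin: the restored process keeps its
 * logical rank, and only the physical address in the entry is replaced. */
int table_rejoin(channel_table *t, int rank, const char *new_host, int new_port) {
    return table_set(t, rank, new_host, new_port);
}
```

In this sketch, a surviving process would call `table_mark_failed(&t, 2)` when rank 2 fails and `table_rejoin(&t, 2, "node-b", 6000)` once the restored process announces its new address, so that later sends to rank 2 use the new endpoint. How the new address is distributed to the group is outside the scope of the sketch.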
CITATION STYLE
Kim, S., Woo, N., Yeom, H. Y., Park, T., & Park, H. W. (2003). Design and implementation of dynamic process management for grid-enabled MPICH. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2840, 653–656. https://doi.org/10.1007/978-3-540-39924-7_87