Because they consist of large numbers of computing nodes, parallel cluster systems face a high risk of individual node failure. To overcome the high overhead of current fault-tolerant MPI systems, this paper presents TH-MPI, a fault-tolerant MPI for parallel cluster systems. Because it is integrated into the Linux kernel, TH-MPI is implemented in a more effective, transparent, and extensive way. Our experiments show that, with the support of dynamic kernel modules and diskless checkpointing, checkpointing in TH-MPI is effectively optimized.
CITATION STYLE
Chen, Y., Fang, Q., Du, Z., & Li, S. (2001). TH-MPI: OS Kernel integrated fault tolerant MPI. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2131, pp. 75–82). Springer Verlag. https://doi.org/10.1007/3-540-45417-9_15