We design and implement ChaRM64, a high-availability parallel run-time system that provides Checkpoint-based Rollback Recovery and Migration for parallel programs running on a cluster of IA-64 computers. We first discuss our user-level, single-process checkpoint/recovery library for IA-64 systems. Building on this library, ChaRM64 implements user-transparent coordinated checkpointing and rollback recovery (CRR), quasi-asynchronous migration, and dynamic reconfiguration. With these techniques and efficient error detection, ChaRM64 can handle cluster node crashes and transient hardware faults in an IA-64 cluster. ChaRM64 for PVM has been implemented on Linux, and an MPI version is under construction. To our knowledge, few similar projects have been completed for the IA-64 architecture. © Springer-Verlag Berlin Heidelberg 2004.
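As a rough illustration of the coordinated checkpointing and rollback recovery idea named in the abstract, the following is a minimal in-process sketch. It is not ChaRM64's implementation and does not use PVM or MPI; the `Worker`, `Coordinator`, and `run_demo` names are hypothetical, and real systems must also handle in-flight messages and write checkpoints to stable storage.

```python
import copy


class Worker:
    """Hypothetical stand-in for one parallel task (not a ChaRM64 API)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "acc": 0}

    def compute(self):
        # One unit of application work that mutates the task state.
        self.state["step"] += 1
        self.state["acc"] += self.rank * self.state["step"]

    def snapshot(self):
        # Assumption: a deep copy stands in for saving the process image.
        return copy.deepcopy(self.state)

    def restore(self, saved):
        self.state = copy.deepcopy(saved)


class Coordinator:
    """Takes a checkpoint of all workers at the same logical point."""

    def __init__(self, workers):
        self.workers = workers
        self.committed = None  # last globally consistent checkpoint

    def checkpoint(self):
        # Phase 1: collect a tentative snapshot from every worker.
        tentative = {w.rank: w.snapshot() for w in self.workers}
        # Phase 2: commit only after all snapshots have been collected.
        self.committed = tentative

    def rollback(self):
        # On a detected fault, every worker returns to the same checkpoint.
        if self.committed is None:
            raise RuntimeError("no committed checkpoint to roll back to")
        for w in self.workers:
            w.restore(self.committed[w.rank])


def run_demo():
    workers = [Worker(rank) for rank in range(3)]
    coordinator = Coordinator(workers)
    for _ in range(2):
        for w in workers:
            w.compute()
    coordinator.checkpoint()
    for w in workers:
        w.compute()  # work done after the checkpoint is lost on failure
    coordinator.rollback()
    return [w.state for w in workers]
```

Calling `run_demo()` returns the per-worker states as they were when the checkpoint was committed, so the step computed after the checkpoint is discarded for all workers together.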
CITATION STYLE
Zhang, Y., Wang, D., & Zheng, W. (2004). Parallel checkpoint/recovery on cluster of IA-64 computers. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3358, 212–216. https://doi.org/10.1007/978-3-540-30566-8_26