Parallel checkpoint/recovery on cluster of IA-64 computers

Abstract

We design and implement ChaRM64, a high-availability parallel run-time system that provides Checkpoint-based Rollback Recovery and Migration for parallel programs running on a cluster of IA-64 computers. We first discuss our user-level, single-process checkpoint/recovery library for IA-64 systems. ChaRM64 is built on this library and implements a user-transparent, coordinated checkpointing and rollback recovery (CRR) mechanism, quasi-asynchronous migration, and dynamic reconfiguration. With these techniques and efficient error detection, ChaRM64 can handle cluster node crashes and transient hardware faults in an IA-64 cluster. ChaRM64 for PVM has been implemented on Linux, and an MPI version is under construction. To our knowledge, few similar projects have been completed for the IA-64 architecture. © Springer-Verlag Berlin Heidelberg 2004.
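
The general idea of coordinated checkpointing with rollback recovery can be illustrated with a minimal sketch. This is not ChaRM64's protocol or API, which the abstract does not detail; it is a toy, single-process simulation in which the `Worker` class and the functions `coordinated_checkpoint`, `rollback_all`, and `demo` are all hypothetical names. It assumes a two-phase scheme: every worker first takes a tentative checkpoint, and the checkpoints are committed only if all workers succeed, so the committed set always forms a consistent global state.

```python
import copy


class Worker:
    """Hypothetical stand-in for one process of a parallel job."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "acc": 0}
        self._tentative = None  # checkpoint taken but not yet committed
        self._committed = None  # last globally consistent checkpoint

    def compute(self):
        # Advance the (toy) computation by one step.
        self.state["step"] += 1
        self.state["acc"] += self.rank * self.state["step"]

    def take_tentative(self):
        # Phase 1: save a private copy of the current state.
        self._tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        # Phase 2 (success): promote the tentative checkpoint.
        self._committed = self._tentative
        self._tentative = None

    def abort(self):
        # Phase 2 (failure): discard it, keeping the older checkpoint.
        self._tentative = None

    def rollback(self):
        if self._committed is None:
            raise RuntimeError("no committed checkpoint to roll back to")
        self.state = copy.deepcopy(self._committed)


def coordinated_checkpoint(workers):
    """Commit a new checkpoint only if every worker saved its state."""
    ok = all([w.take_tentative() for w in workers])
    for w in workers:
        if ok:
            w.commit()
        else:
            w.abort()
    return ok


def rollback_all(workers):
    """After a detected fault, restore every worker to the last checkpoint."""
    for w in workers:
        w.rollback()


def demo():
    workers = [Worker(rank) for rank in range(3)]
    for w in workers:
        w.compute()
    coordinated_checkpoint(workers)
    saved = [dict(w.state) for w in workers]
    for w in workers:
        w.compute()  # work done after the checkpoint, lost to a simulated fault
    rollback_all(workers)
    return saved == [w.state for w in workers]
```

Calling `demo()` checks that all workers return to the state saved at the checkpoint. A real system would additionally have to handle in-flight messages, write checkpoints to stable storage, and restart or migrate processes from failed nodes, none of which this sketch attempts.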

Cite

Zhang, Y., Wang, D., & Zheng, W. (2004). Parallel checkpoint/recovery on cluster of IA-64 computers. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3358, 212–216. https://doi.org/10.1007/978-3-540-30566-8_26
