Affinity-aware checkpoint restart

Abstract

Current checkpointing techniques employed to overcome faults in HPC applications result in inferior performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory architectures (NUMA), which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overheads, whereas execution times are up to nearly four times longer without affinity-aware restarts on 16 cores.
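To make the idea of preserving a task-to-core map concrete, the following is a minimal sketch, not the paper's BLCR implementation. It assumes a Linux host and uses Python's standard `os.sched_getaffinity` and `os.sched_setaffinity` calls; the helper names `capture_affinity` and `restore_affinity` and the JSON record layout are hypothetical, chosen only for illustration. NUMA page placement is not covered here; on Linux that would involve facilities such as `mbind` or `move_pages`.

```python
import json
import os


def capture_affinity(pid: int = 0) -> str:
    """Record the cores a task is pinned to (hypothetical checkpoint-side helper).

    pid 0 means the calling process. Linux only.
    """
    cores = sorted(os.sched_getaffinity(pid))
    return json.dumps({"cores": cores})


def restore_affinity(record: str, pid: int = 0) -> set:
    """Re-pin a task to the recorded cores (hypothetical restart-side helper).

    Returns the affinity mask in effect after the call.
    """
    cores = set(json.loads(record)["cores"])
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Checkpoint side: remember where the task was allowed to run.
    record = capture_affinity()

    # Simulate an affinity-unaware restart that lands on a different
    # placement by narrowing the mask to a single core.
    os.sched_setaffinity(0, {min(os.sched_getaffinity(0))})

    # Restart side: put the original task-to-core map back.
    print("restored cores:", sorted(restore_affinity(record)))
```

In a real C/R mechanism the record would be stored inside the checkpoint image and applied per task before the application resumes; this sketch only shows the save and re-pin steps for a single process.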

Cite

Saini, A., Rezaei, A., Mueller, F., Hargrove, P., & Roman, E. (2014). Affinity-aware checkpoint restart. In Proceedings of the 15th International Middleware Conference, Middleware 2014 (pp. 121–132). Association for Computing Machinery. https://doi.org/10.1145/2663165.2663325