Current checkpointing techniques employed to overcome faults in HPC applications degrade the performance of a number of applications after restart from a checkpoint. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory architectures (NUMA), which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37-73% for the NAS Parallel Benchmark codes and 6-12% for NAMD with negligible overhead, whereas execution times without affinity-aware restarts are up to nearly four times longer on 16 cores.
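To make the idea of preserving a task-to-core map concrete, the following is a minimal sketch, not the BLCR implementation described in the paper. It assumes a Linux host and uses Python's `os.sched_getaffinity` and `os.sched_setaffinity`; the function names `capture_affinity` and `restore_affinity` and the snapshot layout are hypothetical. NUMA page affinity, the second half of the technique, is not covered here, since it would need a memory-policy interface such as libnuma.

```python
import os


def capture_affinity(pid=0):
    """Record the cores a task may run on (pid 0 means the calling process).

    Hypothetical helper: a C/R mechanism would store this alongside the
    checkpoint image so the mapping survives the restart.
    """
    return {"cores": sorted(os.sched_getaffinity(pid))}


def restore_affinity(snapshot, pid=0):
    """Re-pin a task to the cores recorded at checkpoint time.

    Assumes the restart host exposes the same core IDs; a real mechanism
    would have to handle the case where it does not.
    """
    os.sched_setaffinity(pid, snapshot["cores"])
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    saved = capture_affinity()           # taken at "checkpoint" time
    os.sched_setaffinity(0, {saved["cores"][0]})  # simulate a changed pinning
    print("after simulated restart:", sorted(os.sched_getaffinity(0)))
    print("after restore:", restore_affinity(saved))
```

The sketch only shows the record-and-reapply pattern for core affinity; the performance figures quoted in the abstract come from the paper's BLCR-based design, not from code like this.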
CITATION STYLE
Saini, A., Rezaei, A., Mueller, F., Hargrove, P., & Roman, E. (2014). Affinity-aware checkpoint restart. In Proceedings of the 15th International Middleware Conference, Middleware 2014 (pp. 121–132). Association for Computing Machinery. https://doi.org/10.1145/2663165.2663325