Affinity-aware checkpoint restart

Ajay Saini; Arash Rezaei; Frank Mueller; Paul Hargrove; Eric Roman

Conference Proceedings, Open Access

Proceedings of the 15th International Middleware Conference, Middleware 2014 (2014) 121-132

DOI: 10.1145/2663165.2663325

2 Citations

8 Readers

Abstract

Current checkpointing techniques employed to overcome faults in HPC applications result in inferior performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory architectures (NUMA), which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overheads, whereas restarts without affinity awareness lead to execution times up to nearly four times longer on 16 cores.
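To make the idea of preserving a task-to-core map concrete, here is a minimal sketch, assuming a Linux host and using only Python's standard `os.sched_getaffinity` and `os.sched_setaffinity` calls. It is not BLCR's implementation or the paper's design: the helper names `save_affinity` and `restore_affinity` and the JSON sidecar file are hypothetical, chosen for illustration only. NUMA page affinity is not covered, since the Python standard library exposes no memory-policy interface.

```python
import json
import os


def save_affinity(path, pid=0):
    """Record the cores a task may run on (hypothetical helper, Linux only)."""
    cores = sorted(os.sched_getaffinity(pid))
    # The sidecar file stands in for metadata a C/R mechanism might keep
    # alongside the checkpoint image; this layout is an assumption.
    with open(path, "w") as f:
        json.dump({"cores": cores}, f)
    return cores


def restore_affinity(path, pid=0):
    """Re-pin a task to the cores recorded at checkpoint time (hypothetical helper)."""
    with open(path) as f:
        saved = set(json.load(f)["cores"])
    # Raises OSError if none of the recorded cores is available on this host.
    os.sched_setaffinity(pid, saved)
    return os.sched_getaffinity(pid)
```

At checkpoint time one might call `save_affinity("task.affinity.json")`, and after restart `restore_affinity("task.affinity.json")`, so the task resumes on the cores it was originally pinned to rather than wherever the scheduler places it.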

Citation (APA)

Saini, A., Rezaei, A., Mueller, F., Hargrove, P., & Roman, E. (2014). Affinity-aware checkpoint restart. In Proceedings of the 15th International Middleware Conference, Middleware 2014 (pp. 121–132). Association for Computing Machinery. https://doi.org/10.1145/2663165.2663325

Readers' Seniority

PhD / Post grad / Masters / Doc: 4 (67%)

Researcher: 2 (33%)

Readers' Discipline

Computer Science: 6 (86%)

Engineering: 1 (14%)
