Checkpointing Strategies for Shared High-Performance Computing Platforms

  • Herault T
  • Robert Y
  • Bouteiller A
  • et al.
N/ACitations
Citations of this article
6Readers
Mendeley users who have this article in their library.

Abstract

Input/output (I/O) from various sources often contend for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR ready, large-scale platforms without a large investment in the I/O subsystem.

Cite

CITATION STYLE

APA

Herault, T., Robert, Y., Bouteiller, A., Arnold, A., Ferreira, K. B., George, G., & Dongarra, J. (2019). Checkpointing Strategies for Shared High-Performance Computing Platforms. International Journal of Networking and Computing, 9(1), 28–52. https://doi.org/10.15803/ijnc.9.1_28

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free