Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers

Abstract

With the arrival of exascale supercomputers, the mean time between failures is expected to decrease. Classical checkpoint-restart approaches are too expensive at that scale. Local-failure local-recovery (LFLR) strategies promise to reduce these costs, but implementing them in any sufficiently large simulation environment is a challenging task. In this paper we discuss how LFLR methods can be incorporated into a PDE framework, focusing on the linear solvers as the innermost component. We discuss how Krylov solvers can be modified to support LFLR and present numerical tests. We exemplify our approach by reporting on the implementation of these features in the Dune framework, present C++ software abstractions that simplify the incorporation of LFLR techniques, and show how we use these in our solver library. To reduce the memory costs of full remote backups, we further investigate the benefits of lossy compression and in-memory checkpointing.
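To make the idea of a recoverable Krylov solver with lossy in-memory checkpoints more concrete, the following is a minimal single-process sketch. It is not the Dune API and not the algorithm from the paper. It assumes a conjugate gradient solver on a 1D Laplacian, uses truncation to single precision as a stand-in for lossy compression, and simulates a failure by discarding the iterate at a chosen iteration. All names (`LossyCheckpoint`, `cg_with_recovery`, `fail_at`, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y = A x for the 1D Laplacian stencil (-1, 2, -1); a stand-in SPD operator.
Vec apply_laplacian(const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? x[i - 1] : 0.0;
        const double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
    return y;
}

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// ||b - A x|| / ||b||, computed from scratch (not from the CG recurrence).
double relative_residual(const Vec& x, const Vec& b) {
    const Vec Ax = apply_laplacian(x);
    double num = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) num += (b[i] - Ax[i]) * (b[i] - Ax[i]);
    return std::sqrt(num / dot(b, b));
}

// Hypothetical lossy in-memory checkpoint: the iterate is kept in single
// precision, halving the backup size at the cost of rounding error.
struct LossyCheckpoint {
    std::vector<float> data;
    void save(const Vec& x) {
        data.resize(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) data[i] = static_cast<float>(x[i]);
    }
    Vec restore() const { return Vec(data.begin(), data.end()); }
};

struct SolveResult { Vec x; int iterations; bool recovered; };

// CG that checkpoints x every `checkpoint_interval` iterations. At iteration
// `fail_at` the iterate is treated as lost (simulated failure); the solver
// restores the last checkpoint, recomputes the residual, and restarts CG.
// Pass a negative `fail_at` for a failure-free run.
SolveResult cg_with_recovery(const Vec& b, double tol, int max_iter,
                             int checkpoint_interval, int fail_at) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b, p = b;          // x0 = 0, so r0 = b
    double rr = dot(r, r);
    const double stop = tol * tol * dot(b, b);
    LossyCheckpoint ckpt;
    ckpt.save(x);
    bool recovered = false;
    int k = 0;
    for (; k < max_iter && rr > stop; ++k) {
        if (k == fail_at) {
            x = ckpt.restore();            // lossy copy of an earlier iterate
            const Vec Ax = apply_laplacian(x);
            for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ax[i];
            p = r;                         // restart the Krylov recurrence
            rr = dot(r, r);
            recovered = true;
            if (rr <= stop) break;
        }
        const Vec Ap = apply_laplacian(p);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        if ((k + 1) % checkpoint_interval == 0) ckpt.save(x);
    }
    return {x, k, recovered};
}
```

Calling `cg_with_recovery(b, 1e-8, 500, 5, 12)` with a right-hand side of ones simulates a failure at iteration 12 and resumes from the checkpoint taken after iteration 10. In this sketch the rounding error of the checkpoint only perturbs the restart point, because the residual is recomputed from the restored iterate before the iteration continues.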

Cite

Altenbernd, M., Dreier, N. A., Engwer, C., & Göddeke, D. (2021). Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12456 LNCS, pp. 17–38). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-67077-1_2