ABHA: A framework for autonomic job recovery

Abstract

Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The Agent Based High Availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster computing environments. An agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project. © IFIP International Federation for Information Processing 2004.
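
The abstract does not show the agent API itself, so the following is only a minimal sketch of how user-defined diagnosis agents of this kind might be organized. Every name in it (`JobFailure`, `Diagnosis`, `RecoveryAgent`, `RecoveryService`, the two example agents, and the log strings they match) is a hypothetical illustration, not ABHA's actual interface. It mirrors the three issues named above: a failure is recognized, an agent diagnoses it and decides whether a restart makes sense, and the outcome is recorded so later failures can be handled with that knowledge.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional

# NOTE: all classes below are hypothetical illustrations, not the ABHA API.


@dataclass
class JobFailure:
    """Assumed record describing a failed batch job."""
    job_id: str
    exit_code: int
    log_tail: str


@dataclass
class Diagnosis:
    """Assumed result of a diagnosis: a cause label and a restart decision."""
    cause: str
    restart: bool


class RecoveryAgent(ABC):
    """Base class a user would subclass to define a diagnosis agent."""

    @abstractmethod
    def diagnose(self, failure: JobFailure) -> Optional[Diagnosis]:
        """Return a Diagnosis, or None if this agent does not recognize the failure."""


class DiskFullAgent(RecoveryAgent):
    def diagnose(self, failure: JobFailure) -> Optional[Diagnosis]:
        # Restarting is pointless until space is freed, so do not restart.
        if "No space left on device" in failure.log_tail:
            return Diagnosis(cause="disk_full", restart=False)
        return None


class TransientNetworkAgent(RecoveryAgent):
    def diagnose(self, failure: JobFailure) -> Optional[Diagnosis]:
        # A timeout is treated here as transient, so a restart is suggested.
        if "Connection timed out" in failure.log_tail:
            return Diagnosis(cause="network_timeout", restart=True)
        return None


@dataclass
class RecoveryService:
    """Asks each registered agent in turn and keeps a count of diagnosed causes."""
    agents: list
    history: dict = field(default_factory=dict)  # cause label -> occurrences

    def handle(self, failure: JobFailure) -> Optional[Diagnosis]:
        for agent in self.agents:
            diagnosis = agent.diagnose(failure)
            if diagnosis is not None:
                # Record the cause so future mitigation can draw on it.
                self.history[diagnosis.cause] = self.history.get(diagnosis.cause, 0) + 1
                return diagnosis
        return None  # no agent recognized the failure
```

In this sketch, a caller would build `RecoveryService([DiskFullAgent(), TransientNetworkAgent()])` and pass each `JobFailure` to `handle`, restarting the job only when the returned diagnosis has `restart` set. Returning `None` for unrecognized failures leaves room for escalation to a human operator.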

Cite

Earl, C., Remolina, E., Ong, J., Brown, J., Kuszmaul, C., & Stone, B. (2004). ABHA: A framework for autonomic job recovery. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3278, 259–262. https://doi.org/10.1007/978-3-540-30184-4_23
