Abstract
Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The Agent Based High Availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster computing environments. An agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project. © IFIP International Federation for Information Processing 2004.
Cite
Earl, C., Remolina, E., Ong, J., Brown, J., Kuszmaul, C., & Stone, B. (2004). ABHA: A framework for autonomic job recovery. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3278, 259–262. https://doi.org/10.1007/978-3-540-30184-4_23