An approach for ensuring reliable functioning of a supercomputer based on a formal model

10Citations
Citations of this article
11Readers
Mendeley users who have this article in their library.
Get full text

Abstract

In this article we describe the Octotron project intended to ensure reliability and sustainability of a supercomputer. Octotron is based on a formal model of computing system that describes system components and their interconnections in graph form. The model determines relations between data describing current supercomputer state (monitoring data) under which all components are functioning properly. Relations are given in form of rules, with the input of real monitoring data. If these relations are violated, Octotron registers the presence of abnormal situation and performs one of the predefined actions: notification of system administrators, logging, disabling or restarting faulty hardware or software components, etc. This paper describes the general structure of the model, augmented with details of its realization and evaluation at supercomputing center in Moscow State University.

Cite

CITATION STYLE

APA

Antonov, A., Nikitenko, D., Shvets, P., Sobolev, S., Stefanov, K., Voevodin, V., … Zhumatiy, S. (2016). An approach for ensuring reliable functioning of a supercomputer based on a formal model. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9573, pp. 12–22). Springer Verlag. https://doi.org/10.1007/978-3-319-32149-3_2

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free