In this article we describe the Octotron project, which is intended to ensure the reliability and sustainability of a supercomputer. Octotron is based on a formal model of the computing system that describes system components and their interconnections in the form of a graph. The model defines the relations between data describing the current state of the supercomputer (monitoring data) that hold when all components are functioning properly. These relations are expressed as rules that take real monitoring data as input. If a relation is violated, Octotron registers an abnormal situation and performs one of the predefined actions: notifying system administrators, logging, disabling or restarting faulty hardware or software components, etc. This paper describes the general structure of the model, together with details of its implementation and evaluation at the supercomputing center of Moscow State University.
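The idea of a graph model whose rules are checked against incoming monitoring data can be illustrated with a minimal sketch. It is not Octotron's actual API: the names `Component`, `Rule`, `Model`, the attribute `cpu_temp`, the 80-degree threshold, and the reaction label `"notify"` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Attrs = Dict[str, float]


@dataclass
class Component:
    """A vertex of the model graph (hypothetical structure)."""
    name: str
    attrs: Attrs = field(default_factory=dict)  # latest monitoring data


@dataclass
class Rule:
    """A relation that must hold while the component is healthy."""
    description: str
    holds: Callable[[Attrs], bool]  # returns True while the relation is satisfied
    reaction: str                   # label of the predefined action, e.g. "notify"


class Model:
    def __init__(self) -> None:
        self.components: Dict[str, Component] = {}
        self.links: List[Tuple[str, str, str]] = []  # (source, relation, target) edges
        self.rules: Dict[str, List[Rule]] = {}

    def add_component(self, name: str) -> Component:
        comp = Component(name)
        self.components[name] = comp
        self.rules[name] = []
        return comp

    def link(self, source: str, relation: str, target: str) -> None:
        self.links.append((source, relation, target))

    def add_rule(self, name: str, rule: Rule) -> None:
        self.rules[name].append(rule)

    def update(self, name: str, **values: float) -> List[Tuple[str, str, str]]:
        """Import monitoring data and return one entry per violated rule."""
        comp = self.components[name]
        comp.attrs.update(values)
        return [
            (name, rule.description, rule.reaction)
            for rule in self.rules[name]
            if not rule.holds(comp.attrs)
        ]


model = Model()
model.add_component("chassis-1")
model.add_component("node-1")
model.link("chassis-1", "contains", "node-1")
model.add_rule(
    "node-1",
    Rule(
        description="CPU temperature below 80 C",
        holds=lambda a: a.get("cpu_temp", 0.0) < 80.0,  # assumed threshold
        reaction="notify",
    ),
)

ok = model.update("node-1", cpu_temp=65.0)   # relation holds, nothing to report
bad = model.update("node-1", cpu_temp=91.5)  # relation violated, reaction requested
```

In this sketch `update` only reports which reactions are due; a real system would dispatch them to handlers for notification, logging, or disabling and restarting components.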
Antonov, A., Nikitenko, D., Shvets, P., Sobolev, S., Stefanov, K., Voevodin, V., … Zhumatiy, S. (2016). An approach for ensuring reliable functioning of a supercomputer based on a formal model. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9573, pp. 12–22). Springer Verlag. https://doi.org/10.1007/978-3-319-32149-3_2