High-performance computers (HPCs) have contributed to rapid scientific discovery, global economic prosperity, and defense-related applications. However, their complexity makes them difficult to troubleshoot, which calls their reliability into question and leaves these supercomputing systems susceptible to malicious behavior and cyber attacks. Malicious objects have been investigated in the context of computer networks, but large-scale parallel systems have received limited attention. In this chapter, we present a process that characterizes observed failures in supercomputing infrastructures caused by variations of consistent intentional attacks. First, we present a data network extrapolation (DNE) process that automatically performs failure accounting and error checking while taking into account the tree-like reliability infrastructure of an HPC. Next, we characterize the failures both dynamically and statically. By introducing a normalization metric, we observe that the complete spectrum of failure observations is deterministic and depends on the total number of failed jobs, the time between processed jobs, and the total number of processed jobs per node. Simulations using the Structural Simulation Toolkit (SST) show that our approach is highly effective for representing observed failures both dynamically and statically. Furthermore, our results can be applied to improve job-based scheduling in supercomputing environments.
Clark, A. D., & Absher, J. M. (2018). Cyber-Surveillance Analysis for Supercomputing Environments. In Advanced Sciences and Technologies for Security Applications (pp. 395–412). Springer. https://doi.org/10.1007/978-3-319-68533-5_19