IFIP WG 10.4 - Dependable Computing and Fault Tolerance Dependability: Basic Concepts and Terminology DRAFT August 1994
INTERNATIONAL FEDERATION FOR INFORMATION PROCESSING
IFIP WG 10.4 - Dependable Computing and Fault Tolerance
J.C. Laprie (ed.)
Dependability: Basic Concepts and Terminology

Contributors (this list is provisional: it is the list of contributors to the five-language version published at Springer-Verlag):
T. Anderson (The University of Newcastle upon Tyne, UK)
A. Avizienis (UCLA, Los Angeles, California, USA)
W.C. Carter (Bailey Island, Maine, USA)
A. Costes (LAAS-CNRS, Toulouse, France)
F. Cristian (UCSD, San Diego, California, USA)
Y. Koga (National Defense Academy, Yokosuka, Japan)
H. Kopetz (Technische Universität Wien, Austria)
J.H. Lala (C.S. Draper Lab., Cambridge, Massachusetts, USA)
J.C. Laprie (LAAS-CNRS, Toulouse, France)
J.F. Meyer (University of Michigan, Ann Arbor, Michigan, USA)
B. Randell (The University of Newcastle upon Tyne, UK)
A.S. Robinson (STDC, Reston, Virginia, USA)
Contents
1- Introduction
2- Basic Definitions
3- On System Function, Behavior, and Structure
4- The Attributes of Dependability
5- The Impairments to Dependability
   5.1- Failures
   5.2- Errors
   5.3- Faults
   5.4- Fault pathology
6- The Means for Dependability
   6.1- Fault tolerance
   6.2- Fault removal
   6.3- Fault forecasting
   6.4- Dependencies between the means for dependability
7- Summary and Conclusion
8- Glossary
9- References
1- Introduction

This document is aimed at giving informal but precise definitions characterizing the various attributes of computing systems dependability. It is a contribution to the work undertaken within the "Reliable and Fault Tolerant Computing" scientific and technical community in order to propose clear and widely acceptable definitions for some basic concepts [Avizienis 67, Jessep 77, Melliar-Smith & Randell 77, Avizienis 78, Randell et al. 78, Carter 79, Anderson & Lee 81, FTCS 82, Siewiorek & Swarz 82, Cristian et al. 85a, Laprie 85, Avizienis & Laprie 86, Laprie 89]. The document results from revising and updating the English section of [Laprie 92a].

Dependability is first introduced as a global concept which subsumes the usual attributes of reliability, availability, safety, and security. The basic definitions given in the first section are then commented upon, and supplemented by additional definitions, in the subsequent sections. A glossary given in the annex recapitulates the definitions given throughout the document. Boldface characters are used when a term is defined, italic characters being an invitation to focus the reader's attention.

The guidelines which have governed this presentation can be summed up as follows:
• search for a reduced number of concepts enabling the dependability attributes to be expressed;
• use of terms which are identical to, or as close as possible to, those generally used; as a rule, a term which has not been defined retains its ordinary sense (as given by any dictionary);
• emphasis on integration [Goldberg 82, Randell 86] (as opposed to specialization) through the independence of the given definitions with respect to the classes of faults.

This document can be seen as a minimum consensus within the community in order to facilitate fruitful interactions; in addition, this document is hoped to be suitable a) for use by other bodies (including standards organizations), and b) for educational purposes.
In this view, the associated terminology effort is not an end in itself: words are only of interest insofar as they unequivocally label concepts, and enable ideas and viewpoints to be shared. The document makes no pretension of being a state-of-the-art survey or "Tablets of Stone": the concepts that are presented have to evolve with technology, and with our progress in understanding and mastering the specification, design and assessment of dependable computer systems.
2- Basic Definitions

Dependability is that property of a computer system such that reliance can justifiably be placed on the service it delivers. The service delivered by a system is its behavior as it is perceived by its user(s); a user is another system (physical, human) which interacts with the former.

Depending on the application(s) intended for the system, different emphasis may be put on different facets of dependability, i.e. dependability may be viewed according to different, but complementary, properties, which enable the attributes of dependability to be defined:
• the readiness for usage leads to availability;
• the continuity of service leads to reliability;
• the non-occurrence of catastrophic consequences on the environment leads to safety;
• the non-occurrence of unauthorized disclosure of information leads to confidentiality;
• the non-occurrence of improper alterations of information leads to integrity;
• the aptitude to undergo repairs and evolutions leads to maintainability.
Associating integrity and availability with respect to authorized actions, together with confidentiality, leads to security.

A system failure occurs when the delivered service deviates from fulfilling the system function, the latter being what the system is intended for. An error is that part of the system state which is liable to lead to subsequent failure: an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault.
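As an illustration, and only as an illustration (the following sketch is not part of the terminology; the sensor example and all identifiers in it are invented for this purpose), the causal chain fault → error → failure can be sketched in a few lines of code: a fault present in a component yields an erroneous state when activated, and the error becomes a failure when it reaches the delivered service.

```python
# Illustrative sketch of the fault -> error -> failure chain.
# Assumption: a hypothetical sensor whose reading may be "stuck at" a
# fixed value (a permanent fault).

class Sensor:
    def __init__(self, stuck_at=None):
        # The fault: the adjudged cause (here, a stuck-at reading).
        self.stuck_at = stuck_at

    def read(self, true_value):
        # Fault activation: the fault turns the state into an error.
        return self.stuck_at if self.stuck_at is not None else true_value

def delivered_service(sensor, true_value, tolerance=0.5):
    reading = sensor.read(true_value)               # possibly erroneous state
    failed = abs(reading - true_value) > tolerance  # error reaches the service
    return reading, failed

ok_sensor = Sensor()
faulty_sensor = Sensor(stuck_at=0.0)

print(delivered_service(ok_sensor, 20.0))      # (20.0, False): no failure
print(delivered_service(faulty_sensor, 20.0))  # (0.0, True): a value failure
```

The sketch also shows why a fault need not lead to a failure: as long as the faulty sensor is not read, or its erroneous reading stays within the tolerance, the delivered service is not affected.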
The development of a dependable computing system calls for the combined utilization of a set of methods which can be classed into:
• fault prevention: how to prevent fault occurrence or introduction;
• fault tolerance: how to provide a service up to fulfilling the system function in spite of faults;
• fault removal: how to reduce the presence (number, seriousness) of faults;
• fault forecasting: how to estimate the present number, the future incidence, and the consequences of faults.
The notions introduced up to now can be grouped into three classes (figure 1):
• the impairments to dependability: faults, errors, failures; they are undesired, but not in principle unexpected, circumstances causing or resulting from un-dependability (whose definition is very simply derived from the definition of dependability: reliance cannot, or will no longer, be placed on the service);
• the means for dependability: fault prevention, fault tolerance, fault removal, fault forecasting; these are the methods and techniques enabling one a) to provide the ability to deliver a service on which reliance can be placed, and b) to reach confidence in this ability;
• the attributes of dependability: availability, reliability, safety, confidentiality, integrity, maintainability; these a) enable the properties which are expected from the system to be expressed, and b) allow the system quality resulting from the impairments and the means opposing them to be assessed.

[Figure 1 - The dependability tree: DEPENDABILITY branches into its ATTRIBUTES (availability, reliability, safety, confidentiality, integrity, maintainability), the MEANS (fault prevention, fault tolerance, fault removal, fault forecasting), and the IMPAIRMENTS (faults, errors, failures).]
3- On System Function, Behavior, and Structure

Up to now, a system has been implicitly considered as a whole, emphasizing its externally perceived behavior. A definition complying with this "black box" view is: an entity having interacted or interfered, interacting or interfering, or likely to interact or interfere with other entities, i.e., with other systems. These other systems have been, are, or will be the environment of the considered system [1]. A system user is that part of the environment which interacts with the considered system: the user provides inputs to and/or receives outputs from the system, its distinguishing feature being to use the service delivered by the system.

As already indicated in section 2, the function of a system is what the system is intended for [Kuipers 85]. The behavior of a system is what the system does. What makes it do what it does is the structure of the system [Ziegler 76]. Adopting the spirit of [Anderson & Lee 81], a system, from a structural ("white box" or "glass box") viewpoint, is a set of components bound together in order to interact; a component is another system, etc. The recursion stops when a system is considered as being atomic: any further internal structure cannot be discerned, or is not of interest and can be ignored. The term "component" has to be understood in a broad sense: layers of a system as well as intra-layer components; in addition, a component being itself a system, it embodies the interrelation(s) of the components of which it is composed.

A more classical definition of system structure is what a system is. Such a definition fits in perfectly when representing a system without accounting explicitly for any impairments to dependability, and thus in the case where the structure is considered as fixed. We do not want to restrict ourselves to systems whose structure is fixed. In particular, we need to allow for structural changes caused by, or resulting from, dependability impairments.
It thus appears that a structure may have states [2]. Hence a definition for the notion of state: a condition of being with respect to a set of phenomena, whether of behavior or of structure [3].

From its very definition (the user-perceived behavior), the service delivered by a system is clearly an abstraction of the system's behavior. It is noteworthy that this abstraction is highly dependent on the application that the computer system supports. An example of this dependence is the important role played in this abstraction by time: the time granularities of the system and of its user(s) are generally different, and the difference varies from one application to another. In addition, the service is of course not restricted to the outputs only, but encompasses all interactions which are of interest to the user; for instance, scanning sensors clearly is part of the service expected from a monitoring system.

We have up to now used the singular for function and service. A system generally fulfils more than one function, and delivers more than one service. The function and the service can thus be seen as composed of function items and of service items. For the sake of simplicity, we shall simply use the plurals functions, services when it is worth distinguishing several items of function or of service. Of special interest with respect to dependability are timeliness properties. A real-time function or service is a function or service that is required to be fulfilled or delivered within finite time intervals dictated by the environment, and a real-time system is a system which fulfils at least one real-time function or delivers at least one real-time service [PDCS 90].

Based on the preceding view of system structure, the notions of function and of service apply equally naturally to the components.

[1] a) Giving recursive definitions makes it possible to emphasize relativity with respect to the adopted viewpoint. So it is for the notion of system: a given system's boundaries may vary depending on whether it is viewed by its designer(s), by its user(s), by its maintenance crew, etc. b) The past, present and future forms are employed in order to stress that a system's environment may vary with time, especially with respect to the phases of its life-cycle. For instance, the notion of "programming environment" fits into the given definition, as well as the physical environment a system is confronted with during operation.
This is especially interesting in the design process, when off-the-shelf components, either hardware or software, are used: what is of most interest to the designer is the function and/or the service they are able to provide, rather than their detailed (internal) behavior.

Of central importance to dependability is the system specification, i.e. an agreed description of the system function or service [4]. The system function or service is usually specified first in terms of what should be fulfilled or delivered regarding the system's primary aim(s) (e.g., performing transactions, controlling or monitoring an industrial process, piloting a plane or a rocket, etc.). When considering safety- or security-related systems, this specification is generally completed with what should not happen (e.g. the hazardous states from which a catastrophe may ensue, or the disclosure of sensitive information). Such a specification may in turn lead to specifying additional functions or services that the system should fulfill or deliver in order to reduce the likelihood of what should not happen (e.g. authenticating a user and checking his or her authorization rights). In addition, these various specifications may be:
a) expressed according to various degrees of detail: requirement specification, design specification, realization specification, etc.;
b) decomposed according to the absence or the presence of failures; the former case relates to what is usually termed the nominal mode of operation, and the latter case may relate to the degraded mode of operation if the surviving resources are no longer sufficient for delivering the nominal service(s).
As a consequence, there is usually not a single specification, but several, and, clearly, a system may fail with respect to some of these multiple specifications, and still comply with the others.

Expressing the functions of a system is an activity which is naturally initiated during the very first steps of a system development. It is however not generally limited to this phase of a system's life: experience shows that specifying a system's functions is pursued during all the system's life, due to the difficulty in identifying what is expected from a system.

[2] It could therefore be said that a "structure" also has a "behavior", especially with respect to the dependability impairments, even if the considered velocities of evolution with respect i) to the user's requests on one hand, and ii) to the impairments on the other, are hopefully different.
[3] This definition is aimed at emphasizing the relativity of the notion of state, which depends directly upon the phenomena considered, e.g. state with respect to computation activity, state with respect to failure occurrence.
[4] The agreement is usually to take place between two persons or corporate bodies (in fact, legal personae): the system supplier (in a broad sense of the term: designer, builder, vendor, etc.) and its human user(s); the agreement may be implicit, as when purchasing a system which comes with its specification and user's manual, or when using off-the-shelf systems.
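As an illustration of the distinction between nominal and degraded modes of operation (a sketch only; the monitoring example and its resource threshold are assumptions made for the example, not part of the concepts), the selection of the service delivered according to the surviving resources may be expressed as:

```python
# Illustrative sketch: selecting the nominal or degraded mode of
# operation according to the surviving resources.
# Assumption: a hypothetical monitoring system nominally needs at least
# four working sensors; with fewer, only a reduced service is delivered.

NOMINAL_MIN_SENSORS = 4  # assumed threshold for the nominal service

def select_mode(working_sensors):
    """Return the mode of operation given the surviving resources."""
    if len(working_sensors) >= NOMINAL_MIN_SENSORS:
        return "nominal"
    elif working_sensors:
        return "degraded"   # reduced, but still useful, service
    else:
        return "outage"     # no service can be delivered

print(select_mode(["s1", "s2", "s3", "s4"]))  # nominal
print(select_mode(["s1", "s2"]))              # degraded
print(select_mode([]))                        # outage
```

The point of the sketch is that "degraded" is itself a specified mode: the system complies with the degraded-mode specification while failing with respect to the nominal one.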
4- The Attributes of Dependability

The attributes of dependability have been defined in section 2 according to different properties, which may be more or less emphasized depending on the application intended for the computer system under consideration:
• availability is always required, although to a varying degree depending on the application;
• reliability, safety, confidentiality may or may not be required according to the application.
Integrity is a pre-requisite for availability, reliability and safety, but may not be so for confidentiality (for instance when considering attacks via covert channels or passive listening). The definition given for integrity (absence of improper alterations of information) generalizes the usual definitions, which relate to the notion of authorized actions only (e.g., prevention of the unauthorized amendment or deletion of information [CEC 91], assurance of approved data alterations [Jacob 91]); naturally, when a system implements an authorization policy, "improper" encompasses "unauthorized".

Whether a system holds the properties which have enabled the attributes of dependability to be defined should be interpreted in a relative, probabilistic sense, and not in an absolute, deterministic sense: due to the unavoidable presence or occurrence of faults, systems are never totally available, reliable, safe, or secure.

The definition given for maintainability deliberately goes beyond corrective maintenance, aimed at preserving or improving the system's ability to deliver a service fulfilling its function (relating to reparability only), and encompasses via evolvability the other forms of maintenance: adaptive maintenance, which adjusts the system to environmental changes (e.g. change of operating systems or system data-bases), and perfective maintenance, which improves the system's function by responding to customer- and designer-defined changes, which may involve removal of specification faults [Ramamoorthy 84].
Actually, maintainability conditions a system's dependability all along its life-cycle, due to the unavoidable evolutions during its operational life.

Security has not been introduced as a single attribute of dependability, in agreement with the usual definitions of security, which view it as a composite notion, namely "the combination of confidentiality, the prevention of the unauthorized disclosure of information, integrity, the
prevention of the unauthorized amendment or deletion of information, and availability, the prevention of the unauthorized withholding of information" [CEC 91].

From their definitions, availability and reliability emphasize the avoidance of failures, safety the avoidance of a specific class of failures (catastrophic failures), and security the prevention of what can be viewed as a specific class of faults (the unauthorized access to and/or handling of information). Reliability and availability are thus closer to each other than they are to safety on one hand, and to security on the other; reliability and availability can thus be grouped together [Laprie 92b, Jonson & Olovsson 92], and be collectively defined via the avoidance or minimization of service outages. However, this remark should not lead one to consider that reliability and availability do not depend on the system environment: it has long been recognized that a computing system's reliability/availability is highly correlated with its utilization profile, whether the failures are due to physical faults or to design faults (see e.g., [Iye 82]).

The variations in the emphasis to be put on the attributes of dependability have a direct influence on the appropriate balance of the techniques addressed in the previous section to be employed in order that the resulting system be dependable. This problem is all the more difficult as some of the attributes are antagonistic (e.g. availability and safety, availability and security), necessitating that trade-offs be performed. Considering the three main design dimensions of a computer system, i.e. cost, performance and dependability, the problem is further exacerbated by the fact that the dependability dimension is less understood than the cost-performance design space [Siewiorek & Johnson 92].
5- The Impairments to Dependability

In this section, we examine in turn the notions of failure, error and fault, as well as their manifestation mechanisms, i.e. fault pathology.

5.1- Failures

Failure occurrence has been defined in section 2 with respect to the function of a system, not with respect to its specification. Indeed, while an unacceptable behavior is generally identified as a failure due to a deviation from compliance with the specification, it may happen that a behavior complies with the specification and is nevertheless unacceptable for the system user(s), thus uncovering a specification fault. In the latter case, recognizing that the event is undesired (and is in fact a failure) can only be performed after its occurrence, for instance via its consequences [5].

A system may not, and generally does not, always fail in the same way. The ways a system can fail are its failure modes, which may be characterized according to three viewpoints: domain, perception by the system users, and consequences on the environment.

The failure domain viewpoint leads one to distinguish:
• value failures: the value of the delivered service no longer fulfils the system function;
• timing failures: the timing of the service delivery no longer fulfils the system function.
Such general definitions can be refined. For instance, the notion of timing failure may be refined into early timing failure or late timing failure, depending on whether the service is delivered too early or too late.

A class of failures relating to both value and timing is that of halting failures: system activity, if any, is no longer perceptible to the users. According to how the system interacts with its user(s), such an absence of activity may take the form of a) frozen outputs (a constant-value service is delivered; the constant value delivered may vary according to the application, e.g. last correct value, some predetermined value, etc.), or of b) a silence (no message sent in a distributed system). A system whose failures can only be, or more generally are to an acceptable extent, halting failures, is a fail-halt system; the situations of frozen outputs and of silence lead respectively to fail-passive systems and to fail-silent systems [Powell et al. 88].

The failure perception viewpoint leads one to distinguish, when a system has several users:
• consistent failures: all system users have the same perception of the failures;
• inconsistent failures: the system users may have different perceptions of a given failure; inconsistent failures are usually termed, after [Lamport et al. 82], Byzantine failures.
It is noteworthy that failures of a fail-silent system are consistent, whereas this may not be so for a fail-passive system.

Grading the consequences of the failures upon the system environment enables the failure severities to be defined, via the ordering of the failure modes into severity levels, with which are generally associated maximum admissible probabilities of occurrence. The number, the labeling and the definition of the severity levels, as well as the admissible probabilities of occurrence, are largely dependent upon the application. Two extreme levels can however be defined according to the relation between the benefit (in the broad sense of the term, not limited to economic considerations) provided by the service delivered in the absence of failure and the consequences of failures:
• benign failures, where the consequences are of the same order of magnitude as the benefit provided by service delivery in the absence of failure;
• catastrophic failures, where the consequences are incommensurably greater than the benefit provided by service delivery in the absence of failure.
A system whose failures can only be, or more generally are to an acceptable extent, benign failures is a fail-safe system.

[5] In fact, what has to be recognized is that, while it is most desirable that specifications can be stated at the beginning of, or during, the system development, there are some specifications which can only be derived from the observation of the system in its context and environment.
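The fail-silent behavior just mentioned can be sketched as follows (an illustrative fragment only; the checksum-based self-check and all identifiers are assumptions made for this example): once an internal error is detected, the component sends no further messages, rather than sending erroneous ones.

```python
# Illustrative sketch of a fail-silent component: on a detected internal
# error it stops sending messages altogether, rather than sending bad ones.
# The self-check (a simple checksum over the payload) is an assumption.

def checksum(payload: bytes) -> int:
    return sum(payload) % 256

class FailSilentSender:
    def __init__(self):
        self.silent = False

    def send(self, payload: bytes, stored_checksum: int):
        if self.silent:
            return None                # silence: no message at all
        if checksum(payload) != stored_checksum:
            self.silent = True         # error detected: fall silent
            return None
        return payload                 # correct service delivery

s = FailSilentSender()
good = b"reading=42"
print(s.send(good, checksum(good)))           # delivers the message
print(s.send(b"reading=41", checksum(good)))  # error detected, so silence
print(s.send(good, checksum(good)))           # stays silent thereafter
```

The sketch also illustrates why fail-silent failures are consistent: every user perceives the same thing, namely the absence of messages.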
The notion of failure severity enables the notion of criticality to be defined: the criticality of a system is the highest severity of its (possible) failure modes. The relation between failure modes and failure severities is highly application-dependent. However, there exists a broad class of applications where inoperation is considered as being a naturally safe position (e.g. ground transportation, energy production), whence the direct correspondence which is often made between fail-halt and fail-safe [Mine & Koga 67, Nicolaidis et al. 89]. Fail-halt systems (either fail-passive or fail-silent) and fail-safe systems are however examples of fail-controlled systems, i.e. systems which are designed and realized in order that they may only fail, or may fail to an acceptable extent, according to restrictive modes of failure, e.g. frozen output as opposed to delivering erratic values, silence as opposed to babbling, consistent failures as opposed to inconsistent ones; fail-controlled systems may in addition be defined via imposing some internal state condition or accessibility, as in the so-called fail-stop systems [Schlichting & Schneider 83].
Figure 2 summarizes the failure classes.

[Figure 2 - The failure classes: failures are classified according to their DOMAIN (value failures, timing failures), their PERCEPTION BY SEVERAL USERS (consistent failures, inconsistent failures), and their CONSEQUENCES ON THE ENVIRONMENT (benign failures, catastrophic failures).]

5.2- Errors

An error was defined as being liable to lead to subsequent failure. Whether or not an error will actually lead to a failure depends on three major factors:
1) The system composition, and especially the nature of the existing redundancy:
• intentional redundancy (introduced to provide fault tolerance), which is explicitly intended to prevent an error from leading to failure;
• unintentional redundancy (it is practically difficult, if not impossible, to build a system without any form of redundancy [6]), which may have the same, unexpected, result as intentional redundancy.
2) The system activity: an error may be overwritten before creating damage.
3) The definition of a failure from the user's viewpoint: what is a failure for a given user may be a bearable nuisance for another one. Examples are a) accounting for the user's time granularity: an error which "passes through" the system-user(s) interface may or may not be viewed as a failure depending on the user's time granularity; b) the notion of an "acceptable error rate" in data transmission, which is implicitly tolerated before considering that a failure has occurred.
This discussion explains why it is often desirable to explicitly mention in the specification such conditions as the maximum outage time (related to the user time granularity).

[6] A classical problem in hardware testing is the removal of such "false redundancies", whose effect may be to mask faults, and as such to make the task of test pattern generation more complicated.
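The masking effect of intentional redundancy can be sketched by the classic example of triple modular redundancy with majority voting (an illustration only; the functions below are invented for the example): an error confined to one replica is outvoted and never reaches the delivered service, and hence does not lead to a failure.

```python
# Illustrative sketch: triple modular redundancy (TMR) with majority vote.
# An error in a single replica is masked by the vote and never reaches
# the delivered service, so it does not lead to a failure.
from collections import Counter

def majority_vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError("no majority: the error could not be masked")

def compute(x):          # the intended function
    return x * x

def faulty_compute(x):   # one replica whose state is erroneous
    return x * x + 1

replicas = [compute, compute, faulty_compute]
outputs = [f(7) for f in replicas]   # [49, 49, 50]
print(majority_vote(outputs))        # 49: the error is masked
```

Note that the sketch only covers factor 1) above: whether an error is actually masked also depends on the system activity (factor 2) and on what the user regards as a failure (factor 3).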
5.3- Faults

Faults and their sources are extremely diverse. They can be classified according to five main viewpoints: their phenomenological cause, their nature, their phase of creation or of occurrence, their situation with respect to the system boundaries, and their persistence.

The phenomenological cause leads one to distinguish [Avizienis 78]:
• physical faults, which are due to adverse physical phenomena;
• human-made faults, which result from human imperfections.
The nature of faults leads one to distinguish:
• accidental faults, which appear or are created fortuitously;
• intentional faults, which are created deliberately, with or without a malicious intention.
The phase of creation with respect to the system's life leads one to distinguish:
• development faults, which result from imperfections arising either a) during the development of the system (from requirement specification to implementation) or during subsequent modifications, or b) during the establishment of the procedures for operating or maintaining the system;
• operational faults, which appear during the system's exploitation.
The situation with respect to the system boundaries leads one to distinguish:
• internal faults, which are those parts of the state of a system which, when invoked by the computation activity, will produce an error;
• external faults, which result from interference with, or from interaction with, the system's physical environment (electromagnetic perturbations, radiation, temperature, vibration, etc.) or its human environment.
The temporal persistence leads one to distinguish:
• permanent faults, whose presence is not related to pointwise conditions, whether they be internal (computation activity) or external (environment);
• temporary faults, whose presence is related to such conditions, and which are thus present for a limited amount of time.

It could be argued that introducing the phenomenological causes in the classification criteria of faults may lead recursively "a long way back", e.g.
why do programmers make mistakes? why do integrated circuits fail? The very notion of fault is arbitrary, and is in fact a facility provided for stopping the recursion. Hence the definition given: adjudged or hypothesized cause of an error. This cause may vary depending upon the chosen viewpoint: fault tolerance mechanisms, maintenance engineers, repair shop, developer, semiconductor physicist, etc. In our view, recursion stops at the cause which is intended to be prevented or tolerated. This view