Guy, C. G. "Computer Reliability." The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
98 Computer Reliability

Chris G. Guy
University of Reading

98.1 Introduction
98.2 Definitions of Failure, Fault, and Error
98.3 Failure Rate and Reliability
98.4 Relationship Between Reliability and Failure Rate
98.5 Mean Time to Failure
98.6 Mean Time to Repair
98.7 Mean Time Between Failures
98.8 Availability
98.9 Calculation of Computer System Reliability
98.10 Markov Modeling
98.11 Software Reliability
98.12 Reliability Calculations for Real Systems

98.1 Introduction

This chapter outlines the knowledge needed to estimate the reliability of any electronic system or subsystem within a computer. The word estimate was used in the first sentence to emphasize that the following calculations, even if carried out perfectly correctly, can provide no guarantee that a particular example of a piece of electronic equipment will work for any length of time. However, they can provide a reasonable guide to the probability that something will function as expected over a given time period. The first step in estimating the reliability of a computer system is to determine the likelihood of failure of each of the individual components, such as resistors, capacitors, integrated circuits, and connectors, that make up the system. This information can then be used in a full system analysis.

98.2 Definitions of Failure, Fault, and Error

A failure occurs when a system or component does not perform as expected. Examples of failures at the component level could be a base-emitter short in a transistor somewhere within a large integrated circuit or a solder joint going open circuit because of vibrations. If a component experiences a failure, it may cause a fault, leading to an error, which may lead to a system failure. A fault may be either the outward manifestation of a component failure or a design fault. Component failure may be caused by internal physical phenomena or by external environmental effects such as electromagnetic fields or power supply variations.

Design faults may be divided into two classes. The first class of design fault is caused by using components outside their rated specification. It should be possible to eliminate this class of faults by careful design checking. The second class, which is characteristic of large digital circuits such as those found in computer systems, is caused by the designer not taking into account every logical condition that could occur during system operation. All computer systems have a software component as an integral part of their operation, and software is especially prone to this kind of design fault.
A fault may be permanent or transitory. Examples of permanent faults are short or open circuits within a component caused by physical failures. Transitory faults can be subdivided further into two classes. The first, usually called transient faults, are caused by such things as alpha-particle radiation or power supply variations. Large random access memory circuits are particularly prone to this kind of fault. By definition, a transient fault is not caused by physical damage to the hardware. The second class is usually called intermittent faults. These faults are temporary but reoccur in an unpredictable manner. They are caused by loose physical connections between components or by components used at the limits of their specification. Intermittent faults often become permanent faults after a period of time.

A fault may be active or inactive. For example, if a fault causes the output of a digital component to be stuck at logic 1, and the desired output is logic 1, then this would be classed as an inactive fault. Once the desired output becomes logic 0, then the fault becomes active.

The consequence for the system operation of a fault is an error. As the error may be caused by a permanent or by a transitory fault, it may be classed as a hard error or a soft error. An error in an individual subsystem may be due to a fault in that subsystem or to the propagation of an error from another part of the overall system. The terms fault and error are sometimes interchanged. The term failure is often used to mean anything covered by these definitions. The definitions given here are those in most common usage.

Physical faults within a component can be characterized by their external electrical effects. These effects are commonly classified into fault models. The intention of any fault model is to take into account every possible failure mechanism, so that the effects on the system can be worked out. The manifestation of faults in a system can be classified according to the likely effects, producing an error model. The purpose of error models is to try to establish what kinds of corrective action need be taken in order to effect repairs.

98.3 Failure Rate and Reliability

An individual component may fail after a random time, so it is impossible to predict any pattern of failure from one example. It is possible, however, to estimate the rate at which members of a group of identical components will fail. This rate can be determined by experimental means using accelerated life tests. In a normal operating environment, the time for a statistically significant number of failures to have occurred in a group of modern digital components could be tens or even hundreds of years. Consequently, the manufacturers must make the environment for the tests extremely unfavorable in order to produce failures in a few hours or days and then extrapolate back to produce the likely number of failures in a normal environment. The failure rate is then defined as the number of failures per unit time, in a given environment, compared with the number of surviving components. It is usually expressed as a number of failures per million hours.

If f(t) is the number of components that have failed up to time t, and s(t) is the number of components that have survived, then z(t), the failure rate or hazard rate, is defined as

z(t) = (1/s(t)) · df(t)/dt    (98.1)

Most electronic components will exhibit a variation of failure rate with time. Many studies have shown that this variation can often be approximated to the pattern shown in Fig. 98.1. For obvious reasons this is known as a bathtub curve.

FIGURE 98.1 Variation of failure rate with time.

The first phase, where the failure rate starts high but is decreasing with time, is where the components are suffering infant mortality; in other words, those that had manufacturing defects are failing. This is often called the burn-in phase. The second part, where the failure rate is roughly constant, is the useful life period of operation for the component. The final part, where the failure rate is increasing with time, is where the components are starting to wear out.
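The hazard rate of Eq. (98.1) can be estimated numerically from life-test records. The following Python fragment is a minimal sketch of such an estimate; the function name, the failure times, and the interval are illustrative assumptions, not data from any real test.

# Minimal sketch (not from this chapter) of estimating the hazard rate of
# Eq. (98.1), z(t) = (1/s(t)) * df(t)/dt, from a list of recorded failure times.
# All names and numbers are assumed illustration values.

def hazard_rate(failure_times, n_total, t, dt):
    """Discrete estimate of z(t) over the interval [t, t + dt)."""
    failed_before = sum(1 for ft in failure_times if ft <= t)      # f(t)
    surviving = n_total - failed_before                            # s(t), survivors at time t
    failed_in_interval = sum(1 for ft in failure_times if t < ft <= t + dt)
    if surviving == 0:
        return float("nan")                                        # no survivors left to fail
    # (1/s(t)) * (delta f / delta t) approximates (1/s(t)) * df(t)/dt
    return failed_in_interval / (surviving * dt)

# Hypothetical accelerated-life-test record: 3 failures among 1000 parts.
times = [120.0, 450.0, 800.0]                                      # hours
print(hazard_rate(times, n_total=1000, t=0.0, dt=1000.0))          # 3e-06 per hour, i.e., 3 per million hours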
Using the same nomenclature as before, if

s(t) + f(t) = N    (98.2)

i.e., N is the total number of components in the test, then the reliability r(t) is defined as

r(t) = s(t)/N    (98.3)

or in words, and using the definition from the IEEE Standard Dictionary of Electrical and Electronic Terms, reliability is the probability that a device will function without failure over a specified time period or amount of usage, under stated conditions.

98.4 Relationship Between Reliability and Failure Rate

Using Eqs. (98.1), (98.2), and (98.3), then

z(t) = -(N/s(t)) · dr(t)/dt    (98.4)

λ is commonly used as the symbol for the failure rate z(t) in the period where it is a constant, i.e., the useful life of the component. Consequently, we may write Eq. (98.4) as

λ = -(1/r(t)) · dr(t)/dt    (98.5)

Rewriting, integrating, and using the limits of integration as r(t) = 1 at t = 0 and r(t) = 0 at t = ∞ gives the result

r(t) = e^(-λt)    (98.6)

This result is true only for the period of operation where the failure rate is a constant. For most common components, real failure rates can be obtained from such handbooks as the American military MIL-HDBK-217E, as explained in Section 98.12.
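As a quick illustration of Eq. (98.6), the short sketch below converts a constant failure rate quoted in failures per million hours into the reliability over a stated operating time. The failure rate and mission time are assumed example figures, not values taken from the handbook.

import math

# Minimal sketch of evaluating r(t) = e^(-lambda*t), Eq. (98.6), for a
# constant failure rate. The figures used are assumed illustrative values.

def reliability(failures_per_million_hours, hours):
    """Probability of surviving `hours` of operation at a constant failure rate."""
    lam = failures_per_million_hours / 1.0e6      # convert to failures per hour
    return math.exp(-lam * hours)

print(reliability(2.0, 10_000))                   # about 0.980: roughly a 98% chance of surviving 10,000 h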
It must also be borne in mind that the calculated reliability is a probability function based on lifetime tests. There can be no guarantee that any batch of components will exhibit the same failure rate and hence reliability as those predicted because of variations in manufacturing conditions. Even if the components were made at the same factory as those tested, the process used might have been slightly different and the equipment will be older. Quality assurance standards are imposed on companies to try to guarantee that they meet minimum manufacturing standards, but some cases in the United States have shown that even the largest plants can fall short of these standards.

98.5 Mean Time to Failure

A figure that is commonly quoted because it gives a readier feel for the system performance is the mean time to failure or MTTF. This is defined as

MTTF = ∫₀^∞ r(t) dt    (98.7)

Hence, for the period where the failure rate is constant:

MTTF = 1/λ    (98.8)

98.6 Mean Time to Repair

For many computer systems it is possible to define a mean time to repair (MTTR). This will be a function of a number of things, including the time taken to detect the failure, the time taken to isolate and replace the faulty component, and the time taken to verify that the system is operating correctly again. While the MTTF is a function of the system design and the operating environment, the MTTR is often a function of unpredictable human factors and, hence, is difficult to quantify.

Figures used for MTTR for a given system in a fixed situation could be predictions based on the experience of the reliability engineers or could be simply the maximum response time given in the maintenance contract for a computer. In either case, MTTR predictions may be subject to some fluctuations. To take an extreme example, if the service engineer has a flat tire while on the way to effect the repair, then the repair time may be many times the predicted MTTR. For some systems no MTTR can be predicted, as they are in situations that make repair impossible or uneconomic. Computers in satellites are a good example. In these cases and all others where no errors in the output can be allowed, fault tolerant approaches must be used in order to extend the MTTF beyond the desired system operational lifetime.

98.7 Mean Time Between Failures

For systems where repair is possible, a figure for the expected time between failures can be defined as

MTBF = MTTF + MTTR    (98.9)

The definitions given for MTTF and MTBF are the most commonly accepted ones. In some texts, MTBF is wrongly used as mean time before failure, confusing it with MTTF. In many real systems, MTTF is very much greater than MTTR, so the values of MTTF and MTBF will be almost identical, in any case.

98.8 Availability

Availability is defined as the probability that the system will be functioning at a given time during its normal working period.
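The figures defined in Sections 98.5 to 98.8 are straightforward to compute once a constant failure rate is assumed. The sketch below is an illustrative example only: the failure rate and repair time are invented, and the steady-state expression A = MTTF/(MTTF + MTTR) is the usual textbook availability formula, quoted here as an assumption since this section defines availability only in words.

# Minimal sketch relating MTTF, MTBF, and availability for a repairable system
# with constant failure rate. All numbers are assumed illustrative values.

def mttf_from_failure_rate(failures_per_hour):
    """MTTF = 1/lambda, Eq. (98.8), valid while the failure rate is constant."""
    return 1.0 / failures_per_hour

def mtbf(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR, Eq. (98.9), for systems where repair is possible."""
    return mttf_hours + mttr_hours

def steady_state_availability(mttf_hours, mttr_hours):
    """Usual steady-state expression A = MTTF / (MTTF + MTTR); assumed, not given in this section."""
    return mttf_hours / (mttf_hours + mttr_hours)

lam = 2.0e-6                                               # 2 failures per million hours (assumed)
mttf = mttf_from_failure_rate(lam)                         # 500,000 hours
print(mtbf(mttf, mttr_hours=24.0))                         # 500,024 hours
print(steady_state_availability(mttf, mttr_hours=24.0))    # roughly 0.99995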