Johnson, B.W. “Fault Tolerance.” The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
93 Fault Tolerance

Barry W. Johnson
University of Virginia

93.1 Introduction
93.2 Hardware Redundancy
93.3 Information Redundancy
93.4 Time Redundancy
93.5 Software Redundancy
93.6 Dependability Evaluation

93.1 Introduction

Fault tolerance is the ability of a system to continue correct performance of its tasks after the occurrence of hardware or software faults. A fault is simply any physical defect, imperfection, or flaw that occurs in hardware or software.

Applications of fault-tolerant computing can be categorized broadly into four primary areas: long-life, critical computations, maintenance postponement, and high availability. The most common examples of long-life applications are unmanned space flight and satellites. Examples of critical-computation applications include aircraft flight control systems, military systems, and certain types of industrial controllers. Maintenance postponement applications appear most frequently when maintenance operations are extremely costly, inconvenient, or difficult to perform. Remote processing stations and certain space applications are good examples. Banking and other time-shared systems are good examples of high-availability applications. Fault tolerance can be achieved in systems by incorporating various forms of redundancy, including hardware, information, time, and software redundancy [Johnson, 1989].

93.2 Hardware Redundancy

The physical replication of hardware is perhaps the most common form of fault tolerance used in systems. As semiconductor components have become smaller and less expensive, the concept of hardware redundancy has become more common and more practical. There are three basic forms of hardware redundancy. First, passive techniques use the concept of fault masking to hide the occurrence of faults and prevent the faults from resulting in errors. Passive approaches are designed to achieve fault tolerance without requiring any action on the part of the system or an operator. Passive techniques, in their most basic form, do not provide for the detection of faults but simply mask the faults. An example of a passive approach is triple modular redundancy (TMR), which is illustrated in Fig. 93.1. In the TMR system three identical units perform identical functions, and a majority vote is performed on the output.

FIGURE 93.1 Fault masking using triple modular redundancy (TMR).

The second form of hardware redundancy is the active approach, which is sometimes called the dynamic method. Active methods achieve fault tolerance by detecting the existence of faults and performing some action to remove the faulty hardware from the system. In other words, active techniques require that the system perform reconfiguration to tolerate faults. Active hardware redundancy uses fault detection, fault location, and fault recovery in an attempt to achieve fault tolerance. An example of an active approach to hardware redundancy is standby sparing, which is illustrated in Fig. 93.2. In standby sparing one or more units operate as spares and replace the primary unit when it fails.

FIGURE 93.2 General concept of standby sparing.
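The majority voter at the heart of TMR can be realized in hardware or software. The following C sketch (the function name and the 32-bit word width are illustrative assumptions, not from the chapter) computes a bit-wise two-out-of-three majority over the three module outputs:

#include <stdint.h>

/* Bit-wise two-out-of-three majority vote over three module outputs.
 * Each result bit takes the value that at least two of the three
 * inputs agree on, so any single faulty module is masked. */
uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

If one module delivers a corrupted word, the other two outvote it bit by bit; the fault is masked without any detection or reconfiguration step, which is exactly the passive behavior described above.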
The final form of hardware redundancy is the hybrid approach. Hybrid techniques combine the attractive features of both the passive and active approaches. Fault masking is used in hybrid systems to prevent erroneous results from being generated. Fault detection, fault location, and fault recovery are also used in the hybrid approaches to improve fault tolerance by removing faulty hardware and replacing it with spares. Providing spares is one form of providing redundancy in a system. Hybrid methods are most often used in the critical-computation applications where fault masking is required to prevent momentary errors, and high reliability must be achieved. The basic concept of the hybrid approach is illustrated in Fig. 93.3; a brief software sketch of the scheme follows the definitions below.

FIGURE 93.3 Hybrid redundancy approach.

93.3 Information Redundancy

Another approach to fault tolerance is to employ redundancy of information. Information redundancy is simply the addition of redundant information to data to allow fault detection, fault masking, or possibly fault tolerance. Good examples of information redundancy are error detecting and error correcting codes, formed by the addition of redundant information to data words or by the mapping of data words into new representations containing redundant information [Lin and Costello, 1983].

In general, a code is a means of representing information, or data, using a well-defined set of rules. A code word is a collection of symbols, often called digits if the symbols are numbers, used to represent a particular piece of data based upon a specified code. A binary code is one in which the symbols forming each code word consist of only the digits 0 and 1. A code word is said to be valid if the code word adheres to all of the rules that define the code; otherwise, the code word is said to be invalid.
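Returning briefly to hardware redundancy, the hybrid scheme of Fig. 93.3 can be sketched in C as follows. This is a minimal sketch only: the data structures, names, and the policy of replacing any module that disagrees with the voted output are illustrative assumptions, not a definitive implementation.

#include <stdint.h>

/* A minimal sketch of hybrid redundancy (Fig. 93.3): the voter masks a
 * fault, the disagreement detector locates the faulty module, and the
 * reconfiguration unit switches in a spare. All names are assumptions. */
typedef struct {
    uint32_t (*run)(uint32_t input);   /* the module's computation */
} Module;

static uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

uint32_t hybrid_step(Module active[3], Module spares[], int *n_spares,
                     uint32_t input)
{
    uint32_t out[3];
    for (int i = 0; i < 3; i++)
        out[i] = active[i].run(input);

    uint32_t voted = majority3(out[0], out[1], out[2]);

    /* Disagreement detection and reconfiguration: a module whose output
     * differs from the voted result is presumed faulty and is replaced
     * by a spare, while the voted (masked) result is passed on. */
    for (int i = 0; i < 3; i++)
        if (out[i] != voted && *n_spares > 0)
            active[i] = spares[--(*n_spares)];

    return voted;
}

The output remains correct during the step in which the fault occurs (masking), and subsequent steps run with a healthy replacement (reconfiguration), combining the passive and active behaviors described earlier.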
The encoding operation is the process of determining the corresponding code word for a particular data item. In other words, the encoding process takes an original data item and represents it as a code word using the rules of the code. The decoding operation is the process of recovering the original data from the code word. In other words, the decoding process takes a code word and determines the data that it represents.

It is possible to create a binary code for which the valid code words are a subset of the total number of possible combinations of 1s and 0s. If the code words are formed correctly, errors introduced into a code word will force it to lie in the range of illegal, or invalid, code words, and the error can be detected. This is the basic concept of the error detecting codes. The basic concept of the error correcting code is that the code word is structured such that it is possible to determine the correct code word from the corrupted, or erroneous, code word.

A fundamental concept in the characterization of codes is the Hamming distance [Hamming, 1950]. The Hamming distance between any two binary words is the number of bit positions in which the two words differ. For example, the binary words 0000 and 0001 differ in only one position and therefore have a Hamming distance of 1. The binary words 0000 and 0101, however, differ in two positions; consequently, their Hamming distance is 2. Clearly, if two words have a Hamming distance of 1, it is possible to change one word into the other simply by modifying one bit in one of the words. If, however, two words differ in two bit positions, it is impossible to transform one word into the other by changing one bit in one of the words.

The Hamming distance gives insight into the requirements of error detecting codes and error correcting codes. We define the distance of a code as the minimum Hamming distance between any two valid code words. If a binary code has a distance of 2, then any single-bit error introduced into a code word will result in the erroneous word being an invalid code word because all valid code words differ in at least two bit positions. If a code has a distance of 3, then any single-bit error or any double-bit error will result in the erroneous word being an invalid code word because all valid code words differ in at least three positions. However, a code distance of 3 allows any single-bit error to be corrected, if it is desired to do so, because the erroneous word with a single-bit error will be a Hamming distance of 1 from the correct code word and at least a Hamming distance of 2 from all others. Consequently, the correct code word can be identified from the corrupted code word.
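The Hamming distance between two binary words is easily computed by exclusive-ORing the words and counting the 1 bits in the result, as the short C sketch below shows (the function name is an illustrative assumption):

#include <stdint.h>

/* Hamming distance: the number of bit positions in which two words
 * differ. XOR leaves a 1 exactly where the words disagree, and the
 * loop counts those 1s. */
int hamming_distance(uint32_t x, uint32_t y)
{
    uint32_t diff = x ^ y;
    int count = 0;
    while (diff != 0) {
        count += (int)(diff & 1u);
        diff >>= 1;
    }
    return count;
}

For the examples above, hamming_distance(0x0, 0x1) returns 1 (0000 versus 0001) and hamming_distance(0x0, 0x5) returns 2 (0000 versus 0101).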
In general, a binary code can correct up to c bit errors and detect an additional d bit errors if and only if

2c + d + 1 ≤ Hd

where Hd is the distance of the code [Nelson and Carroll, 1986]. For example, a code with a distance of 2 cannot provide any error correction but can detect single-bit errors. Similarly, a code with a distance of 3 can correct single-bit errors or detect a double-bit error.

A second fundamental concept of codes is separability. A separable code is one in which the original information is appended with new information to form the code word, thus allowing the decoding process to consist of simply removing the additional information and keeping the original data. In other words, the original data is obtained from the code word by stripping away extra bits, called the code bits or check bits, and retaining only those associated with the original information. A nonseparable code does not possess the property of separability and, consequently, requires more complicated decoding procedures.

Perhaps the simplest form of a code is the parity code. The basic concept of parity is very straightforward, but there are variations on the fundamental idea. Single-bit parity codes require the addition of an extra bit to a binary word such that the resulting code word has either an even number of 1s or an odd number of 1s. If the extra bit results in the total number of 1s in the code word being odd, the code is referred to as odd parity. If the resulting number of 1s in the code word is even, the code is called even parity. If a code word with odd parity experiences a change in one of its bits, the parity will become even. Likewise, if a code word with even parity encounters a single-bit change, the parity will become odd. Consequently, a single-bit error can be detected by checking the number of 1s in the code words. The single-bit parity code (either odd or even) has a distance of 2, therefore allowing any single-bit error to be detected but not corrected. Figure 93.4 illustrates the use of parity coding in a simple memory application.

FIGURE 93.4 Use of parity coding in a memory application.

Arithmetic codes are very useful when it is desired to check arithmetic operations such as addition, multiplication, and division [Avizienis, 1971]. The basic concept is the same as in all coding techniques. The data presented to the arithmetic operation is encoded before the operations are performed. After completing the arithmetic operations, the resulting code words are checked to make sure that they are valid code words. If the resulting code words are not valid, an error condition is signaled. An arithmetic code must be invariant to a set of arithmetic operations. An arithmetic code, A, has the property that A(b*c) = A(b)*A(c), where b and c are operands, * is some arithmetic operation, and A(b) and A(c) are the arithmetic code words for the operands b and c, respectively. Stated verbally, the performance of the arithmetic operation on two arithmetic code words will produce the arithmetic code word of the result of the arithmetic operation. To completely define an arithmetic code, the method of encoding and the arithmetic operations for which the code is invariant must be specified. The most common examples of arithmetic codes are the AN codes, residue codes, and the inverse residue codes.
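A concrete rendering of the even-parity memory scheme of Fig. 93.4 in C (function names and the 8-bit word width are illustrative assumptions): the parity generator runs when a word is written and the parity checker runs when it is read back.

#include <stdint.h>

/* Parity generator: returns the even-parity bit for an 8-bit word,
 * i.e., the bit that makes the total number of 1s in the code word
 * (data plus parity bit) even. */
int parity_generate(uint8_t data)
{
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (data >> i) & 1;
    return ones & 1;   /* 1 when the data alone has an odd number of 1s */
}

/* Parity checker: raises the error signal (returns nonzero) when the
 * data read from memory no longer matches its stored parity bit. */
int parity_check(uint8_t data, int stored_parity)
{
    return parity_generate(data) != stored_parity;
}

A single-bit error in the stored word (or in the parity bit itself) flips the overall parity and raises the error signal; a double-bit error restores the original parity and escapes detection, consistent with the code's distance of 2 noted above.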
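As an illustration of the AN code, the simplest of the arithmetic codes listed above, the sketch below uses the check constant A = 3 (an illustrative choice, not prescribed by the chapter): each operand n is encoded as 3n, addition is then invariant because 3b + 3c = 3(b + c), and any result that is not divisible by 3 signals an error.

#include <stdint.h>

#define A 3u   /* check constant of the AN code; 3 is an illustrative choice */

uint32_t an_encode(uint32_t n)    { return A * n; }        /* data -> code word */
uint32_t an_decode(uint32_t cw)   { return cw / A; }       /* code word -> data */
int      an_is_valid(uint32_t cw) { return cw % A == 0; }

/* Checked addition: the operation is performed directly on code words,
 * since A(b) + A(c) = A(b + c), and the result is tested for validity.
 * Returns 0 on success, -1 to signal an error condition. */
int an_checked_add(uint32_t cb, uint32_t cc, uint32_t *sum_cw)
{
    uint32_t s = cb + cc;
    if (!an_is_valid(s))
        return -1;
    *sum_cw = s;
    return 0;
}

Note that this AN code is nonseparable in the sense discussed earlier: the original data cannot be recovered by simply stripping bits from the code word; a division by A is required.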