Johnson, B.W. “Fault Tolerance.” The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
93 Fault Tolerance

Barry W. Johnson
University of Virginia

93.1 Introduction
93.2 Hardware Redundancy
93.3 Information Redundancy
93.4 Time Redundancy
93.5 Software Redundancy
93.6 Dependability Evaluation

93.1 Introduction

Fault tolerance is the ability of a system to continue correct performance of its tasks after the occurrence of hardware or software faults. A fault is simply any physical defect, imperfection, or flaw that occurs in hardware or software.

Applications of fault-tolerant computing can be categorized broadly into four primary areas: long-life, critical computations, maintenance postponement, and high availability. The most common examples of long-life applications are unmanned space flight and satellites. Examples of critical-computation applications include aircraft flight control systems, military systems, and certain types of industrial controllers. Maintenance postponement applications appear most frequently when maintenance operations are extremely costly, inconvenient, or difficult to perform. Remote processing stations and certain space applications are good examples. Banking and other time-shared systems are good examples of high-availability applications. Fault tolerance can be achieved in systems by incorporating various forms of redundancy, including hardware, information, time, and software redundancy [Johnson, 1989].

93.2 Hardware Redundancy

The physical replication of hardware is perhaps the most common form of fault tolerance used in systems. As semiconductor components have become smaller and less expensive, the concept of hardware redundancy has become more common and more practical. There are three basic forms of hardware redundancy. First, passive techniques use the concept of fault masking to hide the occurrence of faults and prevent the faults from resulting in errors. Passive approaches are designed to achieve fault tolerance without requiring any action on the part of the system or an operator. Passive techniques, in their most basic form, do not provide for the detection of faults but simply mask the faults. An example of a passive approach is triple modular redundancy (TMR), which is illustrated in Fig. 93.1. In the TMR system three identical units perform identical functions, and a majority vote is performed on the output.

FIGURE 93.1 Fault masking using triple modular redundancy (TMR).

The second form of hardware redundancy is the active approach, which is sometimes called the dynamic method. Active methods achieve fault tolerance by detecting the existence of faults and performing some action to remove the faulty hardware from the system. In other words, active techniques require that the system perform reconfiguration to tolerate faults. Active hardware redundancy uses fault detection, fault location, and fault recovery in an attempt to achieve fault tolerance. An example of an active approach to hardware redundancy is standby sparing, which is illustrated in Fig. 93.2. In standby sparing one or more units operate as spares and replace the primary unit when it fails.

FIGURE 93.2 General concept of standby sparing.
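The majority voter at the heart of TMR can be realized in hardware or software. The following C sketch (the function name and the 32-bit word width are illustrative assumptions, not from the chapter) computes a bit-wise two-out-of-three majority over the three module outputs:

#include <stdint.h>

/* Bit-wise two-out-of-three majority vote over three module outputs.
 * Each result bit takes the value that at least two of the three
 * inputs agree on, so any single faulty module is masked. */
uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

If one module delivers a corrupted word, the other two outvote it bit by bit; the fault is masked without any detection or reconfiguration step, which is exactly the passive behavior described above.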
The final form of hardware redundancy is the hybrid approach. Hybrid techniques combine the attractive features of both the passive and active approaches. Fault masking is used in hybrid systems to prevent erroneous results from being generated. Fault detection, fault location, and fault recovery are also used in the hybrid approaches to improve fault tolerance by removing faulty hardware and replacing it with spares. Providing spares is one form of providing redundancy in a system. Hybrid methods are most often used in the critical-computation applications where fault masking is required to prevent momentary errors, and high reliability must be achieved. The basic concept of the hybrid approach is illustrated in Fig. 93.3; a brief software sketch of the scheme follows the definitions below.

FIGURE 93.3 Hybrid redundancy approach.

93.3 Information Redundancy

Another approach to fault tolerance is to employ redundancy of information. Information redundancy is simply the addition of redundant information to data to allow fault detection, fault masking, or possibly fault tolerance. Good examples of information redundancy are error detecting and error correcting codes, formed by the addition of redundant information to data words or by the mapping of data words into new representations containing redundant information [Lin and Costello, 1983].

In general, a code is a means of representing information, or data, using a well-defined set of rules. A code word is a collection of symbols, often called digits if the symbols are numbers, used to represent a particular piece of data based upon a specified code. A binary code is one in which the symbols forming each code word consist of only the digits 0 and 1. A code word is said to be valid if the code word adheres to all of the rules that define the code; otherwise, the code word is said to be invalid.
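Returning briefly to hardware redundancy, the hybrid scheme of Fig. 93.3 can be sketched in C as follows. This is a minimal sketch only: the data structures, names, and the policy of replacing any module that disagrees with the voted output are illustrative assumptions, not a definitive implementation.

#include <stdint.h>

/* A minimal sketch of hybrid redundancy (Fig. 93.3): the voter masks a
 * fault, the disagreement detector locates the faulty module, and the
 * reconfiguration unit switches in a spare. All names are assumptions. */
typedef struct {
    uint32_t (*run)(uint32_t input);   /* the module's computation */
} Module;

static uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

uint32_t hybrid_step(Module active[3], Module spares[], int *n_spares,
                     uint32_t input)
{
    uint32_t out[3];
    for (int i = 0; i < 3; i++)
        out[i] = active[i].run(input);

    uint32_t voted = majority3(out[0], out[1], out[2]);

    /* Disagreement detection and reconfiguration: a module whose output
     * differs from the voted result is presumed faulty and is replaced
     * by a spare, while the voted (masked) result is passed on. */
    for (int i = 0; i < 3; i++)
        if (out[i] != voted && *n_spares > 0)
            active[i] = spares[--(*n_spares)];

    return voted;
}

The output remains correct during the step in which the fault occurs (masking), and subsequent steps run with a healthy replacement (reconfiguration), combining the passive and active behaviors described earlier.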
The encoding operation is the process of determining the corresponding code word for a particular data item. In other words, the encoding process takes an original data item and represents it as a code word using the rules of the code. The decoding operation is the process of recovering the original data from the code word. In other words, the decoding process takes a code word and determines the data that it represents.

It is possible to create a binary code for which the valid code words are a subset of the total number of possible combinations of 1s and 0s. If the code words are formed correctly, errors introduced into a code word will force it to lie in the range of illegal, or invalid, code words, and the error can be detected. This is the basic concept of the error detecting codes. The basic concept of the error correcting code is that the code word is structured such that it is possible to determine the correct code word from the corrupted, or erroneous, code word.

A fundamental concept in the characterization of codes is the Hamming distance [Hamming, 1950]. The Hamming distance between any two binary words is the number of bit positions in which the two words differ. For example, the binary words 0000 and 0001 differ in only one position and therefore have a Hamming distance of 1. The binary words 0000 and 0101, however, differ in two positions; consequently, their Hamming distance is 2. Clearly, if two words have a Hamming distance of 1, it is possible to change one word into the other simply by modifying one bit in one of the words. If, however, two words differ in two bit positions, it is impossible to transform one word into the other by changing one bit in one of the words.

The Hamming distance gives insight into the requirements of error detecting codes and error correcting codes. We define the distance of a code as the minimum Hamming distance between any two valid code words. If a binary code has a distance of 2, then any single-bit error introduced into a code word will result in the erroneous word being an invalid code word because all valid code words differ in at least two bit positions. If a code has a distance of 3, then any single-bit error or any double-bit error will result in the erroneous word being an invalid code word because all valid code words differ in at least three positions. However, a code distance of 3 allows any single-bit error to be corrected, if it is desired to do so, because the erroneous word with a single-bit error will be a Hamming distance of 1 from the correct code word and at least a Hamming distance of 2 from all others. Consequently, the correct code word can be identified from the corrupted code word.
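The Hamming distance between two binary words is easily computed by exclusive-ORing the words and counting the 1 bits in the result, as the short C sketch below shows (the function name is an illustrative assumption):

#include <stdint.h>

/* Hamming distance: the number of bit positions in which two words
 * differ. XOR leaves a 1 exactly where the words disagree, and the
 * loop counts those 1s. */
int hamming_distance(uint32_t x, uint32_t y)
{
    uint32_t diff = x ^ y;
    int count = 0;
    while (diff != 0) {
        count += (int)(diff & 1u);
        diff >>= 1;
    }
    return count;
}

For the examples above, hamming_distance(0x0, 0x1) returns 1 (0000 versus 0001) and hamming_distance(0x0, 0x5) returns 2 (0000 versus 0101).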
In general, a binary code can correct up to c bit errors and detect an additional d bit errors if and only if

2c + d + 1 ≤ Hd

where Hd is the distance of the code [Nelson and Carroll, 1986]. For example, a code with a distance of 2 cannot provide any error correction but can detect single-bit errors. Similarly, a code with a distance of 3 can correct single-bit errors or detect a double-bit error.

A second fundamental concept of codes is separability. A separable code is one in which the original information is appended with new information to form the code word, thus allowing the decoding process to consist of simply removing the additional information and keeping the original data. In other words, the original data is obtained from the code word by stripping away extra bits, called the code bits or check bits, and retaining only those associated with the original information. A nonseparable code does not possess the property of separability and, consequently, requires more complicated decoding procedures.

Perhaps the simplest form of a code is the parity code. The basic concept of parity is very straightforward, but there are variations on the fundamental idea. Single-bit parity codes require the addition of an extra bit to a binary word such that the resulting code word has either an even number of 1s or an odd number of 1s. If the extra bit results in the total number of 1s in the code word being odd, the code is referred to as odd parity. If the resulting number of 1s in the code word is even, the code is called even parity. If a code word with odd parity experiences a change in one of its bits, the parity will become even. Likewise, if a code word with even parity encounters a single-bit change, the parity will become odd. Consequently, a single-bit error can be detected by checking the number of 1s in the code words. The single-bit parity code (either odd or even) has a distance of 2, therefore allowing any single-bit error to be detected but not corrected. Figure 93.4 illustrates the use of parity coding in a simple memory application.

FIGURE 93.4 Use of parity coding in a memory application.

Arithmetic codes are very useful when it is desired to check arithmetic operations such as addition, multiplication, and division [Avizienis, 1971]. The basic concept is the same as in all coding techniques. The data presented to the arithmetic operation is encoded before the operations are performed. After completing the arithmetic operations, the resulting code words are checked to make sure that they are valid code words. If the resulting code words are not valid, an error condition is signaled. An arithmetic code must be invariant to a set of arithmetic operations. An arithmetic code, A, has the property that A(b*c) = A(b)*A(c), where b and c are operands, * is some arithmetic operation, and A(b) and A(c) are the arithmetic code words for the operands b and c, respectively. Stated verbally, the performance of the arithmetic operation on two arithmetic code words will produce the arithmetic code word of the result of the arithmetic operation. To completely define an arithmetic code, the method of encoding and the arithmetic operations for which the code is invariant must be specified. The most common examples of arithmetic codes are the AN codes, residue codes, and the inverse residue codes.
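A concrete rendering of the even-parity memory scheme of Fig. 93.4 in C (function names and the 8-bit word width are illustrative assumptions): the parity generator runs when a word is written and the parity checker runs when it is read back.

#include <stdint.h>

/* Parity generator: returns the even-parity bit for an 8-bit word,
 * i.e., the bit that makes the total number of 1s in the code word
 * (data plus parity bit) even. */
int parity_generate(uint8_t data)
{
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (data >> i) & 1;
    return ones & 1;   /* 1 when the data alone has an odd number of 1s */
}

/* Parity checker: raises the error signal (returns nonzero) when the
 * data read from memory no longer matches its stored parity bit. */
int parity_check(uint8_t data, int stored_parity)
{
    return parity_generate(data) != stored_parity;
}

A single-bit error in the stored word (or in the parity bit itself) flips the overall parity and raises the error signal; a double-bit error restores the original parity and escapes detection, consistent with the code's distance of 2 noted above.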
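As an illustration of the AN code, the simplest of the arithmetic codes listed above, the sketch below uses the check constant A = 3 (an illustrative choice, not prescribed by the chapter): each operand n is encoded as 3n, addition is then invariant because 3b + 3c = 3(b + c), and any result that is not divisible by 3 signals an error.

#include <stdint.h>

#define A 3u   /* check constant of the AN code; 3 is an illustrative choice */

uint32_t an_encode(uint32_t n)    { return A * n; }        /* data -> code word */
uint32_t an_decode(uint32_t cw)   { return cw / A; }       /* code word -> data */
int      an_is_valid(uint32_t cw) { return cw % A == 0; }

/* Checked addition: the operation is performed directly on code words,
 * since A(b) + A(c) = A(b + c), and the result is tested for validity.
 * Returns 0 on success, -1 to signal an error condition. */
int an_checked_add(uint32_t cb, uint32_t cc, uint32_t *sum_cw)
{
    uint32_t s = cb + cc;
    if (!an_is_valid(s))
        return -1;
    *sum_cw = s;
    return 0;
}

Note that this AN code is nonseparable in the sense discussed earlier: the original data cannot be recovered by simply stripping bits from the code word; a division by A is required.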