Suppose we have a system of constraints on possible sequences of the type which can be represented by a linear graph as in Fig. 2. If probabilities $p_{ij}^{(s)}$ were assigned to the various lines connecting state $i$ to state $j$ this would become a source. There is one particular assignment which maximizes the resulting entropy (see Appendix 4).

Theorem 8: Let the system of constraints considered as a channel have a capacity $C = \log W$. If we assign
$$p_{ij}^{(s)} = \frac{B_j}{B_i} W^{-\ell_{ij}^{(s)}}$$
where $\ell_{ij}^{(s)}$ is the duration of the $s$th symbol leading from state $i$ to state $j$ and the $B_i$ satisfy
$$B_i = \sum_{s,j} B_j W^{-\ell_{ij}^{(s)}}$$
then $H$ is maximized and equal to $C$.

By proper assignment of the transition probabilities the entropy of symbols on a channel can be maximized at the channel capacity.
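As an illustrative sketch, the assignment of Theorem 8 can be checked numerically in the single-state case, where $B_j/B_i = 1$ and the maximizing probabilities reduce to $p_s = W^{-\ell_s}$. The symbol durations below are hypothetical; the script finds $W$ from $\sum_s W^{-\ell_s} = 1$ by bisection and verifies that the resulting entropy per unit time equals $C = \log W$.

```python
import math

# Hypothetical single-state channel: four symbols with these durations.
durations = [1, 2, 3, 3]

def find_W(durations, lo=1.0 + 1e-9, hi=16.0, iters=200):
    """Largest real root of sum_s W**(-t_s) = 1, found by bisection
    (the left side decreases in W, so the root is unique for W > 1)."""
    f = lambda W: sum(W ** -t for t in durations) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

W = find_W(durations)
C = math.log2(W)                     # channel capacity in bits per unit time

# Theorem 8 assignment in the single-state case: p_s = W**(-t_s).
p = [W ** -t for t in durations]

H_per_symbol = -sum(ps * math.log2(ps) for ps in p)          # bits per symbol
mean_duration = sum(ps * t for ps, t in zip(p, durations))   # time per symbol
print(f"C = {C:.6f}")
print(f"entropy per unit time = {H_per_symbol / mean_duration:.6f}")  # equals C
```

For the durations chosen, $W = 2$ and the maximizing probabilities come out $\frac12, \frac14, \frac18, \frac18$, the source used again in the example of Section 10 below.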
9. THE FUNDAMENTAL THEOREM FOR A NOISELESS CHANNEL

We will now justify our interpretation of $H$ as the rate of generating information by proving that $H$ determines the channel capacity required with most efficient coding.

Theorem 9: Let a source have entropy $H$ (bits per symbol) and a channel have a capacity $C$ (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate $\frac{C}{H} - \epsilon$ symbols per second over the channel, where $\epsilon$ is arbitrarily small. It is not possible to transmit at an average rate greater than $\frac{C}{H}$.

The converse part of the theorem, that $\frac{C}{H}$ cannot be exceeded, may be proved by noting that the entropy of the channel input per second is equal to that of the source, since the transmitter must be non-singular, and also this entropy cannot exceed the channel capacity. Hence $H' \le C$ and the number of symbols per second $= H'/H \le C/H$.

The first part of the theorem will be proved in two different ways. The first method is to consider the set of all sequences of $N$ symbols produced by the source. For $N$ large we can divide these into two groups, one containing less than $2^{(H+\eta)N}$ members and the second containing less than $2^{RN}$ members (where $R$ is the logarithm of the number of different symbols) and having a total probability less than $\mu$. As $N$ increases $\eta$ and $\mu$ approach zero. The number of signals of duration $T$ in the channel is greater than $2^{(C-\theta)T}$ with $\theta$ small when $T$ is large. If we choose
$$T = \left(\frac{H}{C} + \lambda\right) N$$
then there will be a sufficient number of sequences of channel symbols for the high probability group when $N$ and $T$ are sufficiently large (however small $\lambda$) and also some additional ones. The high probability group is coded in an arbitrary one-to-one way into this set. The remaining sequences are represented by larger sequences, starting and ending with one of the sequences not used for the high probability group. This special sequence acts as a start and stop signal for a different code. In between a sufficient time is allowed to give enough different sequences for all the low probability messages. This will require
$$T_1 = \left(\frac{R}{C} + \varphi\right) N$$
where $\varphi$ is small. The mean rate of transmission in message symbols per second will then be greater than
$$\left[(1-\delta)\frac{T}{N} + \delta\frac{T_1}{N}\right]^{-1} = \left[(1-\delta)\left(\frac{H}{C} + \lambda\right) + \delta\left(\frac{R}{C} + \varphi\right)\right]^{-1}.$$
As $N$ increases $\delta$, $\lambda$ and $\varphi$ approach zero and the rate approaches $\frac{C}{H}$.
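The two groups in this argument can be exhibited numerically for a simple source. The sketch below, assuming a hypothetical binary memoryless source with $p = 0.1$, collects the sequences whose per-symbol self-information lies within $\eta$ of $H$; their number is at most $2^{(H+\eta)N}$ while their total probability approaches 1 as $N$ grows, which is exactly the division used above.

```python
import math

def high_probability_group(N, p, eta):
    """Count and total probability of the length-N sequences from a binary
    memoryless source whose per-symbol self-information is within eta of H."""
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    count, prob = 0, 0.0
    for k in range(N + 1):                      # k = number of 1's
        info = -(k * math.log2(p) + (N - k) * math.log2(1 - p)) / N
        if abs(info - H) <= eta:
            c = math.comb(N, k)
            count += c
            prob += c * (p ** k) * ((1 - p) ** (N - k))
    return count, prob, H

p, eta = 0.1, 0.05            # hypothetical source and tolerance
for N in (100, 400, 1600):
    count, prob, H = high_probability_group(N, p, eta)
    # The group has at most 2**((H + eta) * N) members, and its total
    # probability tends to 1 as N increases.
    print(f"N={N:5d}: log2(count)/N = {math.log2(count)/N:.3f} "
          f"(H + eta = {H + eta:.3f}), probability = {prob:.3f}")
```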
Another method of performing this coding and thereby proving the theorem can be described as follows: Arrange the messages of length $N$ in order of decreasing probability and suppose their probabilities are $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_n$. Let $P_s = \sum_{i=1}^{s-1} p_i$; that is, $P_s$ is the cumulative probability up to, but not including, $p_s$. We first encode into a binary system. The binary code for message $s$ is obtained by expanding $P_s$ as a binary number. The expansion is carried out to $m_s$ places, where $m_s$ is the integer satisfying:
$$\log_2 \frac{1}{p_s} \le m_s < 1 + \log_2 \frac{1}{p_s}.$$
Thus the messages of high probability are represented by short codes and those of low probability by long codes. From these inequalities we have
$$\frac{1}{2^{m_s}} \le p_s < \frac{1}{2^{m_s - 1}}.$$
The code for $P_s$ will differ from all succeeding ones in one or more of its $m_s$ places, since all the remaining $P_i$ are at least $\frac{1}{2^{m_s}}$ larger and their binary expansions therefore differ in the first $m_s$ places. Consequently all the codes are different and it is possible to recover the message from its code. If the channel sequences are not already sequences of binary digits, they can be ascribed binary numbers in an arbitrary fashion and the binary code thus translated into signals suitable for the channel.

The average number $H'$ of binary digits used per symbol of original message is easily estimated. We have
$$H' = \frac{1}{N} \sum m_s p_s.$$
But,
$$\frac{1}{N} \sum \left(\log_2 \frac{1}{p_s}\right) p_s \le \frac{1}{N} \sum m_s p_s < \frac{1}{N} \sum \left(1 + \log_2 \frac{1}{p_s}\right) p_s$$
and therefore,
$$G_N \le H' < G_N + \frac{1}{N}.$$
As $N$ increases $G_N$ approaches $H$, the entropy of the source, and $H'$ approaches $H$.

We see from this that the inefficiency in coding, when only a finite delay of $N$ symbols is used, need not be greater than $\frac{1}{N}$ plus the difference between the true entropy $H$ and the entropy $G_N$ calculated for sequences of length $N$. The per cent excess time needed over the ideal is therefore less than
$$\frac{G_N}{H} + \frac{1}{HN} - 1.$$
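The construction just described can be carried out directly. The sketch below, for a hypothetical set of message probabilities, sorts them in decreasing order, takes $m_s = \lceil \log_2(1/p_s) \rceil$, and reads off the first $m_s$ binary places of the cumulative probability $P_s$; the resulting codewords are all distinct and their average length exceeds the entropy by less than one binary digit.

```python
import math

def shannon_code(probs):
    """Codeword for each message: the first m_s binary places of the
    cumulative probability P_s, with m_s = ceil(log2(1/p_s))."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    codes, P = {}, 0.0
    for i in order:
        m = math.ceil(math.log2(1 / probs[i]))
        bits, frac = [], P
        for _ in range(m):                 # binary expansion of P to m places
            frac *= 2
            bit, frac = (1, frac - 1) if frac >= 1 else (0, frac)
            bits.append(str(bit))
        codes[i] = "".join(bits)
        P += probs[i]
    return codes

probs = [0.4, 0.3, 0.15, 0.1, 0.05]        # hypothetical message probabilities
codes = shannon_code(probs)
H = -sum(p * math.log2(p) for p in probs)
avg = sum(p * len(codes[i]) for i, p in enumerate(probs))
print(codes)          # {0: '00', 1: '01', 2: '101', 3: '1101', 4: '11110'}
print(f"H = {H:.3f}, average length = {avg:.3f} (less than H + 1)")
```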
This method of encoding is substantially the same as one found independently by R. M. Fano.⁹ His method is to arrange the messages of length $N$ in order of decreasing probability. Divide this series into two groups of as nearly equal probability as possible. If the message is in the first group its first binary digit will be 0, otherwise 1. The groups are similarly divided into subsets of nearly equal probability and the particular subset determines the second binary digit. This process is continued until each subset contains only one message. It is easily seen that apart from minor differences (generally in the last digit) this amounts to the same thing as the arithmetic process described above.

⁹ Technical Report No. 65, The Research Laboratory of Electronics, M.I.T., March 17, 1949.
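Fano's splitting procedure is equally short to write down. The sketch below applies it to the same hypothetical probabilities used above; for that distribution it produces the codewords 0, 10, 110, 1110, 1111, averaging 2.05 binary digits per message against an entropy of about 2.01.

```python
def fano_code(probs):
    """Fano's construction: split a probability-sorted list into two parts of
    as nearly equal total probability as possible, assign 0 and 1, recurse."""
    items = sorted(enumerate(probs), key=lambda x: -x[1])
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total, running = sum(p for _, p in group), 0.0
        best, cut = float("inf"), 1
        for k in range(1, len(group)):         # choose the most even split
            running += group[k - 1][1]
            if abs(total - 2 * running) < best:
                best, cut = abs(total - 2 * running), k
        split(group[:cut], prefix + "0")
        split(group[cut:], prefix + "1")

    split(items, "")
    return codes

probs = [0.4, 0.3, 0.15, 0.1, 0.05]            # same hypothetical distribution
print(fano_code(probs))    # {0: '0', 1: '10', 2: '110', 3: '1110', 4: '1111'}
```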
10. DISCUSSION AND EXAMPLES

In order to obtain the maximum power transfer from a generator to a load, a transformer must in general be introduced so that the generator as seen from the load has the load resistance. The situation here is roughly analogous. The transducer which does the encoding should match the source to the channel in a statistical sense. The source as seen from the channel through the transducer should have the same statistical structure as the source which maximizes the entropy in the channel. The content of Theorem 9 is that, although an exact match is not in general possible, we can approximate it as closely as desired. The ratio of the actual rate of transmission to the capacity $C$ may be called the efficiency of the coding system. This is of course equal to the ratio of the actual entropy of the channel symbols to the maximum possible entropy.

In general, ideal or nearly ideal encoding requires a long delay in the transmitter and receiver. In the noiseless case which we have been considering, the main function of this delay is to allow reasonably good matching of probabilities to corresponding lengths of sequences. With a good code the logarithm of the reciprocal probability of a long message must be proportional to the duration of the corresponding signal; in fact
$$\left|\frac{\log p^{-1}}{T} - C\right|$$
must be small for all but a small fraction of the long messages.

If a source can produce only one particular message its entropy is zero, and no channel is required. For example, a computing machine set up to calculate the successive digits of $\pi$ produces a definite sequence with no chance element. No channel is required to "transmit" this to another point. One could construct a second machine to compute the same sequence at the point. However, this may be impractical. In such a case we can choose to ignore some or all of the statistical knowledge we have of the source. We might consider the digits of $\pi$ to be a random sequence in that we construct a system capable of sending any sequence of digits. In a similar way we may choose to use some of our statistical knowledge of English in constructing a code, but not all of it. In such a case we consider the source with the maximum entropy subject to the statistical conditions we wish to retain. The entropy of this source determines the channel capacity which is necessary and sufficient. In the $\pi$ example the only information retained is that all the digits are chosen from the set $0, 1, \ldots, 9$. In the case of English one might wish to use the statistical saving possible due to letter frequencies, but nothing else. The maximum entropy source is then the first approximation to English and its entropy determines the required channel capacity.
As a simple example of some of these results consider a source which produces a sequence of letters chosen from among $A$, $B$, $C$, $D$ with probabilities $\frac12$, $\frac14$, $\frac18$, $\frac18$, successive symbols being chosen independently. We have
$$H = -\left(\tfrac12\log\tfrac12 + \tfrac14\log\tfrac14 + \tfrac28\log\tfrac18\right) = \tfrac74 \text{ bits per symbol.}$$
Thus we can approximate a coding system to encode messages from this source into binary digits with an average of $\frac74$ binary digits per symbol. In this case we can actually achieve the limiting value by the following code (obtained by the method of the second proof of Theorem 9):

    A    0
    B    10
    C    110
    D    111

The average number of binary digits used in encoding a sequence of $N$ symbols will be
$$N\left(\tfrac12 \times 1 + \tfrac14 \times 2 + \tfrac28 \times 3\right) = \tfrac74 N.$$
It is easily seen that the binary digits 0, 1 have probabilities $\frac12$, $\frac12$, so the $H$ for the coded sequences is one bit per symbol. Since, on the average, we have $\frac74$ binary symbols per original letter, the entropies on a time basis are the same. The maximum possible entropy for the original set is $\log 4 = 2$, occurring when $A$, $B$, $C$, $D$ have probabilities $\frac14$, $\frac14$, $\frac14$, $\frac14$. Hence the relative entropy is $\frac78$. We can translate the binary sequences into the original set of symbols on a two-to-one basis by the following table:

    00   A'
    01   B'
    10   C'
    11   D'
This double process then encodes the original message into the same symbols but with an average compression ratio $\frac78$.
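The figures in this example are easily verified numerically; the following sketch recomputes the entropy, the average code length, the balance of 0's and 1's in the coded stream, and the relative entropy.

```python
import math

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

H = -sum(p * math.log2(p) for p in probs.values())       # 7/4 bits per symbol
avg_len = sum(probs[s] * len(code[s]) for s in probs)    # 7/4 digits per symbol

# Fraction of 1's in the coded stream: expected 1's per letter over expected length.
ones = sum(probs[s] * code[s].count("1") for s in probs)
p_one = ones / avg_len                                   # 1/2

relative_entropy = H / math.log2(len(probs))             # 7/8

print(H, avg_len, p_one, relative_entropy)               # 1.75 1.75 0.5 0.875
```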
As a second example consider a source which produces a sequence of $A$'s and $B$'s with probability $p$ for $A$ and $q$ for $B$. If $p \ll q$ we have
$$H = -\log p^p (1-p)^{1-p} = -p \log \left[ p (1-p)^{(1-p)/p} \right] \doteq p \log \frac{e}{p},$$
since $(1-p)^{(1-p)/p}$ approaches $e^{-1}$ as $p \to 0$. In such a case one can construct a fairly good coding of the message on a 0, 1 channel by sending a special sequence, say 0000, for the infrequent symbol $A$ and then a sequence indicating the number of $B$'s following it. This could be indicated by the binary representation with all numbers containing the special sequence deleted. All numbers up to 16 are represented as usual; 16 is represented by the next binary number after 16 which does not contain four zeros, namely 17 = 10001, etc.

It can be shown that as $p \to 0$ the coding approaches ideal provided the length of the special sequence is properly adjusted.
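The renumbering of counts so that no codeword contains the special sequence can be sketched as follows; the framing needed to locate codeword boundaries on decoding is glossed over here, as it is in the text, and the sample message is hypothetical.

```python
def no_special_sequence(n, marker="0000"):
    """Binary representation of the n-th nonnegative integer whose binary
    form does not contain the special sequence (n = 0, 1, 2, ...);
    counts 0..15 are represented as usual and 16 becomes 10001."""
    found, k = -1, -1
    while found < n:
        k += 1
        if marker not in format(k, "b"):
            found += 1
    return format(k, "b")

def encode(message, marker="0000"):
    """For each infrequent A, emit the marker followed by the renumbered
    count of B's that follow it (framing on decoding is glossed over)."""
    return "".join(marker + no_special_sequence(len(run), marker)
                   for run in message.split("A")[1:])

print(no_special_sequence(16))      # '10001'
print(encode("ABBBBBABBA"))         # runs of 5, 2 and 0 B's
```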
PART II: THE DISCRETE CHANNEL WITH NOISE

11. REPRESENTATION OF A NOISY DISCRETE CHANNEL

We now consider the case where the signal is perturbed by noise during transmission or at one or the other of the terminals. This means that the received signal is not necessarily the same as that sent out by the transmitter. Two cases may be distinguished. If a particular transmitted signal always produces the same received signal, i.e., the received signal is a definite function of the transmitted signal, then the effect may be called distortion. If this function has an inverse (no two transmitted signals producing the same received signal) distortion may be corrected, at least in principle, by merely performing the inverse functional operation on the received signal.

The case of interest here is that in which the signal does not always undergo the same change in transmission. In this case we may assume the received signal $E$ to be a function of the transmitted signal $S$ and a second variable, the noise $N$:
$$E = f(S, N).$$
The noise is considered to be a chance variable just as the message was above. In general it may be represented by a suitable stochastic process. The most general type of noisy discrete channel we shall consider is a generalization of the finite state noise-free channel described previously. We assume a finite number of states and a set of probabilities
$$p_{\alpha,i}(\beta, j).$$
This is the probability, if the channel is in state $\alpha$ and symbol $i$ is transmitted, that symbol $j$ will be received and the channel left in state $\beta$. Thus $\alpha$ and $\beta$ range over the possible states, $i$ over the possible transmitted signals and $j$ over the possible received signals. In the case where successive symbols are independently perturbed by the noise there is only one state, and the channel is described by the set of transition probabilities $p_i(j)$, the probability of transmitted symbol $i$ being received as $j$.

If a noisy channel is fed by a source there are two statistical processes at work: the source and the noise. Thus there are a number of entropies that can be calculated. First there is the entropy $H(x)$ of the source or of the input to the channel (these will be equal if the transmitter is non-singular). The entropy of the output of the channel, i.e., the received signal, will be denoted by $H(y)$. In the noiseless case $H(y) = H(x)$. The joint entropy of input and output will be $H(x,y)$. Finally there are two conditional entropies $H_x(y)$ and $H_y(x)$, the entropy of the output when the input is known and conversely. Among these quantities we have the relations
$$H(x,y) = H(x) + H_x(y) = H(y) + H_y(x).$$
All of these entropies can be measured on a per-second or a per-symbol basis.
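For the memoryless case these definitions, and the relation just stated, can be checked directly. The sketch below assumes a hypothetical binary channel with a 10 per cent error probability fed by a source with $P(x=0) = 0.7$, computes the five entropies from their definitions, and confirms that $H(x,y) = H(x) + H_x(y) = H(y) + H_y(x)$.

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

p_x = [0.7, 0.3]                    # hypothetical input distribution
p_j_given_i = [[0.9, 0.1],          # p_i(j): row i transmitted, column j received
               [0.1, 0.9]]

# Joint distribution p(x, y) and output distribution p(y).
p_xy = [[p_x[i] * p_j_given_i[i][j] for j in range(2)] for i in range(2)]
p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]

H_x  = entropy(p_x)
H_y  = entropy(p_y)
H_xy = entropy([p for row in p_xy for p in row])
Hx_y = sum(p_xy[i][j] * math.log2(1 / p_j_given_i[i][j])        # H_x(y)
           for i in range(2) for j in range(2))
Hy_x = sum(p_xy[i][j] * math.log2(p_y[j] / p_xy[i][j])          # H_y(x)
           for i in range(2) for j in range(2))

# H(x, y) = H(x) + H_x(y) = H(y) + H_y(x); all three printed values agree.
print(f"{H_xy:.4f}  {H_x + Hx_y:.4f}  {H_y + Hy_x:.4f}")
```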