In Appendix 2, the following result is established:

Theorem 2: The only H satisfying the three above assumptions is of the form:

    H = -K \sum_{i=1}^{n} p_i \log p_i

where K is a positive constant.

This theorem, and the assumptions required for its proof, are in no way necessary for the present theory. It is given chiefly to lend a certain plausibility to some of our later definitions. The real justification of these definitions, however, will reside in their implications.

Quantities of the form H = -\sum p_i \log p_i (the constant K merely amounts to a choice of a unit of measure) play a central role in information theory as measures of information, choice and uncertainty. The form of H will be recognized as that of entropy as defined in certain formulations of statistical mechanics,^8 where p_i is the probability of a system being in cell i of its phase space. H is then, for example, the H in Boltzmann's famous H theorem. We shall call H = -\sum p_i \log p_i the entropy of the set of probabilities p_1, ..., p_n. If x is a chance variable we will write H(x) for its entropy; thus x is not an argument of a function but a label for a number, to differentiate it from H(y) say, the entropy of the chance variable y.

^8 See, for example, R. C. Tolman, Principles of Statistical Mechanics, Oxford, Clarendon, 1938.

The entropy in the case of two possibilities with probabilities p and q = 1 - p, namely

    H = -(p \log p + q \log q)

is plotted in Fig. 7 as a function of p.

[Fig. 7 — Entropy in the case of two possibilities with probabilities p and (1 - p); the curve plots H in bits against p from 0 to 1.]

The quantity H has a number of interesting properties which further substantiate it as a reasonable measure of choice or information. Several of them are illustrated numerically in the sketches following the list.

1. H = 0 if and only if all the p_i but one are zero, this one having the value unity. Thus only when we are certain of the outcome does H vanish. Otherwise H is positive.

2. For a given n, H is a maximum and equal to \log n when all the p_i are equal (i.e., p_i = 1/n). This is also intuitively the most uncertain situation.
3. Suppose there are two events, x and y, in question with m possibilities for the first and n for the second. Let p(i, j) be the probability of the joint occurrence of i for the first and j for the second. The entropy of the joint event is

    H(x, y) = -\sum_{i, j} p(i, j) \log p(i, j)

while

    H(x) = -\sum_{i, j} p(i, j) \log \sum_j p(i, j)

    H(y) = -\sum_{i, j} p(i, j) \log \sum_i p(i, j).

It is easily shown that

    H(x, y) \le H(x) + H(y)

with equality only if the events are independent (i.e., p(i, j) = p(i) p(j)). The uncertainty of a joint event is less than or equal to the sum of the individual uncertainties.

4. Any change toward equalization of the probabilities p_1, p_2, ..., p_n increases H. Thus if p_1 < p_2 and we increase p_1, decreasing p_2 an equal amount so that p_1 and p_2 are more nearly equal, then H increases. More generally, if we perform any "averaging" operation on the p_i of the form

    p'_i = \sum_j a_{ij} p_j

where \sum_i a_{ij} = \sum_j a_{ij} = 1, and all a_{ij} \ge 0, then H increases (except in the special case where this transformation amounts to no more than a permutation of the p_j, with H of course remaining the same).

5. Suppose there are two chance events x and y as in 3, not necessarily independent. For any particular value i that x can assume there is a conditional probability p_i(j) that y has the value j. This is given by

    p_i(j) = \frac{p(i, j)}{\sum_j p(i, j)}.

We define the conditional entropy of y, H_x(y), as the average of the entropy of y for each value of x, weighted according to the probability of getting that particular x. That is

    H_x(y) = -\sum_{i, j} p(i, j) \log p_i(j).

This quantity measures how uncertain we are of y on the average when we know x. Substituting the value of p_i(j) we obtain

    H_x(y) = -\sum_{i, j} p(i, j) \log p(i, j) + \sum_{i, j} p(i, j) \log \sum_j p(i, j)
           = H(x, y) - H(x)

or

    H(x, y) = H(x) + H_x(y).

The uncertainty (or entropy) of the joint event x, y is the uncertainty of x plus the uncertainty of y when x is known.

6. From 3 and 5 we have

    H(x) + H(y) \ge H(x, y) = H(x) + H_x(y).

Hence

    H(y) \ge H_x(y).

The uncertainty of y is never increased by knowledge of x. It will be decreased unless x and y are independent events, in which case it is not changed.
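The entropy function and properties 1, 2 and 4 above are easy to check numerically. The following minimal Python sketch is not part of the original text; the function name `entropy` and the particular distributions are illustrative choices.

```python
import math

def entropy(probs):
    """Entropy H = -sum p_i log2 p_i, in bits; terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]) == 0)      # True: H vanishes only for a certain outcome (property 1)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 = log2 4, the maximum for n = 4 (property 2)
print(entropy([0.5, 0.25, 0.15, 0.10]))   # about 1.74, less than log2 4

# The two-alternative case of Fig. 7: H = -(p log p + q log q), symmetric and maximal at p = 0.5.
for p in (0.1, 0.5, 0.9):
    print(p, entropy([p, 1 - p]))

# Property 4: equalizing two of the probabilities increases H.
print(entropy([0.7, 0.1, 0.2]))           # about 1.157
print(entropy([0.6, 0.2, 0.2]))           # about 1.371, larger after moving 0.1 from p1 to p2
```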
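Properties 3, 5 and 6 relate the joint, marginal and conditional entropies. The sketch below is again only an illustration, assuming a small hand-chosen joint distribution p(i, j); it computes H(x, y), H(x), H(y) and H_x(y) and confirms that H(x, y) = H(x) + H_x(y), that H(x, y) ≤ H(x) + H(y), and that H_x(y) ≤ H(y).

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A joint distribution p(i, j) over two values of x (rows) and two of y (columns).
p = [[0.4, 0.1],
     [0.2, 0.3]]

p_x = [sum(row) for row in p]                              # p(i) = sum_j p(i, j)
p_y = [sum(p[i][j] for i in range(2)) for j in range(2)]   # p(j) = sum_i p(i, j)

H_xy = H([p[i][j] for i in range(2) for j in range(2)])
H_x, H_y = H(p_x), H(p_y)

# Conditional entropy H_x(y) = -sum p(i, j) log p_i(j), with p_i(j) = p(i, j) / p(i).
H_x_of_y = -sum(p[i][j] * math.log2(p[i][j] / p_x[i])
                for i in range(2) for j in range(2) if p[i][j] > 0)

print(H_xy, H_x + H_x_of_y)    # equal up to rounding: H(x, y) = H(x) + H_x(y)
print(H_xy <= H_x + H_y)       # True, with equality only for independent events
print(H_x_of_y <= H_y)         # True: knowing x never increases the uncertainty of y
```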
7. THE ENTROPY OF AN INFORMATION SOURCE

Consider a discrete source of the finite state type considered above. For each possible state i there will be a set of probabilities p_i(j) of producing the various possible symbols j. Thus there is an entropy H_i for each state. The entropy of the source will be defined as the average of these H_i weighted in accordance with the probability of occurrence of the states in question:

    H = \sum_i P_i H_i = -\sum_{i, j} P_i p_i(j) \log p_i(j).

This is the entropy of the source per symbol of text. If the Markoff process is proceeding at a definite time rate there is also an entropy per second

    H' = \sum_i f_i H_i

where f_i is the average frequency (occurrences per second) of state i. Clearly

    H' = m H

where m is the average number of symbols produced per second. H or H' measures the amount of information generated by the source per symbol or per second. If the logarithmic base is 2, they will represent bits per symbol or per second.

If successive symbols are independent then H is simply -\sum p_i \log p_i where p_i is the probability of symbol i. Suppose in this case we consider a long message of N symbols. It will contain with high probability about p_1 N occurrences of the first symbol, p_2 N occurrences of the second, etc. Hence the probability of this particular message will be roughly

    p = p_1^{p_1 N} p_2^{p_2 N} \cdots p_n^{p_n N}

or

    \log p \doteq N \sum_i p_i \log p_i

    \log p \doteq -NH

    H \doteq \frac{\log 1/p}{N}.

H is thus approximately the logarithm of the reciprocal probability of a typical long sequence divided by the number of symbols in the sequence. The same result holds for any source. Stated more precisely we have (see Appendix 3):

Theorem 3: Given any ε > 0 and δ > 0, we can find an N_0 such that the sequences of any length N ≥ N_0 fall into two classes:

1. A set whose total probability is less than ε.

2. The remainder, all of whose members have probabilities satisfying the inequality

    \left| \frac{\log p^{-1}}{N} - H \right| < \delta.

In other words we are almost certain to have (log p^{-1})/N very close to H when N is large.
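As a concrete check of these formulas, here is a small Python sketch (an illustration, not from the original text). It defines a two-state Markoff source in which the symbol produced is also the next state, computes H = Σ P_i H_i from the stationary state probabilities, generates a long sequence, and verifies that (log 1/p)/N comes out close to H, as Theorem 3 asserts. The transition probabilities and the names `P` and `Pi` are assumptions made for the example.

```python
import math, random

# A two-state Markoff source: in state s the next symbol is 0 or 1 with the
# probabilities below, and the symbol produced is also the next state.
P = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.4, 1: 0.6}}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Stationary state probabilities P_i (solved by hand for this two-state chain).
Pi = {0: 0.8, 1: 0.2}

# Entropy of the source per symbol: H = sum_i P_i H_i.
H_source = sum(Pi[s] * H(P[s].values()) for s in P)
print("H =", H_source, "bits per symbol")

# Generate a long sequence and compare (log 1/p)/N with H (Theorem 3).
random.seed(1)
N, s, log_p = 100_000, 0, 0.0
for _ in range(N):
    nxt = 0 if random.random() < P[s][0] else 1
    log_p += math.log2(P[s][nxt])   # accumulate the log probability of the sequence
    s = nxt
print("(log 1/p)/N =", -log_p / N)  # close to H for large N
```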
A closely related result deals with the number of sequences of various probabilities. Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set, starting with the most probable one, in order to accumulate a total probability q for those taken.

Theorem 4:

    \lim_{N \to \infty} \frac{\log n(q)}{N} = H

when q does not equal 0 or 1.

We may interpret \log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q. Then \log n(q) / N is the number of bits per symbol for the specification. The theorem says that for large N this will be independent of q and equal to H. The rate of growth of the logarithm of the number of reasonably probable sequences is given by H, regardless of our interpretation of "reasonably probable." Due to these results, which are proved in Appendix 3, it is possible for most purposes to treat the long sequences as though there were just 2^{HN} of them, each with a probability 2^{-HN}.

The next two theorems show that H and H' can be determined by limiting operations directly from the statistics of the message sequences, without reference to the states and transition probabilities between states.

Theorem 5: Let p(B_i) be the probability of a sequence B_i of symbols from the source. Let

    G_N = -\frac{1}{N} \sum_i p(B_i) \log p(B_i)

where the sum is over all sequences B_i containing N symbols. Then G_N is a monotonic decreasing function of N and

    \lim_{N \to \infty} G_N = H.

Theorem 6: Let p(B_i, S_j) be the probability of sequence B_i followed by symbol S_j and p_{B_i}(S_j) = p(B_i, S_j) / p(B_i) be the conditional probability of S_j after B_i. Let

    F_N = -\sum_{i, j} p(B_i, S_j) \log p_{B_i}(S_j)

where the sum is over all blocks B_i of N - 1 symbols and over all symbols S_j. Then F_N is a monotonic decreasing function of N,

    F_N = N G_N - (N - 1) G_{N-1},

    G_N = \frac{1}{N} \sum_{n=1}^{N} F_n,

    F_N \le G_N,

and \lim_{N \to \infty} F_N = H.

These results are derived in Appendix 3. They show that a series of approximations to H can be obtained by considering only the statistical structure of the sequences extending over 1, 2, ..., N symbols. F_N is the better approximation. In fact F_N is the entropy of the Nth order approximation to the source of the type discussed above. If there are no statistical influences extending over more than N symbols, that is if the conditional probability of the next symbol knowing the preceding (N - 1) is not changed by a knowledge of any before that, then F_N = H. F_N of course is the conditional entropy of the next symbol when the (N - 1) preceding ones are known, while G_N is the entropy per symbol of blocks of N symbols.
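Theorems 5 and 6 suggest a practical procedure: approximate H from block statistics alone. The sketch below is an illustration using the same assumed two-state Markoff source as in the previous sketch; it computes G_N and F_N from exact block probabilities, showing G_N decreasing toward H, F_N ≤ G_N, and F_N already equal to H for N ≥ 2, since this source has a memory of one symbol. It also prints the redundancy, one minus the relative entropy, of the binary source.

```python
import math
from itertools import product

# Same assumed two-state (first-order Markoff) source as in the previous sketch.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
Pi = {0: 0.8, 1: 0.2}                       # stationary state probabilities

def block_prob(block):
    """Exact probability of a block of symbols under the stationary source."""
    prob = Pi[block[0]]
    for a, b in zip(block, block[1:]):
        prob *= P[a][b]
    return prob

def G(N):
    """G_N = -(1/N) * sum over all N-symbol blocks B of p(B) log p(B)."""
    return -sum(block_prob(B) * math.log2(block_prob(B))
                for B in product((0, 1), repeat=N)) / N

H = sum(Pi[s] * -sum(p * math.log2(p) for p in P[s].values()) for s in P)
prev = None
for N in range(1, 9):
    G_N = G(N)
    F_N = N * G_N - (N - 1) * prev if prev is not None else G_N
    print(N, round(G_N, 4), round(F_N, 4))  # G_N decreases toward H; F_N <= G_N
    prev = G_N
print("H =", round(H, 4))                   # F_N equals H for N >= 2 here
print("redundancy =", round(1 - H / math.log2(2), 4))  # one minus the relative entropy
```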
The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write English half of what we write is determined by the structure of the language and half is chosen freely. The figure 50% was found by several independent methods which all gave results in this neighborhood. One is by calculation of the entropy of the approximations to English. A second method is to delete a certain fraction of the letters from a sample of English text and then let someone attempt to restore them. If they can be restored when 50% are deleted the redundancy must be greater than 50%. A third method depends on certain known results in cryptography.

Two extremes of redundancy in English prose are represented by Basic English and by James Joyce's book "Finnegans Wake". The Basic English vocabulary is limited to 850 words and the redundancy is very high. This is reflected in the expansion that occurs when a passage is translated into Basic English. Joyce on the other hand enlarges the vocabulary and is alleged to achieve a compression of semantic content.

The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc.

8. REPRESENTATION OF THE ENCODING AND DECODING OPERATIONS

We have yet to represent mathematically the operations performed by the transmitter and receiver in encoding and decoding the information. Either of these will be called a discrete transducer. The input to the transducer is a sequence of input symbols and its output a sequence of output symbols. The transducer may have an internal memory so that its output depends not only on the present input symbol but also on the past history. We assume that the internal memory is finite, i.e., there exist a finite number m of possible states of the transducer and that its output is a function of the present state and the present input symbol. The next state will be a second function of these two quantities. Thus a transducer can be described by two functions:

    y_n = f(x_n, \alpha_n)

    \alpha_{n+1} = g(x_n, \alpha_n)

where

x_n is the nth input symbol,

α_n is the state of the transducer when the nth input symbol is introduced,

y_n is the output symbol (or sequence of output symbols) produced when x_n is introduced if the state is α_n.

If the output symbols of one transducer can be identified with the input symbols of a second, they can be connected in tandem and the result is also a transducer. If there exists a second transducer which operates on the output of the first and recovers the original input, the first transducer will be called non-singular and the second will be called its inverse. (A computational sketch of a non-singular transducer and its inverse is given at the end of this section.)

Theorem 7: The output of a finite state transducer driven by a finite state statistical source is a finite state statistical source, with entropy (per unit time) less than or equal to that of the input. If the transducer is non-singular they are equal.

Let α represent the state of the source, which produces a sequence of symbols x_i; and let β be the state of the transducer, which produces, in its output, blocks of symbols y_j. The combined system can be represented by the "product state space" of pairs (α, β). Two points in the space, (α_1, β_1) and (α_2, β_2), are connected by a line if α_1 can produce an x which changes β_1 to β_2, and this line is given the probability of that x in this case.
The line is labeled with the block of y_j symbols produced by the transducer. The entropy of the output can be calculated as the weighted sum over the states. If we sum first on β each resulting term is less than or equal to the corresponding term for α, hence the entropy is not increased. If the transducer is non-singular let its output be connected to the inverse transducer. If H'_1, H'_2 and H'_3 are the output entropies of the source, the first and second transducers respectively, then

    H'_1 \ge H'_2 \ge H'_3 = H'_1

and therefore H'_1 = H'_2.
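The transducer formalism above is concrete enough to express directly in code. The following Python sketch is an illustration rather than anything from the original text: the `Transducer` class and the choice of a differential encoder over the binary alphabet are assumptions made for the example. It implements the pair of functions y_n = f(x_n, α_n), α_{n+1} = g(x_n, α_n), builds an inverse transducer, and checks that connecting the two in tandem recovers the original input, which is what makes the first transducer non-singular.

```python
class Transducer:
    """A finite-state transducer: y_n = f(x_n, a_n), a_{n+1} = g(x_n, a_n)."""
    def __init__(self, f, g, initial_state):
        self.f, self.g, self.initial = f, g, initial_state

    def run(self, xs):
        a, ys = self.initial, []
        for x in xs:
            ys.append(self.f(x, a))   # output depends on the input symbol and the state
            a = self.g(x, a)          # next state depends on the same two quantities
        return ys

# A differential encoder over {0, 1}: the state is the previous output symbol.
encoder = Transducer(f=lambda x, a: x ^ a, g=lambda x, a: x ^ a, initial_state=0)
# Its inverse: the state is the previous received symbol.
decoder = Transducer(f=lambda y, a: y ^ a, g=lambda y, a: y, initial_state=0)

x = [1, 0, 0, 1, 1, 1, 0, 1]
y = encoder.run(x)
print(y)                      # the encoded sequence
print(decoder.run(y) == x)    # True: the encoder is non-singular, so by Theorem 7
                              # its output has the same entropy per symbol as its input
```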