2.2 Shannon's Information Measures

Definition 2.13. The entropy H(X) of a random variable X is defined as

H(X) = −∑_x p(x) log p(x).    (2.35)

In the definitions of all information measures, we adopt the convention that summation is taken over the corresponding support. Such a convention is necessary because p(x) log p(x) in (2.35) is undefined if p(x) = 0.

The base of the logarithm in (2.35) can be chosen to be any convenient real number greater than 1. We write H(X) as H_α(X) when the base of the logarithm is α. When the base of the logarithm is 2, the unit for entropy is the bit. When the base of the logarithm is e, the unit for entropy is the nat. When the base of the logarithm is an integer D ≥ 2, the unit for entropy is the D-it (D-ary digit). In the context of source coding, the base is usually taken to be the size of the code alphabet. This will be discussed in Chapter 4.

In computer science, a bit means an entity which can take the value 0 or 1. In information theory, the entropy of a random variable is measured in bits. The reader should carefully distinguish these two meanings of a bit from each other.

Let g(X) be any function of a random variable X. We will denote the expectation of g(X) by Eg(X), i.e.,

Eg(X) = ∑_x p(x) g(x),    (2.36)

where the summation is over S_X. Then the definition of the entropy of a random variable X can be written as

H(X) = −E log p(X).    (2.37)

Expressions of Shannon's information measures in terms of expectations will be useful in subsequent discussions.

The entropy H(X) of a random variable X is a functional of the probability distribution p(x) which measures the average amount of information contained in X or, equivalently, the average amount of uncertainty removed upon revealing the outcome of X. Note that H(X) depends only on p(x), not on the actual values in X. Occasionally, we also denote H(X) by H(p).

For 0 ≤ γ ≤ 1, define

h_b(γ) = −γ log γ − (1 − γ) log(1 − γ)    (2.38)

with the convention 0 log 0 = 0, so that h_b(0) = h_b(1) = 0. With this convention, h_b(γ) is continuous at γ = 0 and γ = 1. h_b is called the binary entropy function. For a binary random variable X with distribution {γ, 1 − γ},

H(X) = h_b(γ).    (2.39)

Figure 2.1 is the plot of h_b(γ) versus γ in the base 2. Note that h_b(γ) achieves the maximum value 1 when γ = 1/2.

[Fig. 2.1. h_b(γ) versus γ in the base 2.]
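Since H(X) depends only on the distribution p(x), it is straightforward to evaluate numerically. The following Python sketch (not part of the text; the function names entropy and binary_entropy and the sample distributions are chosen only for illustration) computes (2.35) in base 2, summing only over the support in line with the convention 0 log 0 = 0, and confirms that h_b(γ) attains its maximum value 1 at γ = 1/2.

```python
import math

def entropy(p, base=2.0):
    """Entropy H(X) = -sum_x p(x) log p(x), as in (2.35).

    The sum is taken over the support: terms with p(x) = 0 are skipped,
    which is the convention 0 log 0 = 0 adopted in the text.
    """
    return -sum(px * math.log(px, base) for px in p if px > 0)

def binary_entropy(gamma, base=2.0):
    """Binary entropy function h_b(gamma) from (2.38)."""
    return entropy([gamma, 1.0 - gamma], base)

print(entropy([0.5, 0.5]))       # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))       # 0.0: no uncertainty
print(entropy([0.25] * 4))       # 2.0 bits: uniform over four values
print(binary_entropy(0.5))       # 1.0, the maximum of h_b, attained at gamma = 1/2
print(binary_entropy(0.11))      # about 0.5 bits
```

Changing the base argument to e or to an integer D ≥ 2 gives the entropy in nats or D-its, matching the units discussed above.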
The definition of the joint entropy of two random variables is similar to the definition of the entropy of a single random variable. Extension of this definition to more than two random variables is straightforward.

Definition 2.14. The joint entropy H(X,Y) of a pair of random variables X and Y is defined as

H(X,Y) = −∑_{x,y} p(x,y) log p(x,y) = −E log p(X,Y).    (2.40)

For two random variables, we define in the following the conditional entropy of one random variable when the other random variable is given.

Definition 2.15. For random variables X and Y, the conditional entropy of Y given X is defined as

H(Y|X) = −∑_{x,y} p(x,y) log p(y|x) = −E log p(Y|X).    (2.41)

From (2.41), we can write

H(Y|X) = ∑_x p(x) [ −∑_y p(y|x) log p(y|x) ].    (2.42)

The inner sum is the entropy of Y conditioning on a fixed x ∈ S_X. Thus we are motivated to express H(Y|X) as

H(Y|X) = ∑_x p(x) H(Y|X = x),    (2.43)

where

H(Y|X = x) = −∑_y p(y|x) log p(y|x).    (2.44)
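To make the decomposition (2.42)–(2.44) concrete, here is a small numerical check (illustrative only; the joint distribution p_xy and the variable names below are made up, not taken from the text). It evaluates H(Y|X) both directly from (2.41) and as the average ∑_x p(x) H(Y|X = x) of (2.43); the two computations agree.

```python
import math

log2 = lambda t: math.log(t, 2)

# A made-up joint pmf p(x, y) on {0, 1} x {0, 1}, used only for illustration.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal p(x) and conditional p(y|x) = p(x, y) / p(x).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
p_y_given_x = {(x, y): p / p_x[x] for (x, y), p in p_xy.items()}

# Joint entropy H(X,Y) from (2.40).
H_XY = -sum(p * log2(p) for p in p_xy.values() if p > 0)

# H(Y|X) directly from (2.41): the expectation of -log p(y|x) under p(x, y).
H_Y_given_X = -sum(p * log2(p_y_given_x[xy]) for xy, p in p_xy.items() if p > 0)

# H(Y|X) via (2.43): sum over x of p(x) * H(Y|X = x), with H(Y|X = x) as in (2.44).
H_Y_given_X_alt = 0.0
for x0, px in p_x.items():
    inner = -sum(q * log2(q)
                 for (x, y), q in p_y_given_x.items() if x == x0 and q > 0)
    H_Y_given_X_alt += px * inner

print(H_XY)                          # 1.75 bits
print(H_Y_given_X, H_Y_given_X_alt)  # both about 0.94 bits; the two forms agree
```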
Observe that the right-hand sides of (2.35) and (2.44) have exactly the same form. Similarly, for H(Y|X,Z), we write

H(Y|X,Z) = ∑_z p(z) H(Y|X,Z = z),    (2.45)

where

H(Y|X,Z = z) = −∑_{x,y} p(x,y|z) log p(y|x,z).    (2.46)

Proposition 2.16.

H(X,Y) = H(X) + H(Y|X)    (2.47)

and

H(X,Y) = H(Y) + H(X|Y).    (2.48)

Proof. Consider

H(X,Y) = −E log p(X,Y)    (2.49)
       = −E log [p(X) p(Y|X)]    (2.50)
       = −E log p(X) − E log p(Y|X)    (2.51)
       = H(X) + H(Y|X).    (2.52)

Note that (2.50) is justified because the summation of the expectation is over S_{XY}, and we have used the linearity of expectation (see Problem 5 at the end of the chapter) to obtain (2.51). This proves (2.47), and (2.48) follows by symmetry.

This proposition has the following interpretation. Consider revealing the outcome of a pair of random variables X and Y in two steps: first the outcome of X and then the outcome of Y. Then the proposition says that the total amount of uncertainty removed upon revealing both X and Y is equal to the sum of the uncertainty removed upon revealing X (uncertainty removed in the first step) and the uncertainty removed upon revealing Y once X is known (uncertainty removed in the second step).

Definition 2.17. For random variables X and Y, the mutual information between X and Y is defined as

I(X;Y) = ∑_{x,y} p(x,y) log [p(x,y) / (p(x)p(y))] = E log [p(X,Y) / (p(X)p(Y))].    (2.53)
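Proposition 2.16 and Definition 2.17 can be checked numerically for any finite joint distribution. The sketch below (illustrative only; it reuses the same made-up joint pmf as the previous snippet) computes H(X), H(Y|X), and H(X,Y) independently to confirm the chain rule (2.47), and evaluates I(X;Y) directly from (2.53).

```python
import math

log2 = lambda t: math.log(t, 2)

# The same made-up joint pmf as before, for illustration only.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def H(probs):
    """Entropy of a collection of probabilities, summed over the support."""
    return -sum(p * log2(p) for p in probs if p > 0)

H_X, H_Y, H_XY = H(p_x.values()), H(p_y.values()), H(p_xy.values())

# H(Y|X) from (2.41), using p(y|x) = p(x, y) / p(x).
H_Y_given_X = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

# Chain rule (2.47): the two numbers below are equal.
print(H_XY, H_X + H_Y_given_X)

# Mutual information from the definition (2.53); it also equals
# H(Y) - H(Y|X), anticipating Proposition 2.19.
I_XY = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(I_XY, H_Y - H_Y_given_X)
```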
Remark. I(X;Y) is symmetrical in X and Y.

Proposition 2.18. The mutual information between a random variable X and itself is equal to the entropy of X, i.e., I(X;X) = H(X).

Proof. This can be seen by considering

I(X;X) = E log [p(X) / p(X)²]    (2.54)
       = −E log p(X)    (2.55)
       = H(X).    (2.56)

The proposition is proved.

Remark. The entropy of X is sometimes called the self-information of X.

Proposition 2.19.

I(X;Y) = H(X) − H(X|Y),    (2.57)
I(X;Y) = H(Y) − H(Y|X),    (2.58)

and

I(X;Y) = H(X) + H(Y) − H(X,Y),    (2.59)

provided that all the entropies and conditional entropies are finite (see Example 2.46 in Section 2.8).

The proof of this proposition is left as an exercise.

From (2.57), we can interpret I(X;Y) as the reduction in uncertainty about X when Y is given or, equivalently, the amount of information about X provided by Y. Since I(X;Y) is symmetrical in X and Y, from (2.58), we can as well interpret I(X;Y) as the amount of information about Y provided by X.

The relations between the (joint) entropies, conditional entropies, and mutual information for two random variables X and Y are given in Propositions 2.16 and 2.19. These relations can be summarized by the diagram in Figure 2.2, which is a variation of the Venn diagram (the rectangle representing the universal set in a usual Venn diagram is missing in Figure 2.2). One can check that all the relations between Shannon's information measures for X and Y which are shown in Figure 2.2 are consistent with the relations given in Propositions 2.16 and 2.19. This one-to-one correspondence between Shannon's information measures and set theory is not just a coincidence for two random variables. We will discuss this in depth when we introduce the I-Measure in Chapter 3.

[Fig. 2.2. Relationship between entropies and mutual information for two random variables, relating H(X,Y), H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y).]
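Since the proof of Proposition 2.19 is left as an exercise, the following is a sketch of one way to obtain (2.57), written in the expectation notation of (2.37), (2.41), and (2.53) and in the same style as the proof of Proposition 2.16; it is offered only as an illustration and need not be the argument intended by the exercise.

```latex
\begin{align*}
I(X;Y) &= E \log \frac{p(X,Y)}{p(X)\,p(Y)}
          && \text{by (2.53)} \\
       &= E \log \frac{p(X \mid Y)}{p(X)}
          && \text{since } p(x,y) = p(x \mid y)\, p(y) \\
       &= E \log p(X \mid Y) - E \log p(X)
          && \text{linearity of expectation} \\
       &= -H(X \mid Y) + H(X)
          && \text{by (2.41) with the roles of $X$ and $Y$ exchanged, and (2.37),}
\end{align*}
```

which is (2.57). Identity (2.58) then follows from the symmetry of I(X;Y) in X and Y, and (2.59) follows by substituting H(X|Y) = H(X,Y) − H(Y), i.e., (2.48), into (2.57).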
Analogous to entropy, there is a conditional version of mutual information called conditional mutual information.

Definition 2.20. For random variables X, Y, and Z, the mutual information between X and Y conditioning on Z is defined as

I(X;Y|Z) = ∑_{x,y,z} p(x,y,z) log [p(x,y|z) / (p(x|z)p(y|z))] = E log [p(X,Y|Z) / (p(X|Z)p(Y|Z))].    (2.60)

Remark. I(X;Y|Z) is symmetrical in X and Y.

Analogous to conditional entropy, we write

I(X;Y|Z) = ∑_z p(z) I(X;Y|Z = z),    (2.61)

where

I(X;Y|Z = z) = ∑_{x,y} p(x,y|z) log [p(x,y|z) / (p(x|z)p(y|z))].    (2.62)

Similarly, when conditioning on two random variables, we write

I(X;Y|Z,T) = ∑_t p(t) I(X;Y|Z,T = t),    (2.63)

where

I(X;Y|Z,T = t) = ∑_{x,y,z} p(x,y,z|t) log [p(x,y|z,t) / (p(x|z,t)p(y|z,t))].    (2.64)

Conditional mutual information satisfies the same set of relations given in Propositions 2.18 and 2.19 for mutual information, except that all the terms are now conditioned on a random variable Z. We state these relations in the next two propositions. The proofs are omitted.
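Definition 2.20 and the decomposition (2.61) can also be checked numerically. The sketch below (illustrative only; the three-variable joint pmf p_xyz is a made-up example) evaluates I(X;Y|Z) directly from (2.60) and again as ∑_z p(z) I(X;Y|Z = z) using (2.61) and (2.62); the two computations coincide.

```python
import math

log2 = lambda t: math.log(t, 2)

# A made-up joint pmf p(x, y, z) on {0, 1}^3, used only for illustration.
p_xyz = {
    (0, 0, 0): 0.20, (0, 1, 0): 0.10, (1, 0, 0): 0.10, (1, 1, 0): 0.10,
    (0, 0, 1): 0.05, (0, 1, 1): 0.15, (1, 0, 1): 0.15, (1, 1, 1): 0.15,
}

def marginal(keep):
    """Marginal pmf over the coordinate positions in `keep`, e.g. (0, 2) for (x, z)."""
    out = {}
    for xyz, p in p_xyz.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_z, p_xz, p_yz = marginal((2,)), marginal((0, 2)), marginal((1, 2))

# I(X;Y|Z) directly from (2.60): the ratio p(x,y|z) / (p(x|z) p(y|z))
# simplifies to p(x,y,z) p(z) / (p(x,z) p(y,z)).
I_def = sum(
    p * log2(p * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
    for (x, y, z), p in p_xyz.items() if p > 0
)

# The same quantity via (2.61): sum over z of p(z) * I(X;Y|Z = z), where each
# term I(X;Y|Z = z) is computed from (2.62).
I_decomp = 0.0
for (z,), pz in p_z.items():
    I_z = sum(
        (p / pz) * log2((p / pz) / ((p_xz[(x, z)] / pz) * (p_yz[(y, z)] / pz)))
        for (x, y, zz), p in p_xyz.items() if zz == z and p > 0
    )
    I_decomp += pz * I_z

print(I_def, I_decomp)  # the two evaluations agree
```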