It is sometimes called the information divergence, the cross-entropy, or the Kullback entropy between P(x) and Q(x).

• In general, D(P||Q) ≠ D(Q||P), so relative entropy does not have the symmetry required of a true "distance" measure.

• The divergence inequality: D(P||Q) ≥ 0, with equality iff P(x) = Q(x) for all x ∈ X.

Proof. Let S_X = {x : P(x) > 0} be the support set of P(x). Then

    -D(P||Q) = \sum_{x \in S_X} P(x) \log \frac{Q(x)}{P(x)}                                   (2.14)
             \le \sum_{x \in S_X} P(x) \Big[ \frac{Q(x)}{P(x)} - 1 \Big] \log e               (by the IT-inequality, \log r \le (r - 1)\log e)
             = \Big[ \sum_{x \in S_X} Q(x) - \sum_{x \in S_X} P(x) \Big] \log e
             \le \Big[ \sum_{x \in \mathcal{X}} Q(x) - \sum_{x \in \mathcal{X}} P(x) \Big] \log e = 0

• Conditional relative entropy: For joint PMFs P(x, y) and Q(x, y), the conditional relative entropy is

    D(P(y|x)||Q(y|x)) = \sum_x P(x) \sum_y P(y|x) \log \frac{P(y|x)}{Q(y|x)}
                      = E_{P(x,y)} \Big[ \log \frac{P(y|x)}{Q(y|x)} \Big]                     (2.15)

• Chain rule for relative entropy:

    D(P(x, y)||Q(x, y)) = D(P(x)||Q(x)) + D(P(y|x)||Q(y|x))

• Example: Entropy written as relative entropy. Suppose that X takes values in X with |X| = K, and let U have the uniform distribution over X. Then

    D(P_X||P_U) = \sum_x P_X(x) \log \frac{P_X(x)}{P_U(x)} = \sum_x P_X(x) \log \frac{P_X(x)}{1/K} = \log K - H(X)

It follows that H(X) ≤ log |X|.
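As a quick numerical illustration of these identities, here is a minimal Python sketch (the PMFs P and Q below are made-up examples, not from the text; logarithms are base 2, so all quantities are in bits):

```python
import math

def D(P, Q):
    """Relative entropy D(P||Q) in bits, summed over the support of P."""
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

def H(P):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

# Made-up example PMFs on a 4-letter alphabet.
P = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
Q = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
U = {x: 1 / len(P) for x in P}             # uniform distribution, playing the role of P_U

print(D(P, Q), D(Q, P))                    # both >= 0, and in general not equal
print(D(P, U), math.log2(len(P)) - H(P))   # the two sides of the example agree
```

Running it confirms that D(P||Q) and D(Q||P) are both non-negative but unequal, and that D(P||U) coincides with log K − H(X).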
2.4.2 Mutual information

Mutual information is a measure of the amount of information that one r.v. contains about another r.v. It is the reduction in the uncertainty of one r.v. due to the knowledge of the other.

Definition 2.4.2. Consider two random variables X and Y with a joint PMF P(x, y) and marginal PMFs P(x) and P(y). The (average) mutual information I(X; Y) between X and Y is the relative entropy between P(x, y) and P(x)P(y), i.e.,

    I(X; Y) = D(P(x, y)||P(x)P(y))
            = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
            = E_{P(x,y)} \Big[ \log \frac{P(x, y)}{P(x)P(y)} \Big]
            = \sum_x \sum_y P(x, y) \log \frac{P(x|y)}{P(x)}                                  (2.16)

Properties of mutual information:

1. Non-negativity of mutual information: I(X; Y) ≥ 0, with equality iff X and Y are independent.

   Proof. I(X; Y) = D(P(x, y)||P(x)P(y)) ≥ 0, with equality iff P(x, y) = P(x)P(y).

2. Symmetry of mutual information: I(X; Y) = I(Y; X).

3. Relation to entropy:

    I(X; Y) = \sum_x \sum_y P(x, y) \log \frac{P(x|y)}{P(x)}
            = -\sum_{x,y} P(x, y) \log P(x) - \Big( -\sum_{x,y} P(x, y) \log P(x|y) \Big)
            = H(X) - H(X|Y)                                                                   (2.17)

   By symmetry, it also follows that I(X; Y) = H(Y) - H(Y|X). Since H(XY) = H(X) + H(Y|X), we have

    I(X; Y) = H(X) + H(Y) - H(XY)

4. I(X; X) = H(X) - H(X|X) = H(X). Hence, entropy is sometimes referred to as average self-information.
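To make the equivalent forms of I(X; Y) concrete, the sketch below (Python used as a calculator, with a made-up joint PMF) computes the mutual information once as the relative entropy D(P(x, y)||P(x)P(y)) and once as H(X) + H(Y) − H(XY); the two values agree.

```python
import math
from collections import defaultdict

def H(P):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

# Made-up joint PMF P(x, y); keys are (x, y) pairs.
Pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals P(x) and P(y).
Px, Py = defaultdict(float), defaultdict(float)
for (x, y), p in Pxy.items():
    Px[x] += p
    Py[y] += p

# I(X;Y) as the relative entropy D( P(x,y) || P(x)P(y) ), eq. (2.16).
I = sum(p * math.log2(p / (Px[x] * Py[y])) for (x, y), p in Pxy.items() if p > 0)

# The same quantity via entropies: I(X;Y) = H(X) + H(Y) - H(XY).
print(I, H(Px) + H(Py) - H(Pxy))   # both print the same value (about 0.125 bits here)
```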
[Figure 2.3: Mutual information and conditional entropy — Venn-style diagram relating H(X), H(Y), H(XY), H(X|Y), H(Y|X), and I(X;Y).]

2.4.3 Conditional mutual information

Definition 2.4.3. The conditional mutual information between the random variables X and Y, given the r.v. Z, is defined by

    I(X; Y|Z) = H(X|Z) - H(X|YZ)
              = \sum_x \sum_y \sum_z P(x, y, z) \log \frac{P(x, y|z)}{P(x|z)P(y|z)}
              = E_{P(x,y,z)} \Big[ \log \frac{P(x|yz)}{P(x|z)} \Big]                          (2.18)

It follows from the definition that I(X; Y|Z) = I(Y; X|Z) ≥ 0, with equality iff X and Y are statistically independent conditional on each value of Z, i.e., P(x, y|z) = P(x|z)P(y|z) for each element of the joint sample space for which P(z) > 0.

We can visualize the situation in which I(X; Y|Z) = 0 as a pair of channels in cascade, as shown in Fig. 2.4. We assume that the output of the second channel depends statistically only on the input to the second channel, i.e., p(y|z) = p(y|z, x) for all x, y, z with p(z, x) > 0. Multiplying both sides by P(x|z), we obtain P(x, y|z) = P(x|z)P(y|z), so that I(X; Y|Z) = 0.

[Figure 2.4: Channels in cascade — X enters channel 1, whose output Z drives a second stage y = f(z) that produces Y.]
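The cascade argument can be checked numerically. The sketch below uses made-up channel probabilities and a deterministic second stage Y = f(Z), as in Fig. 2.4; it builds the joint PMF P(x, z, y) = P(x)P(z|x)1{y = f(z)} and evaluates I(X; Y|Z), which comes out to 0.

```python
import math
from collections import defaultdict
from itertools import product

# Made-up cascade as in Fig. 2.4: X -> channel 1 -> Z, then Y = f(Z).
Px = {0: 0.5, 1: 0.5}
Pz_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # first (noisy) channel

def f(z):
    return 1 - z                                          # deterministic second stage

# Joint PMF P(x, z, y) = P(x) P(z|x) 1{y = f(z)}.
Pxzy = {(x, z, f(z)): Px[x] * Pz_given_x[x][z] for x, z in product((0, 1), repeat=2)}

def cond_mutual_info(Pxzy):
    """I(X;Y|Z) in bits, computed from a joint PMF over (x, z, y) triples via (2.18)."""
    Pz, Pxz, Pzy = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, z, y), p in Pxzy.items():
        Pz[z] += p; Pxz[(x, z)] += p; Pzy[(z, y)] += p
    # P(x,y|z) / (P(x|z) P(y|z)) = P(x,z,y) P(z) / (P(x,z) P(z,y))
    return sum(p * math.log2(p * Pz[z] / (Pxz[(x, z)] * Pzy[(z, y)]))
               for (x, z, y), p in Pxzy.items() if p > 0)

print(cond_mutual_info(Pxzy))   # 0.0: given Z, the output Y tells us nothing more about X
```

The same computation would still return 0 if the second stage were a noisy channel p(y|z), as long as Y depends on (X, Z) only through Z.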
• An important property:

    I(X; YZ) = I(YZ; X) = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z)                           (2.19)

Proof.

    I(X; YZ) = H(YZ) - H(YZ|X)
             = H(Y) + H(Z|Y) - [H(Y|X) + H(Z|XY)]
             = [H(Y) - H(Y|X)] + [H(Z|Y) - H(Z|XY)]
             = I(X; Y) + I(X; Z|Y)                                                            (2.20)

Interchanging the roles of Y and Z gives the other decomposition.

Theorem 2.4.1 (Chain rule for mutual information).

    I(X_1, X_2, \ldots, X_N; Y) = \sum_{n=1}^{N} I(X_n; Y | X_1, \ldots, X_{n-1})

Proof.

    I(X_1, X_2, \ldots, X_N; Y) = H(X_1, X_2, \ldots, X_N) - H(X_1, X_2, \ldots, X_N | Y)
        = \sum_{n=1}^{N} H(X_n | X_1, X_2, \ldots, X_{n-1}) - \sum_{n=1}^{N} H(X_n | X_1, X_2, \ldots, X_{n-1}, Y)
        = \sum_{n=1}^{N} I(X_n; Y | X_1, \ldots, X_{n-1})                                     (2.21)

2.4.4 Data processing inequality

The data processing inequality can be used to show that no clever manipulation of the data can improve the inferences that can be made from the data.

Definition 2.4.4. Random variables X, Z, Y are said to form a Markov chain in that order (denoted by X → Z → Y) if the conditional distribution of Y depends only on Z and is conditionally independent of X. Specifically, X → Z → Y if

    P(x, z, y) = P(x)P(z|x)P(y|z)

Theorem 2.4.2 (Data processing inequality). If X → Z → Y, then

    I(X; Z) ≥ I(X; Y)

Proof. By the chain rule, we obtain

    I(X; YZ) = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z)

Since X and Y are independent given Z, we have I(X; Y|Z) = 0. Thus,

    I(X; Z) = I(X; Y) + I(X; Z|Y)                                                             (2.22)

From the non-negativity of mutual information, I(X; Z|Y) ≥ 0. Thus, (2.22) implies that I(X; Z) ≥ I(X; Y).
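Here is a numerical sanity check of the decomposition (2.19)/(2.22) and of the data processing inequality; the channel matrices below are made up, and the helpers marginal and mi are ad hoc constructions for this sketch, not a standard API.

```python
import math
from collections import defaultdict
from itertools import product

def H(P):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def marginal(P, keep):
    """Marginal PMF of the tuple coordinates listed in `keep`."""
    M = defaultdict(float)
    for outcome, p in P.items():
        M[tuple(outcome[i] for i in keep)] += p
    return M

def mi(P, A, B):
    """Mutual information I(A;B) in bits; A and B are index tuples into the joint outcome."""
    return H(marginal(P, A)) + H(marginal(P, B)) - H(marginal(P, A + B))

# Made-up Markov chain X -> Z -> Y: two noisy binary channels in cascade.
Px = {0: 0.5, 1: 0.5}
Pz_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}   # first channel P(z|x)
Py_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}   # second channel P(y|z), depends on Z only
Pxzy = {(x, z, y): Px[x] * Pz_x[x][z] * Py_z[z][y]
        for x, z, y in product((0, 1), repeat=3)}

IXZ  = mi(Pxzy, (0,), (1,))      # I(X;Z)
IXY  = mi(Pxzy, (0,), (2,))      # I(X;Y)
IXYZ = mi(Pxzy, (0,), (1, 2))    # I(X;YZ)

print(IXYZ - IXZ)                # I(X;Y|Z) from (2.19): ~0, since X -> Z -> Y
print(IXYZ - IXY)                # I(X;Z|Y) from (2.19): strictly positive here
print(IXZ, IXY, IXZ >= IXY)      # data processing inequality: I(X;Z) >= I(X;Y) -> True
```

With these particular channels, I(X; Z) is roughly 0.53 bits while I(X; Y) is roughly 0.17 bits, so the second channel strictly loses information about X.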
From the symmetry of the Markov chain (X → Z → Y implies Y → Z → X), it also follows that

    I(Y; Z) ≥ I(Y; X) = I(X; Y)

This theorem demonstrates that no processing of Z, deterministic or random, can increase the information that Z contains about X.

Corollary 2.4.3. In particular, if Y = f(Z), we have I(X; Z) ≥ I(X; f(Z)).

Proof. X → Z → f(Z) forms a Markov chain.

Corollary 2.4.4. If X → Z → Y, then I(X; Z|Y) ≤ I(X; Z).

Proof. Using the fact that I(X; Y) ≥ 0, the corollary follows immediately from (2.22).

Expressing the data processing inequality in terms of entropies, we have

    H(X) - H(X|Z) ≥ H(X) - H(X|Y)  ⟹  H(X|Z) ≤ H(X|Y)                                         (2.23)

The average uncertainty H(X|Z) about the input of a channel given the output is called the equivocation on the channel, and thus the above inequality yields the intuitively satisfying result that this uncertainty or equivocation can never decrease as we go further from the input on a sequence of cascaded channels.

Example 2.4.1. Suppose that the random vector [X, Y, Z] is equally likely to take on values in {[000], [010], [100], [101]}. Then

    H(X) = H_2(1/2) = 1 bit

Note that P_{Y|X}(0|1) = 1, so that H(Y|X = 1) = 0. Similarly, P_{Y|X}(0|0) = 1/2, so that

    H(Y|X = 0) = H_2(1/2) = 1 bit

Thus, H(Y|X) = (1/2) × 1 = 1/2 bit. Also,

    H(Z|XY) = \sum_{x,y} P(x, y) H(Z|X = x, Y = y) = (1/4)(0) + (1/4)(0) + (1/2)(1) = 1/2 bit

and H(XYZ) = H(X) + H(Y|X) + H(Z|XY) = 1 + 1/2 + 1/2 = 2 bits.

Because P_Y(1) = 1/4 and P_Y(0) = 3/4, we have

    H(Y) = H_2(1/4) = 0.811 bits

We see that H(Y|X) = 1/2 < H(Y). However, H(Y|X = 0) = 1 > H(Y). Furthermore,

    I(X; Y) = H(Y) - H(Y|X) = 0.811 - 0.5 = 0.311 bits

In words, the first component of the random vector [X, Y, Z] gives 0.311 bits of information about the second component, and vice versa.
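The numbers in Example 2.4.1 are easy to reproduce by brute force; the short sketch below uses Python purely as a calculator for the example's joint PMF.

```python
import math
from collections import defaultdict

def H(P):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

# The four equally likely outcomes of [X, Y, Z] in Example 2.4.1.
Pxyz = {(0, 0, 0): 0.25, (0, 1, 0): 0.25, (1, 0, 0): 0.25, (1, 0, 1): 0.25}

Px, Py, Pxy = defaultdict(float), defaultdict(float), defaultdict(float)
for (x, y, z), p in Pxyz.items():
    Px[x] += p; Py[y] += p; Pxy[(x, y)] += p

print(H(Px))                     # H(X)   = 1 bit
print(H(Pxy) - H(Px))            # H(Y|X) = 0.5 bit
print(H(Py))                     # H(Y)   = 0.811 bits
print(H(Py) - (H(Pxy) - H(Px)))  # I(X;Y) = H(Y) - H(Y|X) = 0.311 bits
print(H(Pxyz))                   # H(XYZ) = 2 bits
```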