Proposition 2.21. The mutual information between a random variable X and itself conditioning on a random variable Z is equal to the conditional entropy of X given Z, i.e., I(X; X|Z) = H(X|Z).

Proposition 2.22.

    I(X; Y|Z) = H(X|Z) − H(X|Y, Z),                  (2.65)
    I(X; Y|Z) = H(Y|Z) − H(Y|X, Z),                  (2.66)

and

    I(X; Y|Z) = H(X|Z) + H(Y|Z) − H(X, Y|Z),         (2.67)

provided that all the conditional entropies are finite.

Remark. All Shannon's information measures are finite if the random variables involved have finite alphabets. Therefore, Propositions 2.19 and 2.22 apply provided that all the random variables therein have finite alphabets.

To conclude this section, we show that all Shannon's information measures are special cases of conditional mutual information. Let Φ be a degenerate random variable, i.e., Φ takes a constant value with probability 1. Consider the mutual information I(X; Y|Z). When X = Y and Z = Φ, I(X; Y|Z) becomes the entropy H(X). When X = Y, I(X; Y|Z) becomes the conditional entropy H(X|Z). When Z = Φ, I(X; Y|Z) becomes the mutual information I(X; Y). Thus all Shannon's information measures are special cases of conditional mutual information.

2.3 Continuity of Shannon's Information Measures for Fixed Finite Alphabets

In this section, we prove that for fixed finite alphabets, all Shannon's information measures are continuous functionals of the joint distribution of the random variables involved. To formulate the notion of continuity, we first introduce the variational distance (also referred to as the L1 distance in mathematics) as a distance measure between two probability distributions on a common alphabet.

Definition 2.23. Let p and q be two probability distributions on a common alphabet X. The variational distance between p and q is defined as

    V(p, q) = Σ_{x∈X} |p(x) − q(x)|.                 (2.68)
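For concreteness, here is a minimal Python sketch of Definition 2.23. The function name `variational_distance` and the dict-based representation of a pmf are illustrative choices, not anything prescribed by the text.

```python
def variational_distance(p, q):
    """Variational (L1) distance V(p, q) of (2.68), for two pmfs on a
    common alphabet, each represented as a dict {symbol: probability}."""
    alphabet = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in alphabet)

# Example: two distributions on the alphabet {0, 1, 2}.
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.4, 1: 0.4, 2: 0.2}
print(variational_distance(p, q))  # 0.1 + 0.1 + 0.0 = 0.2
```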
For a fixed finite alphabet X, let P_X be the set of all distributions on X. Then according to (2.13), the entropy of a distribution p on an alphabet X is defined as

    H(p) = − Σ_{x∈S_p} p(x) log p(x),                (2.69)

where S_p denotes the support of p and S_p ⊂ X. In order for H(p) to be continuous with respect to convergence in variational distance (i.e., L1-convergence) at a particular distribution p ∈ P_X, for any ε > 0, there exists δ > 0 such that

    |H(p) − H(q)| < ε                                (2.70)

for all q ∈ P_X satisfying

    V(p, q) < δ,                                     (2.71)

or equivalently,

    lim_{p′→p} H(p′) = H(lim_{p′→p} p′) = H(p),      (2.72)

where the convergence p′ → p is in variational distance.

Since a log a → 0 as a → 0, we define a function l : [0, ∞) → ℝ by

    l(a) = { a log a,  if a > 0
           { 0,        if a = 0,                     (2.73)

i.e., l(a) is a continuous extension of a log a. Then (2.69) can be rewritten as

    H(p) = − Σ_{x∈X} l(p(x)),                        (2.74)

where the summation above is over all x in X instead of S_p. Upon defining a function l_x : P_X → ℝ for all x ∈ X by

    l_x(p) = l(p(x)),                                (2.75)

(2.74) becomes

    H(p) = − Σ_{x∈X} l_x(p).                         (2.76)

Evidently, l_x(p) is continuous in p (with respect to convergence in variational distance). Since the summation in (2.76) involves a finite number of terms, we conclude that H(p) is a continuous functional of p.
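As a numerical illustration of how the continuous extension (2.73) makes H(p) behave, the following Python sketch implements l(a) and (2.74); log base 2 is assumed for concreteness, and the function names are illustrative. A small perturbation of p in variational distance produces only a small change in H(p), even when the support changes.

```python
import math

def l(a):
    """Continuous extension of a*log(a), as in (2.73); log base 2 assumed."""
    return a * math.log2(a) if a > 0 else 0.0

def entropy(p):
    """H(p) computed as in (2.74), summing over the whole alphabet;
    p is a list of probabilities, with zeros permitted."""
    return -sum(l(a) for a in p)

# A small perturbation in variational distance perturbs H(p) only slightly,
# even though the support of q is strictly larger than that of p.
p = [0.5, 0.25, 0.25, 0.0]
q = [0.499, 0.25, 0.25, 0.001]   # V(p, q) = 0.002
print(entropy(p))  # 1.5
print(entropy(q))  # approximately 1.5104
```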
We now proceed to prove the continuity of conditional mutual information, which covers all cases of Shannon's information measures. Consider I(X; Y|Z) and let p_{XYZ} be the joint distribution of X, Y, and Z, where the alphabets X, Y, and Z are assumed to be finite. From (2.47) and (2.67), we obtain

    I(X; Y|Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z).   (2.77)

Note that each term on the right-hand side above is the unconditional entropy of the corresponding marginal distribution. Then (2.77) can be rewritten as

    I_{X;Y|Z}(p_{XYZ}) = H(p_{XZ}) + H(p_{YZ}) − H(p_{XYZ}) − H(p_Z),   (2.78)

where we have used I_{X;Y|Z}(p_{XYZ}) to denote I(X; Y|Z). It follows that

    lim_{p′_{XYZ}→p_{XYZ}} I_{X;Y|Z}(p′_{XYZ})
      = lim_{p′_{XYZ}→p_{XYZ}} [H(p′_{XZ}) + H(p′_{YZ}) − H(p′_{XYZ}) − H(p′_Z)]   (2.79)
      = lim_{p′_{XYZ}→p_{XYZ}} H(p′_{XZ}) + lim_{p′_{XYZ}→p_{XYZ}} H(p′_{YZ})
        − lim_{p′_{XYZ}→p_{XYZ}} H(p′_{XYZ}) − lim_{p′_{XYZ}→p_{XYZ}} H(p′_Z).     (2.80)

It can readily be proved, for example, that

    lim_{p′_{XYZ}→p_{XYZ}} p′_{XZ} = p_{XZ},         (2.81)

so that

    lim_{p′_{XYZ}→p_{XYZ}} H(p′_{XZ}) = H(lim_{p′_{XYZ}→p_{XYZ}} p′_{XZ}) = H(p_{XZ})   (2.82)

by the continuity of H(·) when the alphabets involved are fixed and finite. The details are left as an exercise. Hence, we conclude that

    lim_{p′_{XYZ}→p_{XYZ}} I_{X;Y|Z}(p′_{XYZ})
      = H(p_{XZ}) + H(p_{YZ}) − H(p_{XYZ}) − H(p_Z)  (2.83)
      = I_{X;Y|Z}(p_{XYZ}),                          (2.84)

i.e., I_{X;Y|Z}(p_{XYZ}) is a continuous functional of p_{XYZ}.
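The identity (2.78) also gives a direct way to compute I(X; Y|Z) from a joint distribution. Below is a minimal Python sketch under illustrative assumptions: log base 2, the joint pmf represented as a dict keyed by (x, y, z) triples, and hypothetical function names.

```python
import math
from itertools import product

def entropy(p):
    """Entropy of a pmf {outcome: probability}; log base 2 assumed."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) computed via (2.78):
    H(p_XZ) + H(p_YZ) - H(p_XYZ) - H(p_Z)."""
    p_xz, p_yz, p_z = {}, {}, {}
    for (x, y, z), v in p_xyz.items():
        p_xz[(x, z)] = p_xz.get((x, z), 0.0) + v
        p_yz[(y, z)] = p_yz.get((y, z), 0.0) + v
        p_z[z] = p_z.get(z, 0.0) + v
    return entropy(p_xz) + entropy(p_yz) - entropy(p_xyz) - entropy(p_z)

# Example: X and Y independent uniform bits, Z = X XOR Y.
# Then I(X;Y|Z) = 1 bit, even though I(X;Y) = 0.
p = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}
print(conditional_mutual_information(p))  # 1.0
```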
Since conditional mutual information covers all cases of Shannon's information measures, we have proved that all Shannon's information measures are continuous with respect to convergence in variational distance under the assumption that the alphabets are fixed and finite. It is not difficult to show that under this assumption, convergence in variational distance is equivalent to L2-convergence, i.e., convergence in Euclidean distance (see Problem 8). It follows that Shannon's information measures are also continuous with respect to L2-convergence. The variational distance, however, is more often used as a distance measure between two probability distributions because it can be directly related to the informational divergence to be discussed in Section 2.5.

The continuity of Shannon's information measures we have proved in this section is rather restrictive and needs to be applied with caution. In fact, if the alphabets are not fixed, Shannon's information measures are everywhere discontinuous with respect to convergence in a number of commonly used distance measures. We refer the reader to Problems 28–31 for a discussion of these issues.

2.4 Chain Rules

In this section, we present a collection of information identities known as the chain rules, which are often used in information theory.

Proposition 2.24 (Chain Rule for Entropy).

    H(X_1, X_2, ···, X_n) = Σ_{i=1}^n H(X_i | X_1, ···, X_{i−1}).   (2.85)

Proof. The chain rule for n = 2 has been proved in Proposition 2.16. We prove the chain rule by induction on n. Assume (2.85) is true for n = m, where m ≥ 2. Then

    H(X_1, ···, X_m, X_{m+1})
      = H(X_1, ···, X_m) + H(X_{m+1} | X_1, ···, X_m)                        (2.86)
      = Σ_{i=1}^m H(X_i | X_1, ···, X_{i−1}) + H(X_{m+1} | X_1, ···, X_m)    (2.87)
      = Σ_{i=1}^{m+1} H(X_i | X_1, ···, X_{i−1}),                            (2.88)

where in (2.86) we have used (2.47) by letting X = (X_1, ···, X_m) and Y = X_{m+1}, and in (2.87) we have used (2.85) for n = m. This proves the chain rule for entropy.

The chain rule for entropy has the following conditional version.

Proposition 2.25 (Chain Rule for Conditional Entropy).

    H(X_1, X_2, ···, X_n | Y) = Σ_{i=1}^n H(X_i | X_1, ···, X_{i−1}, Y).   (2.89)

Proof. The proposition can be proved by considering

    H(X_1, X_2, ···, X_n | Y)
      = H(X_1, X_2, ···, X_n, Y) − H(Y)                                (2.90)
      = H((X_1, Y), X_2, ···, X_n) − H(Y)                              (2.91)
      = H(X_1, Y) + Σ_{i=2}^n H(X_i | X_1, ···, X_{i−1}, Y) − H(Y)     (2.92)
      = H(X_1 | Y) + Σ_{i=2}^n H(X_i | X_1, ···, X_{i−1}, Y)           (2.93)
      = Σ_{i=1}^n H(X_i | X_1, ···, X_{i−1}, Y),                       (2.94)

where (2.90) and (2.93) follow from Proposition 2.16, while (2.92) follows from Proposition 2.24.
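As a sanity check on these identities, the following Python sketch (log base 2 assumed; the helper names are illustrative) verifies the chain rule (2.85) for n = 3 on a randomly generated joint pmf, computing each conditional entropy from first principles as Σ_y p(y) H(X_i | context = y); the same pattern extends directly to (2.89).

```python
import math
import random
from itertools import product

def entropy(p):
    """Entropy of a pmf {outcome: probability}; log base 2 assumed."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def marginal(p, coords):
    """Marginal pmf of the coordinates in `coords` of a joint pmf."""
    out = {}
    for outcome, v in p.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + v
    return out

def conditional_entropy(p, cond, target):
    """H(X_target | X_cond) = sum over contexts c of p(c) * H(target | c)."""
    p_cond = marginal(p, cond)
    total = 0.0
    for c, pc in p_cond.items():
        cond_pmf = {}
        for outcome, v in p.items():
            if tuple(outcome[i] for i in cond) == c:
                key = tuple(outcome[i] for i in target)
                cond_pmf[key] = cond_pmf.get(key, 0.0) + v / pc
        total += pc * entropy(cond_pmf)
    return total

# A random joint pmf of (X1, X2, X3) over binary alphabets.
random.seed(0)
w = [random.random() for _ in range(8)]
p = {xs: wi / sum(w) for xs, wi in zip(product([0, 1], repeat=3), w)}

# Chain rule (2.85): H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2).
lhs = entropy(p)
rhs = (entropy(marginal(p, (0,)))
       + conditional_entropy(p, (0,), (1,))
       + conditional_entropy(p, (0, 1), (2,)))
print(abs(lhs - rhs) < 1e-12)  # True
```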
Alternatively, Proposition 2.25 can be proved by considering

    H(X_1, X_2, ···, X_n | Y)
      = Σ_y p(y) H(X_1, X_2, ···, X_n | Y = y)                     (2.95)
      = Σ_y p(y) Σ_{i=1}^n H(X_i | X_1, ···, X_{i−1}, Y = y)       (2.96)
      = Σ_{i=1}^n Σ_y p(y) H(X_i | X_1, ···, X_{i−1}, Y = y)       (2.97)
      = Σ_{i=1}^n H(X_i | X_1, ···, X_{i−1}, Y),                   (2.98)

where (2.95) and (2.98) follow from (2.43) and (2.45), respectively, and (2.96) follows from an application of Proposition 2.24 to the joint distribution of X_1, X_2, ···, X_n conditioning on {Y = y}. This proof offers an explanation of the observation that (2.89) can be obtained directly from (2.85) by conditioning on Y in every term.

Proposition 2.26 (Chain Rule for Mutual Information).

    I(X_1, X_2, ···, X_n; Y) = Σ_{i=1}^n I(X_i; Y | X_1, ···, X_{i−1}).   (2.99)

Proof. Consider

    I(X_1, X_2, ···, X_n; Y)
      = H(X_1, X_2, ···, X_n) − H(X_1, X_2, ···, X_n | Y)                        (2.100)
      = Σ_{i=1}^n [H(X_i | X_1, ···, X_{i−1}) − H(X_i | X_1, ···, X_{i−1}, Y)]   (2.101)
      = Σ_{i=1}^n I(X_i; Y | X_1, ···, X_{i−1}),                                 (2.102)

where in (2.101) we have invoked both Propositions 2.24 and 2.25. The chain rule for mutual information is proved.

Proposition 2.27 (Chain Rule for Conditional Mutual Information). For random variables X_1, X_2, ···, X_n, Y, and Z,

    I(X_1, X_2, ···, X_n; Y | Z) = Σ_{i=1}^n I(X_i; Y | X_1, ···, X_{i−1}, Z).   (2.103)
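To illustrate Proposition 2.26 concretely, here is a hedged Python sketch (log base 2 assumed; helper names are illustrative) that verifies (2.99) for n = 2 on a small joint distribution, expanding each mutual information into joint entropies via (2.77) and (2.78).

```python
import math
from itertools import product

def entropy(p):
    """Entropy of a pmf {outcome: probability}; log base 2 assumed."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def marginal(p, coords):
    """Marginal pmf of the coordinates in `coords` of a joint pmf."""
    out = {}
    for outcome, v in p.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + v
    return out

# Joint pmf of (X1, X2, Y): X1, X2 independent uniform bits, Y = X1 AND X2.
p = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

# Left side of (2.99): I(X1,X2;Y) = H(X1,X2) + H(Y) - H(X1,X2,Y).
lhs = entropy(marginal(p, (0, 1))) + entropy(marginal(p, (2,))) - entropy(p)

# Right side: I(X1;Y) + I(X2;Y|X1), each expanded into joint entropies
# (the conditional term uses the identity (2.77)/(2.78)).
i1 = (entropy(marginal(p, (0,))) + entropy(marginal(p, (2,)))
      - entropy(marginal(p, (0, 2))))
i2 = (entropy(marginal(p, (0, 1))) + entropy(marginal(p, (0, 2)))
      - entropy(p) - entropy(marginal(p, (0,))))
print(lhs, i1 + i2)  # both approximately 0.8113
print(abs(lhs - (i1 + i2)) < 1e-12)  # True
```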