For more than two random variables, we distinguish between two types of independence.

Definition 2.2 (Mutual Independence). For n ≥ 3, random variables X1, X2, ···, Xn are mutually independent if

    p(x1, x2, ···, xn) = p(x1)p(x2)··· p(xn)    (2.2)

for all x1, x2, ···, xn.

Definition 2.3 (Pairwise Independence). For n ≥ 3, random variables X1, X2, ···, Xn are pairwise independent if Xi and Xj are independent for all 1 ≤ i < j ≤ n.

Note that mutual independence implies pairwise independence. We leave it as an exercise for the reader to show that the converse is not true.

Definition 2.4 (Conditional Independence). For random variables X, Y, and Z, X is independent of Z conditioning on Y, denoted by X ⊥ Z | Y, if

    p(x, y, z)p(y) = p(x, y)p(y, z)    (2.3)

for all x, y, and z, or equivalently,

    p(x, y, z) = { p(x, y)p(y, z)/p(y) = p(x, y)p(z|y)   if p(y) > 0
                 { 0                                     otherwise.    (2.4)

The first definition of conditional independence above is sometimes more convenient to use because it is not necessary to distinguish between the cases p(y) > 0 and p(y) = 0. However, the physical meaning of conditional independence is more explicit in the second definition.

Proposition 2.5. For random variables X, Y, and Z, X ⊥ Z | Y if and only if

    p(x, y, z) = a(x, y)b(y, z)    (2.5)

for all x, y, and z such that p(y) > 0.

Proof. The 'only if' part follows immediately from the definition of conditional independence in (2.4), so we will only prove the 'if' part. Assume

    p(x, y, z) = a(x, y)b(y, z)    (2.6)

for all x, y, and z such that p(y) > 0. Then for such x, y, and z, we have

    p(x, y) = ∑_z p(x, y, z) = ∑_z a(x, y)b(y, z) = a(x, y) ∑_z b(y, z)    (2.7)
and

    p(y, z) = ∑_x p(x, y, z) = ∑_x a(x, y)b(y, z) = b(y, z) ∑_x a(x, y).    (2.8)

Furthermore,

    p(y) = ∑_z p(y, z) = (∑_x a(x, y))(∑_z b(y, z)) > 0.    (2.9)

Therefore,

    p(x, y)p(y, z)/p(y) = [a(x, y) ∑_z b(y, z)][b(y, z) ∑_x a(x, y)] / [(∑_x a(x, y))(∑_z b(y, z))]    (2.10)
                        = a(x, y)b(y, z)    (2.11)
                        = p(x, y, z).    (2.12)

For x, y, and z such that p(y) = 0, since

    0 ≤ p(x, y, z) ≤ p(y) = 0,    (2.13)

we have

    p(x, y, z) = 0.    (2.14)

Hence, X ⊥ Z | Y according to (2.4). The proof is accomplished.

Definition 2.6 (Markov Chain). For random variables X1, X2, ···, Xn, where n ≥ 3, X1 → X2 → ··· → Xn forms a Markov chain if

    p(x1, x2, ···, xn)p(x2)p(x3)··· p(xn−1) = p(x1, x2)p(x2, x3)··· p(xn−1, xn)    (2.15)

for all x1, x2, ···, xn, or equivalently,

    p(x1, x2, ···, xn) = { p(x1, x2)p(x3|x2)··· p(xn|xn−1)   if p(x2), p(x3), ···, p(xn−1) > 0
                         { 0                                 otherwise.    (2.16)

We note that X ⊥ Z | Y is equivalent to the Markov chain X → Y → Z.

Proposition 2.7. X1 → X2 → ··· → Xn forms a Markov chain if and only if Xn → Xn−1 → ··· → X1 forms a Markov chain.
Proof. This follows directly from the symmetry in the definition of a Markov chain in (2.15).

In the following, we state two basic properties of a Markov chain. The proofs are left as an exercise.

Proposition 2.8. X1 → X2 → ··· → Xn forms a Markov chain if and only if

    X1 → X2 → X3
    (X1, X2) → X3 → X4
        ⋮
    (X1, X2, ···, Xn−2) → Xn−1 → Xn    (2.17)

form Markov chains.

Proposition 2.9. X1 → X2 → ··· → Xn forms a Markov chain if and only if

    p(x1, x2, ···, xn) = f1(x1, x2)f2(x2, x3)··· fn−1(xn−1, xn)    (2.18)

for all x1, x2, ···, xn such that p(x2), p(x3), ···, p(xn−1) > 0.

Note that Proposition 2.9 is a generalization of Proposition 2.5. From Proposition 2.9, one can prove the following important property of a Markov chain. Again, the details are left as an exercise.

Proposition 2.10 (Markov Subchains). Let Nn = {1, 2, ···, n} and let X1 → X2 → ··· → Xn form a Markov chain. For any subset α of Nn, denote (Xi, i ∈ α) by Xα. Then for any disjoint subsets α1, α2, ···, αm of Nn such that

    k1 < k2 < ··· < km    (2.19)

for all kj ∈ αj, j = 1, 2, ···, m,

    Xα1 → Xα2 → ··· → Xαm    (2.20)

forms a Markov chain. That is, a subchain of X1 → X2 → ··· → Xn is also a Markov chain.

Example 2.11. Let X1 → X2 → ··· → X10 form a Markov chain and α1 = {1, 2}, α2 = {4}, α3 = {6, 8}, and α4 = {10} be subsets of N10. Then Proposition 2.10 says that

    (X1, X2) → X4 → (X6, X8) → X10    (2.21)

also forms a Markov chain.
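Propositions 2.7, 2.8, and 2.10 lend themselves to a quick numerical sanity check. The following is a minimal sketch in Python (standard library only); the particular initial distribution, the transition probabilities, and the helper names marginal and is_markov are arbitrary choices made for illustration, not anything prescribed by the text. It builds the joint distribution of a chain X1 → X2 → X3 → X4 and tests condition (2.15) for the chain itself, the reversed chain, the chain (X1, X2) → X3 → X4, and the subchain X1 → X2 → X4. Exact rational arithmetic (Fraction) avoids spurious failures due to floating-point rounding.

from itertools import product
from fractions import Fraction as F

# A concrete chain X1 -> X2 -> X3 -> X4 on binary alphabets: an arbitrary
# initial distribution p(x1) and strictly positive transition matrices.
p1  = {0: F(1, 3), 1: F(2, 3)}
t12 = {0: {0: F(1, 4), 1: F(3, 4)}, 1: {0: F(1, 2), 1: F(1, 2)}}
t23 = {0: {0: F(2, 5), 1: F(3, 5)}, 1: {0: F(1, 5), 1: F(4, 5)}}
t34 = {0: {0: F(3, 7), 1: F(4, 7)}, 1: {0: F(1, 7), 1: F(6, 7)}}

# Joint distribution p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3).
joint = {(a, b, c, d): p1[a] * t12[a][b] * t23[b][c] * t34[c][d]
         for a, b, c, d in product((0, 1), repeat=4)}

def marginal(idx):
    """Marginal distribution of the coordinates listed in idx (0-based)."""
    m = {}
    for x, pr in joint.items():
        key = tuple(x[i] for i in idx)
        m[key] = m.get(key, F(0)) + pr
    return m

def is_markov(groups):
    """Check condition (2.15) for the given groups of coordinates,
    e.g. groups = [[0], [1], [3]] tests whether X1 -> X2 -> X4 is a Markov chain."""
    margs = [marginal(g) for g in groups]                 # marginals of the individual groups
    pairs = [marginal(groups[k] + groups[k + 1])          # marginals of consecutive pairs of groups
             for k in range(len(groups) - 1)]
    full = marginal([i for g in groups for i in g])       # marginal of all involved coordinates
    sizes = [len(g) for g in groups]
    for x in product((0, 1), repeat=sum(sizes)):
        parts, i = [], 0
        for s in sizes:                                   # split x according to the group sizes
            parts.append(x[i:i + s])
            i += s
        lhs = full[x]
        for k in range(1, len(groups) - 1):               # multiply in the intermediate marginals
            lhs *= margs[k][parts[k]]
        rhs = F(1)
        for k in range(len(groups) - 1):                  # multiply the consecutive pair marginals
            rhs *= pairs[k][parts[k] + parts[k + 1]]
        if lhs != rhs:
            return False
    return True

print(is_markov([[0], [1], [2], [3]]))    # X1 -> X2 -> X3 -> X4: True
print(is_markov([[3], [2], [1], [0]]))    # reversed chain (Proposition 2.7): True
print(is_markov([[0, 1], [2], [3]]))      # (X1, X2) -> X3 -> X4 (Proposition 2.8): True
print(is_markov([[0], [1], [3]]))         # subchain X1 -> X2 -> X4 (Proposition 2.10): True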
We have been very careful in handling probability distributions with zero probability masses. In the rest of the section, we show that such distributions are very delicate in general. We first prove the following property of a strictly positive probability distribution involving four random variables.¹

Proposition 2.12. Let X1, X2, X3, and X4 be random variables such that p(x1, x2, x3, x4) is strictly positive. Then

    X1 ⊥ X4 | (X2, X3)  and  X1 ⊥ X3 | (X2, X4)  ⇒  X1 ⊥ (X3, X4) | X2.    (2.22)

Proof. If X1 ⊥ X4 | (X2, X3), then

    p(x1, x2, x3, x4) = p(x1, x2, x3)p(x2, x3, x4)/p(x2, x3).    (2.23)

On the other hand, if X1 ⊥ X3 | (X2, X4), then

    p(x1, x2, x3, x4) = p(x1, x2, x4)p(x2, x3, x4)/p(x2, x4).    (2.24)

Equating (2.23) and (2.24), we have

    p(x1, x2, x3) = p(x2, x3)p(x1, x2, x4)/p(x2, x4).    (2.25)

Therefore,

    p(x1, x2) = ∑_{x3} p(x1, x2, x3)    (2.26)
              = ∑_{x3} p(x2, x3)p(x1, x2, x4)/p(x2, x4)    (2.27)
              = p(x2)p(x1, x2, x4)/p(x2, x4)    (2.28)

or

    p(x1, x2, x4)/p(x2, x4) = p(x1, x2)/p(x2).    (2.29)

Hence from (2.24),

    p(x1, x2, x3, x4) = p(x1, x2, x4)p(x2, x3, x4)/p(x2, x4) = p(x1, x2)p(x2, x3, x4)/p(x2)    (2.30)

for all x1, x2, x3, and x4, i.e., X1 ⊥ (X3, X4) | X2.

¹ Proposition 2.12 is called the intersection axiom in Bayesian networks. See [287].
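The role of strict positivity in Proposition 2.12 can also be examined numerically. The next paragraph gives a construction for which p is not strictly positive, both hypotheses of (2.22) hold, and the conclusion fails; the following minimal Python sketch (standard library only; the helper names marginal and cond_indep are chosen for illustration) checks exactly this, testing each conditional independence through criterion (2.3) applied to groups of coordinates.

from itertools import product
from fractions import Fraction

# Y and Z are independent fair bits; X1 = Y, X2 = Z, X3 = X4 = (Y, Z).
# The joint distribution is zero whenever x3 != (x1, x2) or x4 != (x1, x2).
joint = {(y, z, (y, z), (y, z)): Fraction(1, 4)
         for y, z in product((0, 1), repeat=2)}

# Alphabet of each coordinate, read off from the support of the joint.
alphabets = [set(x[i] for x in joint) for i in range(4)]

def marginal(idx):
    """Marginal distribution over the coordinates listed in idx (0-based)."""
    m = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        m[key] = m.get(key, Fraction(0)) + prob
    return m

def cond_indep(A, B, C):
    """Test X_A ⊥ X_C | X_B via criterion (2.3): p(a,b,c) p(b) = p(a,b) p(b,c) for all a, b, c."""
    pABC, pAB, pBC, pB = marginal(A + B + C), marginal(A + B), marginal(B + C), marginal(B)
    zero = Fraction(0)
    for combo in product(*(alphabets[i] for i in A + B + C)):
        a = combo[:len(A)]
        b = combo[len(A):len(A) + len(B)]
        c = combo[len(A) + len(B):]
        lhs = pABC.get(a + b + c, zero) * pB.get(b, zero)
        rhs = pAB.get(a + b, zero) * pBC.get(b + c, zero)
        if lhs != rhs:
            return False
    return True

print(cond_indep([0], [1, 2], [3]))   # X1 ⊥ X4 | (X2, X3): True
print(cond_indep([0], [1, 3], [2]))   # X1 ⊥ X3 | (X2, X4): True
print(cond_indep([0], [1], [2, 3]))   # X1 ⊥ (X3, X4) | X2: False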
If p(x1, x2, x3, x4) = 0 for some x1, x2, x3, and x4, i.e., p is not strictly positive, the arguments in the above proof are not valid. In fact, the proposition may not hold in this case. For instance, let X1 = Y, X2 = Z, and X3 = X4 = (Y, Z), where Y and Z are independent random variables. Then X1 ⊥ X4 | (X2, X3) and X1 ⊥ X3 | (X2, X4), but X1 ⊥ (X3, X4) | X2 does not hold. Note that for this construction, p is not strictly positive because p(x1, x2, x3, x4) = 0 if x3 ≠ (x1, x2) or x4 ≠ (x1, x2).

The above example is somewhat counter-intuitive because it appears that Proposition 2.12 should hold for all probability distributions via a continuity argument² which would go like this. For any distribution p, let {pk} be a sequence of strictly positive distributions such that pk → p and pk satisfies (2.23) and (2.24) for all k, i.e.,

    pk(x1, x2, x3, x4)pk(x2, x3) = pk(x1, x2, x3)pk(x2, x3, x4)    (2.31)

and

    pk(x1, x2, x3, x4)pk(x2, x4) = pk(x1, x2, x4)pk(x2, x3, x4).    (2.32)

Then by the proposition, pk also satisfies (2.30), i.e.,

    pk(x1, x2, x3, x4)pk(x2) = pk(x1, x2)pk(x2, x3, x4).    (2.33)

Letting k → ∞, we have

    p(x1, x2, x3, x4)p(x2) = p(x1, x2)p(x2, x3, x4)    (2.34)

for all x1, x2, x3, and x4, i.e., X1 ⊥ (X3, X4) | X2. Such an argument would be valid if there always exists a sequence {pk} as prescribed. However, the existence of the distribution p(x1, x2, x3, x4) constructed immediately after Proposition 2.12 simply says that it is not always possible to find such a sequence {pk}.

Therefore, probability distributions which are not strictly positive can be very delicate. For strictly positive distributions, we see from Proposition 2.5 that their conditional independence structures are closely related to the factorization problem of such distributions, which has been investigated by Chan [59].

2.2 Shannon's Information Measures

We begin this section by introducing the entropy of a random variable. As we will see shortly, all Shannon's information measures can be expressed as linear combinations of entropies.

² See Section 2.3 for a more detailed discussion on continuous functionals.