2.2 Joint entropy and conditional entropy

We now extend the definition of the entropy of a single random variable to a pair of random variables. $(X, Y)$ can be considered to be a single vector-valued random variable.

Definition 2.2.1. The joint entropy $H(XY)$ of a pair of discrete random variables $(X, Y)$ with joint distribution $P(x, y)$ is defined as
\[
H(XY) \triangleq E[-\log P(X, Y)]
 = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P(x, y)\log P(x, y)
 = -\sum_{(x,y)\in S_{XY}} P(x, y)\log P(x, y). \qquad (2.4)
\]

Definition 2.2.2. The conditional entropy of the discrete random variable $X$, given that the event $Y = y$ occurs, is defined as
\[
H(X \mid Y = y) = -\sum_{x\in\mathcal{X}} P(x \mid y)\log P(x \mid y)
 = E[-\log P(X \mid Y) \mid Y = y]. \qquad (2.5)
\]

Definition 2.2.3. If $(X, Y) \sim P(x, y)$, then the conditional entropy of the discrete random variable $X$, given the discrete random variable $Y$, is defined as
\[
H(X \mid Y) = \sum_{y\in\mathcal{Y}} P(y) H(X \mid Y = y)
 = -\sum_{y\in\mathcal{Y}} P(y) \sum_{x\in\mathcal{X}} P(x \mid y)\log P(x \mid y)
 = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P(x, y)\log P(x \mid y)
 = E[-\log P(X \mid Y)]. \qquad (2.6)
\]
Notice that $H(Y \mid X) \neq H(X \mid Y)$ in general.

Theorem 2.2.1. $H(XY) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.

Proof.
\[
\begin{aligned}
H(XY) &= -\sum_{x}\sum_{y} P(x, y)\log P(x, y) \\
 &= -\sum_{x}\sum_{y} P(x, y)\,[\log P(x) + \log P(y \mid x)] \\
 &= -\sum_{x} P(x)\log P(x) - \sum_{x}\sum_{y} P(x, y)\log P(y \mid x) \\
 &= H(X) + H(Y \mid X). \qquad (2.7)
\end{aligned}
\]
The second equality of the theorem follows by interchanging the roles of $X$ and $Y$.

Corollary 2.2.2. $H(XY \mid Z) = H(X \mid Z) + H(Y \mid XZ)$.
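As a concrete illustration of these definitions (the numbers below are chosen here only for the example, with all logarithms to base 2), let $(X, Y)$ have the joint pmf
\[
P(0,0) = \tfrac{1}{2}, \quad P(0,1) = \tfrac{1}{4}, \quad P(1,1) = \tfrac{1}{4}, \quad P(1,0) = 0.
\]
The marginals are $P_X = (\tfrac{3}{4}, \tfrac{1}{4})$ and $P_Y = (\tfrac{1}{2}, \tfrac{1}{2})$, so
\[
H(X) = 2 - \tfrac{3}{4}\log 3 \approx 0.811 \text{ bits}, \qquad H(Y) = 1 \text{ bit}, \qquad
H(XY) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 \text{ bits}.
\]
Theorem 2.2.1 then gives $H(X \mid Y) = H(XY) - H(Y) = 0.5$ bits and $H(Y \mid X) = H(XY) - H(X) \approx 0.689$ bits, which also confirms that $H(X \mid Y) \neq H(Y \mid X)$. Computing $H(X \mid Y)$ directly from Definition 2.2.3 gives the same value: $H(X \mid Y = 0) = 0$, $H(X \mid Y = 1) = 1$, and $H(X \mid Y) = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = 0.5$ bits.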
We now generalize the above theorem to more than two random variables.

Theorem 2.2.3 (Chain rule for entropy). Let $X_1, X_2, \ldots, X_N$ be discrete random variables drawn according to $P(x_1, x_2, \ldots, x_N)$. Then
\[
H(X_1, X_2, \ldots, X_N) = \sum_{n=1}^{N} H(X_n \mid X_1, X_2, \ldots, X_{n-1}).
\]

Proof.
\[
\begin{aligned}
H(X_1, X_2, \ldots, X_N) &= E[-\log P(X_1, X_2, \ldots, X_N)] \\
 &= E\Big[-\log \prod_{n=1}^{N} P(X_n \mid X_1, \ldots, X_{n-1})\Big] \\
 &= \sum_{n=1}^{N} E[-\log P(X_n \mid X_1, \ldots, X_{n-1})] \\
 &= \sum_{n=1}^{N} H(X_n \mid X_1, \ldots, X_{n-1}). \qquad (2.8)
\end{aligned}
\]

If the $X_n$ are independent of each other, then
\[
H(X_1, X_2, \ldots, X_N) = \sum_{n=1}^{N} H(X_n). \qquad (2.9)
\]
Similarly, we have
\[
H(X_1, X_2, \ldots, X_N \mid Y) = \sum_{n=1}^{N} H(X_n \mid X_1, \ldots, X_{n-1}, Y).
\]

2.3 Properties of the entropy function

Let $X$ be a discrete random variable with alphabet $\mathcal{X} = \{x_k, k = 1, 2, \ldots, K\}$. Denote the pmf of $X$ by $p_k = \Pr(X = x_k)$, $x_k \in \mathcal{X}$. Then the entropy $H(X)$ of $X$ can be written as
\[
H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log p_k,
\]
where $\mathbf{p} = (p_1, p_2, \ldots, p_K)$ is a $K$-vector of probabilities.

Property 2.3.1. $H(\mathbf{p}) \geq 0$ (non-negativity of entropy).

Proof. Since $0 \leq p_k \leq 1$, we have $\log p_k \leq 0$. Hence, $H(\mathbf{p}) \geq 0$.

Definition 2.3.1. A function $f(x)$ is said to be convex-$\cup$ over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and $0 \leq \lambda \leq 1$,
\[
f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2).
\]
A function $f$ is said to be strictly convex-$\cup$ if equality holds only when $\lambda = 0$ or $\lambda = 1$.
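The chain rule is easy to check on a small example; the three variables below are introduced here purely as an illustration, with logarithms to base 2. Let $X_1$ and $Z$ be independent fair bits, and set $X_2 = X_1 \oplus Z$ and $X_3 = X_1$. Applying (2.8),
\[
H(X_1, X_2, X_3) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2) = 1 + 1 + 0 = 2 \text{ bits},
\]
since given $X_1$ the value of $X_2$ is determined by the independent fair bit $Z$, while $X_3$ is completely determined by $X_1$. Each $X_n$ is marginally uniform, so $\sum_n H(X_n) = 3$ bits; the joint entropy falls short of this sum because the $X_n$ are not independent, so (2.9) does not apply.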
These notions extend to functions of a vector argument. Convex region: if for any two points $\bar{\alpha}$ and $\bar{\beta}$ in a region $D$, $\bar{\alpha} \in D$, $\bar{\beta} \in D$, we have
\[
\lambda\bar{\alpha} + (1 - \lambda)\bar{\beta} \in D, \qquad 0 \leq \lambda \leq 1,
\]
then $D$ is called a convex region. Convex function: if a function $f(\bar{x})$ defined on a convex region $D$ satisfies
\[
f(\lambda\bar{\alpha} + (1 - \lambda)\bar{\beta}) \leq \lambda f(\bar{\alpha}) + (1 - \lambda) f(\bar{\beta}), \qquad \forall \bar{\alpha}, \bar{\beta} \in D, \; 0 \leq \lambda \leq 1,
\]
then $f(\bar{x})$ is called a convex-$\cup$ function; if the inequality is reversed, $f(\bar{x})$ is called a convex-$\cap$ function.

Theorem 2.3.1. If the function $f$ has a second derivative which is non-negative (resp. positive) everywhere ($f''(x) \geq 0$), then the function is convex-$\cup$ (resp. strictly convex-$\cup$).

Theorem 2.3.2 (Jensen's inequality). If $f$ is a convex-$\cup$ function and $X$ is a random variable, then
\[
E[f(X)] \geq f(E[X]).
\]

Proof. For a two-mass-point distribution, the inequality becomes
\[
p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2),
\]
which follows directly from the definition of a convex-$\cup$ function. Suppose that the theorem is true for distributions with $k - 1$ mass points. Then writing $p_i' = \frac{p_i}{1 - p_k}$ for $i = 1, 2, \ldots, k - 1$, we have
\[
\begin{aligned}
\sum_{i=1}^{k} p_i f(x_i) &= p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \\
 &\geq p_k f(x_k) + (1 - p_k) f\Big(\sum_{i=1}^{k-1} p_i' x_i\Big) \\
 &\geq f\Big(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i\Big) \\
 &= f\Big(\sum_{i=1}^{k} p_i x_i\Big), \qquad (2.10)
\end{aligned}
\]
where the first inequality follows from the induction hypothesis and the second from the two-mass-point case.

Property 2.3.2. $H(\mathbf{p})$ is a convex-$\cap$ function of $\mathbf{p}$.

IT-inequality: For a positive real number $r$,
\[
\log r \leq (r - 1)\log e, \qquad (2.11)
\]
with equality if and only if (iff) $r = 1$.

Proof. Let $f(r) = \ln r - (r - 1)$. Then we have
\[
f'(r) = \frac{1}{r} - 1 \quad \text{and} \quad f''(r) = -\frac{1}{r^2} < 0.
\]
We can see that $f(r)$ is convex-$\cap$ on the interval $r > 0$. Moreover, its maximum value is zero, which is achieved at $r = 1$. Therefore,
\[
\ln r \leq r - 1,
\]
with equality iff $r = 1$. Equation (2.11) follows immediately.
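A quick numerical check of these results may help (the function and values below are chosen here only as an illustration, with logarithms to base 2). The function $f(x) = x^2$ has $f''(x) = 2 > 0$, so it is convex-$\cup$ by Theorem 2.3.1. If $X$ takes the values $0$ and $2$ each with probability $\tfrac{1}{2}$, then
\[
E[f(X)] = \tfrac{1}{2}(0)^2 + \tfrac{1}{2}(2)^2 = 2 \geq f(E[X]) = f(1) = 1,
\]
as Jensen's inequality requires. Similarly, the IT-inequality can be checked at $r = 2$ and $r = \tfrac{1}{2}$: $\log 2 = 1 \leq (2 - 1)\log e \approx 1.443$, and $\log \tfrac{1}{2} = -1 \leq (\tfrac{1}{2} - 1)\log e \approx -0.721$, with a strict gap in both cases since $r \neq 1$.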
Figure 2.2: The IT-inequality curve ($\ln r$ and $r - 1$ plotted against $r$; the two curves touch only at $r = 1$).

Theorem 2.3.3. If the discrete r.v. $X$ has $K$ possible values, then $H(X) \leq \log K$, with equality iff $P(x) = \frac{1}{K}$ for all $x \in \mathcal{X}$.

Proof.
\[
\begin{aligned}
H(X) - \log K &= -\sum_{x \in S_x} P(x)\log P(x) - \log K \\
 &= \sum_{x} P(x)\Big[\log \frac{1}{P(x)} - \log K\Big] \\
 &= \sum_{x} P(x)\log \frac{1}{K P(x)} \\
 &\leq \sum_{x} P(x)\Big[\frac{1}{K P(x)} - 1\Big]\log e \qquad \text{(by the IT-inequality)} \qquad (2.12) \\
 &= \Big[\sum_{x} \frac{1}{K} - \sum_{x} P(x)\Big]\log e \\
 &\leq \Big[\sum_{x \in \mathcal{X}} \frac{1}{K} - 1\Big]\log e = (1 - 1)\log e = 0,
\end{aligned}
\]
where equality holds in (2.12) iff $K P(x) = 1$ for all $x \in S_x$.
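For instance (an illustrative distribution chosen here, with logarithms to base 2), a random variable with $K = 3$ values and pmf $\mathbf{p} = (\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4})$ has
\[
H(X) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 \text{ bits} < \log 3 \approx 1.585 \text{ bits},
\]
and the bound $\log K$ is attained only by the uniform pmf $(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$.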
Theorem 2.3.4 (Conditioning reduces entropy). For any two discrete r.v.'s $X$ and $Y$,
\[
H(X \mid Y) \leq H(X),
\]
with equality iff $X$ and $Y$ are independent. (We can also use the relationship $I(X; Y) = H(X) - H(X \mid Y) \geq 0$ to obtain $H(X) \geq H(X \mid Y)$.)

Proof.
\[
\begin{aligned}
H(X \mid Y) - H(X) &= -\sum_{(x,y) \in S_{xy}} P(x, y)\log P(x \mid y) + \sum_{x \in S_x} P(x)\log P(x) \\
 &= \sum_{x,y} P(x, y)\log \frac{P(x)}{P(x \mid y)} \\
 &= \sum_{x,y} P(x, y)\log \frac{P(x)P(y)}{P(x, y)} \\
 &\leq \sum_{x,y} P(x, y)\Big(\frac{P(x)P(y)}{P(x, y)} - 1\Big)\log e \qquad \text{(by the IT-inequality)} \\
 &= \Big(\sum_{x,y} P(x)P(y) - \sum_{x,y} P(x, y)\Big)\log e \\
 &\leq \Big(\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P(x)P(y) - 1\Big)\log e = (1 - 1)\log e = 0.
\end{aligned}
\]
Note, however, that $H(X \mid Y = y)$ may exceed $H(X)$.

Theorem 2.3.5 (Independence bound on entropy). Let $X_1, \ldots, X_N$ be drawn according to $P(X_1, X_2, \ldots, X_N)$. Then
\[
H(X_1, X_2, \ldots, X_N) \leq \sum_{n=1}^{N} H(X_n),
\]
with equality iff the $X_n$ are independent.

Proof. By the chain rule for entropies and Theorem 2.3.4,
\[
H(X_1, X_2, \ldots, X_N) = \sum_{n=1}^{N} H(X_n \mid X_1, \ldots, X_{n-1}) \leq \sum_{n=1}^{N} H(X_n). \qquad (2.13)
\]

2.4 Relative entropy and mutual information

2.4.1 Relative entropy

The relative entropy is a measure of the "distance" between two distributions.

Definition 2.4.1. The relative entropy (or Kullback-Leibler distance) between two pmfs $P(x)$ and $Q(x)$ is defined as
\[
D(P \| Q) = \sum_{x \in \mathcal{X}} P(x)\log \frac{P(x)}{Q(x)} = E_P\Big[\log \frac{P(X)}{Q(X)}\Big].
\]
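For example (with the two pmfs below chosen here purely for illustration, and logarithms to base 2), let $P = (\tfrac{1}{2}, \tfrac{1}{2})$ and $Q = (\tfrac{3}{4}, \tfrac{1}{4})$ on a binary alphabet. Then
\[
D(P \| Q) = \tfrac{1}{2}\log \frac{1/2}{3/4} + \tfrac{1}{2}\log \frac{1/2}{1/4} \approx 0.208 \text{ bits}, \qquad
D(Q \| P) = \tfrac{3}{4}\log \frac{3/4}{1/2} + \tfrac{1}{4}\log \frac{1/4}{1/2} \approx 0.189 \text{ bits}.
\]
Both quantities are non-negative, but $D(P \| Q) \neq D(Q \| P)$, so the relative entropy is not symmetric and hence not a true distance.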