Definition 2.8.3. If X, Y have a joint pdf p(x, y), then the conditional differential entropy is
\[
h(X|Y) = -\int\!\!\int p(x,y)\log p(x|y)\,dx\,dy .
\]
Since p(x, y) = p(x|y)p(y), we obtain
\[
h(X,Y) = h(Y) + h(X|Y) = h(X) + h(Y|X). \tag{2.36}
\]

Theorem 2.8.1. Let X_1, X_2, \ldots, X_n have a multivariate normal distribution with mean \mathbf{m} and covariance matrix K. Then
\[
h(X_1, X_2, \ldots, X_n) = \tfrac{1}{2}\log\big((2\pi e)^n |K|\big) \ \text{bits}. \tag{2.37}
\]

Proof. The pdf of X_1, X_2, \ldots, X_n is
\[
p(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^n |K|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\mathbf{m})^T K^{-1}(\mathbf{x}-\mathbf{m})}.
\]
Then
\begin{align*}
h(X_1, X_2, \ldots, X_n)
&= -\int p(\mathbf{x})\Big[-\tfrac{1}{2}(\mathbf{x}-\mathbf{m})^T K^{-1}(\mathbf{x}-\mathbf{m}) - \ln\big((\sqrt{2\pi})^n |K|^{1/2}\big)\Big]\,d\mathbf{x} \\
&= \tfrac{1}{2}\, E\Big[\sum_i \sum_j (X_i - m_i)(K^{-1})_{ij}(X_j - m_j)\Big] + \tfrac{1}{2}\ln\big((2\pi)^n |K|\big) \\
&= \tfrac{1}{2} \sum_i \sum_j E\big[(X_i - m_i)(X_j - m_j)\big](K^{-1})_{ij} + \tfrac{1}{2}\ln\big((2\pi)^n |K|\big) \\
&= \tfrac{1}{2} \sum_j \sum_i K_{ji}(K^{-1})_{ij} + \tfrac{1}{2}\ln\big((2\pi)^n |K|\big) \\
&= \tfrac{1}{2} \sum_j (K K^{-1})_{jj} + \tfrac{1}{2}\ln\big((2\pi)^n |K|\big) \\
&= \tfrac{1}{2} \sum_j I_{jj} + \tfrac{1}{2}\ln\big((2\pi)^n |K|\big) \\
&= \tfrac{n}{2} + \tfrac{1}{2}\ln\big((2\pi)^n |K|\big) \\
&= \tfrac{1}{2}\ln\big((2\pi e)^n |K|\big) \ \text{nats} \\
&= \tfrac{1}{2}\log_2\big((2\pi e)^n |K|\big) \ \text{bits},
\end{align*}
where the symmetry of K (K_{ij} = K_{ji}) was used in the fourth step.

2.8.2 Relative entropy and mutual information

Definition 2.8.4. The relative entropy D(p\|q) between two densities p and q is defined by
\[
D(p\|q) = \int p \log\frac{p}{q}.
\]
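The closed form (2.37) is easy to sanity-check numerically. The following sketch is not part of the textbook; it assumes NumPy and SciPy are available, builds an arbitrary positive-definite covariance, and compares the formula with SciPy's multivariate-normal entropy, which is reported in nats.

# A small numerical sketch (not from the text): check Theorem 2.8.1,
# h(X_1,...,X_n) = 0.5*log2((2*pi*e)^n |K|) bits, against SciPy's
# multivariate-normal entropy (returned in nats).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)      # an arbitrary symmetric positive-definite covariance
m = rng.standard_normal(n)

# Closed form of Theorem 2.8.1, evaluated in bits and in nats.
h_bits = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))
h_nats = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))

h_scipy_nats = multivariate_normal(mean=m, cov=K).entropy()

print(h_nats, h_scipy_nats)               # should agree
print(h_bits, h_scipy_nats / np.log(2))   # nats-to-bits conversion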
Note that D(p\|q) is finite only if the support set of p is contained in the support set of q.

Definition 2.8.5. The mutual information I(X;Y) between two random variables with joint density p(x, y) is defined as
\[
I(X;Y) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy. \tag{2.38}
\]
It is clear that
\[
I(X;Y) = h(X) - h(X|Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X,Y) \tag{2.39}
\]
and I(X;Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big).

• The (average) conditional mutual information is given by
\[
I(X;Y|Z) = \int\!\!\int\!\!\int p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}\,dx\,dy\,dz.
\]

2.8.3 Properties

(1) D(p(x)\|q(x)) \ge 0, with equality iff p(x) = q(x) almost everywhere (a.e.).

Proof. Let S be the support set of p. Then
\[
-D(p\|q) = \int_S p \log\frac{q}{p}\,dx \le \int_S p\Big(\frac{q}{p}-1\Big)\log e\,dx = \Big(\int_S q(x)\,dx - 1\Big)\log e \le 0.
\]

Corollary 2.8.2. I(X;Y) \ge 0, with equality iff X and Y are independent.

(2) h(X|Y) \le h(X), with equality iff X and Y are independent.

(3) Chain rule for differential entropy:
\[
h(X_1, X_2, \ldots, X_N) = \sum_{n=1}^{N} h(X_n \mid X_1, X_2, \ldots, X_{n-1}).
\]

Proof. Follows directly from the definitions.

(4) Translation and scaling:
\[
h(X + c) = h(X), \qquad h(aX) = h(X) + \log|a|.
\]

Proof. Let Y = aX. Then p_Y(y) = \frac{1}{|a|}\, p_X\!\big(\tfrac{y}{a}\big), by the change-of-variables rule p_Y(y) = p_X(x)/|f'(x)| with x = y/a. Hence
\begin{align*}
h(aX) &= -\int p_Y(y)\log p_Y(y)\,dy \\
&= -\int \frac{1}{|a|}\, p_X\!\Big(\frac{y}{a}\Big)\log\Big[\frac{1}{|a|}\, p_X\!\Big(\frac{y}{a}\Big)\Big]\,dy \\
&= -\Big[\int \frac{1}{|a|}\, p_X\!\Big(\frac{y}{a}\Big)\log p_X\!\Big(\frac{y}{a}\Big)\,dy - \log|a| \int \frac{1}{|a|}\, p_X\!\Big(\frac{y}{a}\Big)\,dy\Big] \\
&= -\int p_X(x)\log p_X(x)\,dx + \log|a| \\
&= h(X) + \log|a|.
\end{align*}
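Relative entropy (Definition 2.8.4) and its nonnegativity (property (1)) can be illustrated with a short Monte Carlo experiment. The sketch below is not from the textbook; it assumes NumPy and SciPy, picks two arbitrary Gaussians for p and q, and compares a sample estimate of E_p[log(p/q)] with the standard closed-form Gaussian relative entropy.

# A minimal sketch (not from the text): estimate D(p||q) of Definition 2.8.4
# by Monte Carlo for two univariate Gaussians and check that it is nonnegative
# (property (1)) and matches the closed-form Gaussian relative entropy.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m1, s1 = 0.0, 1.0      # p = N(m1, s1^2)  (arbitrary choice)
m2, s2 = 1.0, 2.0      # q = N(m2, s2^2)  (arbitrary choice)

x = norm.rvs(loc=m1, scale=s1, size=200_000, random_state=rng)    # samples from p
kl_mc = np.mean(norm.logpdf(x, m1, s1) - norm.logpdf(x, m2, s2))  # E_p[log p/q], in nats

# Closed-form relative entropy between two Gaussians, in nats.
kl_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_mc, kl_exact)   # both nonnegative and close to each other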
Similarly, let \mathbf{X} be a random vector with joint pdf f(\mathbf{x}), and let \mathbf{c} be a column vector in \mathbb{R}^n. Then h(\mathbf{X} + \mathbf{c}) = h(\mathbf{X}). Let A be a nonsingular n \times n matrix. Then
\[
h(A\mathbf{X}) = h(\mathbf{X}) + \log|\det(A)|,
\]
where |\det(A)| is the absolute value of the determinant.

• In general, if Y = f(X), where f is a one-to-one correspondence between X and Y, then
\[
h(Y) = h(X) + E\Big[\log\Big|\frac{dy}{dx}\Big|\Big].
\]

(5) Data processing theorem: If X is the input to a channel, Z is the output, and Y is a transformation of Z (i.e., Y = f(Z)), then
\[
I(X;Z) \ge I(X;Y).
\]
Suppose further that Y is a reversible transformation of Z, so that z = f^{-1}(y). Then I(X;Y) \ge I(X;Z). Combining these two inequalities, we have I(X;Y) = I(X;Z). This implies that the average mutual information between two ensembles is invariant to any reversible transformation of one of the outcomes.

Figure 2.6: The data processing theorem: X is the channel input, Z the channel output, and Y = f(Z).

(6) Chain rule for mutual information:
\begin{align*}
I(X;Y,Z) &= I(X;Y) + I(X;Z|Y) \tag{2.40} \\
         &= I(X;Z) + I(X;Y|Z) \tag{2.41}
\end{align*}
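For jointly Gaussian variables every mutual information in (2.40) can be written as a combination of the differential entropies of Theorem 2.8.1, which gives a quick numerical check of the chain rule. The sketch below is not from the textbook and assumes only NumPy; the covariance matrix and the index bookkeeping are illustrative choices.

# A small numerical sketch (not from the text): verify the chain rule (2.40),
# I(X;Y,Z) = I(X;Y) + I(X;Z|Y), for jointly Gaussian (X, Y, Z) by expressing each
# term through Gaussian differential entropies h = 0.5*log((2*pi*e)^n |K|).
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
K = B @ B.T + 3 * np.eye(3)        # joint covariance of (X, Y, Z), positive definite

def h(idx):
    """Differential entropy (nats) of the Gaussian marginal on the given index set."""
    sub = K[np.ix_(idx, idx)]
    n = len(idx)
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(sub))

X, Y, Z = [0], [1], [2]
I_X_YZ = h(X) + h(Y + Z) - h(X + Y + Z)                      # I(X;Y,Z)
I_X_Y = h(X) + h(Y) - h(X + Y)                               # I(X;Y)
I_X_Z_given_Y = h(X + Y) + h(Y + Z) - h(Y) - h(X + Y + Z)    # I(X;Z|Y)

print(I_X_YZ, I_X_Y + I_X_Z_given_Y)                         # the two sides should match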
(7) The Gaussian distribution has the largest differential entropy of any distribution with a specified mean and variance. In other words, if the random variable X has mean m and variance \sigma^2 < \infty, then
\[
h(X) \le \tfrac{1}{2}\log 2\pi e \sigma^2 .
\]

Proof. Suppose that q(x) is the density of a N(m, \sigma^2) variable X; logarithms in this proof are natural. Then
\[
\log q(x) = \log\Big(\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-m)^2}{2\sigma^2}}\Big) = -\log\big(\sqrt{2\pi}\,\sigma\big) - \frac{(x-m)^2}{2\sigma^2}.
\]
Let p(x) be any density of a random variable with mean m_p and variance \sigma_p^2. We have
\begin{align*}
-\int_{-\infty}^{\infty} p(x)\log q(x)\,dx
&= \log\big(\sqrt{2\pi}\,\sigma\big) + \frac{1}{2\sigma^2}\int_{-\infty}^{\infty} p(x)(x-m)^2\,dx \\
&= \log\big(\sqrt{2\pi}\,\sigma\big) + \frac{1}{2\sigma^2}\int_{-\infty}^{\infty} p(x)\big[(x-m_p)+(m_p-m)\big]^2\,dx \\
&= \log\big(\sqrt{2\pi}\,\sigma\big) + \frac{1}{2\sigma^2}\big[\sigma_p^2 + (m_p-m)^2\big].
\end{align*}
Thus, if p(x) and p'(x) have the same mean and the same variance, then
\[
-\int_{-\infty}^{\infty} p(x)\log q(x)\,dx = -\int_{-\infty}^{\infty} p'(x)\log q(x)\,dx .
\]
With this equation (applied with p'(x) = q(x), since p and q have the same mean m and variance \sigma^2), we obtain
\begin{align*}
0 \le D(p\|q) &= \int_{-\infty}^{\infty} p(x)\log p(x)\,dx - \int_{-\infty}^{\infty} p(x)\log q(x)\,dx \\
&= -h(p) - \int_{-\infty}^{\infty} q(x)\log q(x)\,dx \\
&= -h(p) + h(q),
\end{align*}
i.e., h(p) \le h(q) = \tfrac{1}{2}\log 2\pi e \sigma^2.

• A summary:
  - given \sigma^2, the Gaussian distribution has the largest differential entropy;
  - given h(X), the Gaussian distribution has the smallest \sigma^2 (average power).

The above argument can be extended to a random vector.

Theorem 2.8.3. Let the random vector \mathbf{X} \in \mathbb{R}^n have zero mean and covariance K = E\big[\mathbf{X}\mathbf{X}^T\big], i.e., K_{ij} = E[X_i X_j], 1 \le i, j \le n. Then
\[
h(\mathbf{X}) \le \tfrac{1}{2}\log(2\pi e)^n |K| ,
\]
with equality iff \mathbf{X} \sim N(\mathbf{0}, K).

(8) If the random variable X takes on values in the interval [-M, M], i.e., x \in [-M, M], then h(X) \le \log(2M), with equality iff X has a uniform distribution over [-M, M].

Proof. See [textbook, pp. 30] (Method 1).

Method 2: Let p(x) be the uniform density over [a, b], satisfying \int_a^b p(x)\,dx = 1, and let q(x) be an arbitrary density with \int_a^b q(x)\,dx = 1. Then
\begin{align*}
h(q) - h(p) &= -\int_a^b q(x)\log q(x)\,dx + \int_a^b p(x)\log p(x)\,dx \\
&= -\int_a^b q(x)\log q(x)\,dx + \int_a^b p(x)\log\Big(\frac{1}{b-a}\Big)\,dx \\
&= -\int_a^b q(x)\log q(x)\,dx + \log\Big(\frac{1}{b-a}\Big)\int_a^b p(x)\,dx \\
&= -\int_a^b q(x)\log q(x)\,dx + \log\Big(\frac{1}{b-a}\Big)\int_a^b q(x)\,dx \\
&= -\int_a^b q(x)\log q(x)\,dx + \int_a^b q(x)\log p(x)\,dx \\
&= \int_a^b q(x)\log\frac{p(x)}{q(x)}\,dx \\
&\le \log\Big[\int_a^b q(x)\,\frac{p(x)}{q(x)}\,dx\Big] = 0,
\end{align*}
where the "\le" follows from Jensen's inequality. Hence h(q) \le h(p) = \log(b-a); taking [a, b] = [-M, M] gives h(X) \le \log(2M).
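Both maximum-entropy statements, property (7) at fixed variance and property (8) on a bounded interval, can be checked against distributions whose differential entropies SciPy evaluates directly. The sketch below is illustrative rather than part of the textbook; the values of M and sigma and the comparison densities (truncated normal, Laplace, uniform) are arbitrary choices.

# An illustrative sketch (not from the text), assuming NumPy/SciPy: check
# property (8), h(X) <= log(2M) on [-M, M], and property (7), the Gaussian
# maximizes entropy at a fixed variance. All entropies are in nats.
import numpy as np
from scipy.stats import uniform, truncnorm, norm, laplace

M = 2.0
# Property (8): the uniform density on [-M, M] attains log(2M); a truncated
# normal on the same interval stays below the bound.
h_uniform = uniform(loc=-M, scale=2 * M).entropy()
h_trunc = truncnorm(a=-M, b=M, loc=0.0, scale=1.0).entropy()   # N(0,1) truncated to [-M, M]
print(np.log(2 * M), h_uniform, h_trunc)   # bound, equality case, strictly smaller

# Property (7): at a common variance sigma^2, the Gaussian entropy
# 0.5*log(2*pi*e*sigma^2) exceeds that of a Laplace or uniform with the same variance.
sigma = 1.5
h_gauss = norm(scale=sigma).entropy()
h_laplace = laplace(scale=sigma / np.sqrt(2)).entropy()        # Laplace(b) has variance 2*b^2
h_unif_var = uniform(loc=-sigma * np.sqrt(3), scale=2 * sigma * np.sqrt(3)).entropy()  # variance sigma^2
print(h_gauss, h_laplace, h_unif_var)      # the Gaussian value is the largest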