2.5 Sufficient Statistics

Suppose we have a family of pmfs {fθ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (like the sample mean or sample variance). Then θ → X → T(X), and by the data-processing inequality

I(θ; T(X)) ≤ I(θ; X)

A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.

Definition 2.5.1. A function T(X) is said to be a sufficient statistic relative to the family {fθ(x)} if X is independent of θ given T(X); i.e., θ → T(X) → X forms a Markov chain. It means that I(θ; X) = I(θ; T(X)) for all distributions on θ.

Example 2.5.1. Let X1, . . . , Xn, Xi ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(Xi = 1). Given n, the number of 1's is a sufficient statistic for θ. Here, T(X1, . . . , Xn) = ∑_{i=1}^n Xi.

Example 2.5.2. If X ∼ N(θ, 1), i.e., if

fθ(x) = (1/√(2π)) exp(−(x − θ)²/2) = N(θ, 1),

and X1, . . . , Xn are drawn independently according to this distribution, then X̄n = T(X1, . . . , Xn) = (1/n) ∑_{i=1}^n Xi is a sufficient statistic for θ.
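As a numerical illustration of Definition 2.5.1 and Example 2.5.1 (a sketch added here, not part of the original derivation), the Python code below places a two-point prior on θ for n coin tosses and checks that I(θ; X1, . . . , Xn) = I(θ; T) with T = ∑ Xi, while a non-sufficient statistic such as X1 alone carries strictly less information. The prior, the values of θ, and n = 4 are arbitrary choices for the example.

import numpy as np
from itertools import product

def entropy(p):
    # Shannon entropy in bits of a probability vector, ignoring zero entries
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    # I(A;B) = H(A) + H(B) - H(A,B), computed from a joint table P(a,b)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

n = 4
thetas = [0.2, 0.7]        # two-point prior on theta (arbitrary values)
prior = [0.5, 0.5]
seqs = list(product([0, 1], repeat=n))

# Joint distribution P(theta, x^n) = prior(theta) * theta^k * (1-theta)^(n-k)
joint_xn = np.zeros((len(thetas), len(seqs)))
for i, (th, pr) in enumerate(zip(thetas, prior)):
    for j, x in enumerate(seqs):
        k = sum(x)
        joint_xn[i, j] = pr * th**k * (1 - th)**(n - k)

# Joint distribution P(theta, T) with T = number of 1's (the sufficient statistic)
joint_T = np.zeros((len(thetas), n + 1))
for j, x in enumerate(seqs):
    joint_T[:, sum(x)] += joint_xn[:, j]

# Joint distribution P(theta, X_1): the first toss alone is not sufficient
joint_X1 = np.zeros((len(thetas), 2))
for j, x in enumerate(seqs):
    joint_X1[:, x[0]] += joint_xn[:, j]

print("I(theta; X^n) =", mutual_information(joint_xn))   # equals I(theta; T)
print("I(theta; T)   =", mutual_information(joint_T))
print("I(theta; X_1) =", mutual_information(joint_X1))   # strictly smaller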
2.6 Fano's inequality

Suppose that we wish to estimate a r.v. X based on a correlated r.v. Y. Fano's inequality relates the probability of error in estimating X to its conditional entropy H(X|Y).

Theorem 2.6.1. For any r.v.'s X and Y, we try to guess X by X̂ = g(Y). The error probability Pe = P(X̂ ≠ X) satisfies

H(Pe) + Pe log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y)   (2.24)

Proof. Apply the chain rule in two ways to H(E, X|X̂), where the error r.v. E is defined by

E = 1 if X̂ ≠ X,  E = 0 if X̂ = X.

Since E is a function of (X, X̂), we have H(E|X, X̂) = 0, so

H(E, X|X̂) = H(X|X̂) + H(E|X, X̂) = H(X|X̂)

On the other hand,

H(E, X|X̂) = H(E|X̂) + H(X|E, X̂)
≤ H(E) + H(X|E, X̂)
= H(Pe) + P(E = 1)H(X|X̂, E = 1) + P(E = 0)H(X|X̂, E = 0)
= H(Pe) + Pe H(X|X̂, E = 1) + (1 − Pe)H(X|X̂, E = 0)
= H(Pe) + Pe H(X|X̂, E = 1) + 0
≤ H(Pe) + Pe log(|X| − 1)   (given an error, X takes one of the |X| − 1 values other than X̂)
≤ 1 + Pe log |X|   (2.25)

Thus we obtain H(Pe) + Pe log(|X| − 1) ≥ H(X|X̂). By the data-processing theorem, we have

I(X; X̂) ≤ I(X; Y) ⇒ H(X|X̂) ≥ H(X|Y).

Then (2.24) follows.

Physical interpretation: given the observation Y, the remaining uncertainty about X can be split into two parts:
• deciding whether the guess is correct, with uncertainty H(Pe);
• if the guess is wrong (with probability Pe), X may take any of the |X| − 1 values other than X̂; identifying which one requires at most log(|X| − 1) bits.
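The following sketch (added for illustration, assuming an arbitrary randomly generated joint pmf) builds the MAP guess X̂ = g(Y) and verifies both inequalities in (2.24) numerically; H(Pe) below is the binary entropy of the error probability.

import numpy as np

rng = np.random.default_rng(0)

def H(p):
    # entropy in bits of a probability vector
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

nx, ny = 5, 4
Pxy = rng.random((nx, ny))
Pxy /= Pxy.sum()                       # arbitrary joint pmf P(x, y)
Py = Pxy.sum(axis=0)

# MAP estimator: for each y, guess the x that maximizes P(x|y)
g = Pxy.argmax(axis=0)
Pe = 1 - sum(Pxy[g[y], y] for y in range(ny))

# H(X|Y) = sum_y P(y) H(X | Y = y)
HX_given_Y = sum(Py[y] * H(Pxy[:, y] / Py[y]) for y in range(ny))

# H(X|Xhat), where Xhat = g(Y): build the joint table of (X, Xhat)
Pxxhat = np.zeros((nx, nx))
for y in range(ny):
    Pxxhat[:, g[y]] += Pxy[:, y]
Pxhat = Pxxhat.sum(axis=0)
HX_given_Xhat = sum(Pxhat[k] * H(Pxxhat[:, k] / Pxhat[k])
                    for k in range(nx) if Pxhat[k] > 0)

lhs = H([Pe, 1 - Pe]) + Pe * np.log2(nx - 1)   # H(Pe) + Pe log(|X| - 1)
print("H(Pe) + Pe*log(|X|-1) =", round(lhs, 4))
print("H(X|Xhat)             =", round(HX_given_Xhat, 4))
print("H(X|Y)                =", round(HX_given_Y, 4))
assert lhs + 1e-9 >= HX_given_Xhat >= HX_given_Y - 1e-9   # the chain in (2.24)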
2.7 Convex functions (convexity of mutual information)

2.7.1 Concavity of entropy

Theorem 2.7.1. Entropy H(X) is concave in P(x). That is, let X1 and X2 be r.v.'s defined on X with distributions P1(x) and P2(x), respectively. For any θ ∈ [0, 1], consider a r.v. X with

PX(x) = θP1(x) + (1 − θ)P2(x)  ∀x

Then

H(X) ≥ θH(X1) + (1 − θ)H(X2)

Proof. Let Z be a binary r.v. with P(Z = 0) = θ, chosen independently of X1 and X2, and let

X = X1 if Z = 0,  X = X2 if Z = 1.

Then X has the mixture distribution PX above, and since conditioning reduces entropy,

H(X) ≥ H(X|Z)
= θH(X|Z = 0) + (1 − θ)H(X|Z = 1)
= θH(X1) + (1 − θ)H(X2)   (2.26)

or, equivalently,

H(θP1 + (1 − θ)P2) ≥ θH(P1) + (1 − θ)H(P2)
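A quick numerical check of Theorem 2.7.1 (added for illustration; the two distributions below are arbitrary): the entropy of the mixture θP1 + (1 − θ)P2 is compared against the convex combination of the individual entropies.

import numpy as np

def H(p):
    # entropy in bits of a probability vector
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P1 = np.array([0.7, 0.2, 0.1])   # arbitrary distributions on a 3-letter alphabet
P2 = np.array([0.1, 0.3, 0.6])

for theta in np.linspace(0, 1, 6):
    mix = theta * P1 + (1 - theta) * P2
    lhs = H(mix)                                      # H(theta*P1 + (1-theta)*P2)
    rhs = theta * H(P1) + (1 - theta) * H(P2)
    print(f"theta={theta:.1f}:  H(mixture)={lhs:.4f}  >=  {rhs:.4f}")
    assert lhs >= rhs - 1e-12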
2.7.2 Concavity of mutual information

Theorem 2.7.2. For a fixed transition probability P(y|x), I(X; Y) is a concave (convex-∩) function of P(x).

Proof. Construct X1, X2, X, and Z as above, and let Y be the output of the channel P(y|x) with input X. Consider

I(XZ; Y) = I(X; Y) + I(Z; Y|X)
= I(Z; Y) + I(X; Y|Z)   (2.27)

Conditioned on X, the r.v.'s Y and Z are independent, i.e., P(y|x, z) = P(y|x). Using I(Z; Y|X) = 0 and I(Z; Y) ≥ 0, we have

I(X; Y) ≥ I(X; Y|Z)
= θI(X; Y|Z = 0) + (1 − θ)I(X; Y|Z = 1)   (2.28)
= θI(X1; Y) + (1 − θ)I(X2; Y)   (2.29)

[Figure 2.5: convexity of mutual information for a given channel — an input X ∈ {0, 1, . . . , K − 1} passes through the channel P(Y|X) to produce Y ∈ {0, 1, . . . , J − 1}, with the switch Z selecting the input distribution.]
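To illustrate Theorem 2.7.2 (a sketch with an arbitrarily chosen channel matrix and input distributions), the code below fixes P(y|x) and checks that I(X; Y) evaluated at the input mixture θP1 + (1 − θ)P2 is at least the corresponding combination of mutual informations.

import numpy as np

def mutual_information(px, W):
    # I(X;Y) in bits for input pmf px and channel matrix W, where W[x, y] = P(y|x)
    pxy = px[:, None] * W                 # joint P(x, y)
    py = pxy.sum(axis=0)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask]))

W = np.array([[0.8, 0.1, 0.1],            # arbitrary fixed channel P(y|x)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
P1 = np.array([0.6, 0.3, 0.1])            # two arbitrary input distributions
P2 = np.array([0.1, 0.2, 0.7])

for theta in np.linspace(0, 1, 6):
    mix = theta * P1 + (1 - theta) * P2
    lhs = mutual_information(mix, W)
    rhs = theta * mutual_information(P1, W) + (1 - theta) * mutual_information(P2, W)
    print(f"theta={theta:.1f}:  I(mixture input)={lhs:.4f}  >=  {rhs:.4f}")
    assert lhs >= rhs - 1e-12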
Theorem 2.7.3. For a fixed input distribution P(x), I(X; Y) is convex-∪ in P(y|x).

Proof. Consider a r.v. X and two channels with transition probabilities P1(y|x) and P2(y|x). When fed with X, the outputs of the two channels are denoted by Y1 and Y2. Now let one channel be chosen randomly according to a binary r.v. Z with P(Z = 0) = θ that is independent of X, and denote the output by Y.

I(X; Y Z) = I(X; Z) + I(X; Y|Z)
= I(X; Y) + I(X; Z|Y)   (2.30)

Since Z is independent of X, I(X; Z) = 0, and I(X; Z|Y) ≥ 0. Thus,

I(X; Y) ≤ I(X; Y|Z) = θI(X; Y1) + (1 − θ)I(X; Y2).
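As a companion check for Theorem 2.7.3 (illustrative only; the input distribution and the two binary symmetric channels are arbitrary), the sketch below verifies that I(X; Y) for the mixed channel λP1(y|x) + (1 − λ)P2(y|x) is at most λI(X; Y1) + (1 − λ)I(X; Y2).

import numpy as np

def mutual_information(px, W):
    # I(X;Y) in bits for input pmf px and channel matrix W, where W[x, y] = P(y|x)
    pxy = px[:, None] * W
    py = pxy.sum(axis=0)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask]))

def bsc(eps):
    # binary symmetric channel with crossover probability eps
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

px = np.array([0.4, 0.6])      # fixed input distribution (arbitrary)
W1, W2 = bsc(0.05), bsc(0.4)   # two channels with arbitrary crossover probabilities

for lam in np.linspace(0, 1, 6):
    W_mix = lam * W1 + (1 - lam) * W2
    lhs = mutual_information(px, W_mix)
    rhs = lam * mutual_information(px, W1) + (1 - lam) * mutual_information(px, W2)
    print(f"lambda={lam:.1f}:  I(mixed channel)={lhs:.4f}  <=  {rhs:.4f}")
    assert lhs <= rhs + 1e-12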
2.7.3 Convexity of mutual information: an alternative proof

Theorem 2.7.4. Let (X, Y) ∼ P(x, y) = P(x)P(y|x). The mutual information I(X; Y) is a convex-∩ function of P(x) for fixed P(y|x) and a convex-∪ function of P(y|x) for fixed P(x).

Proof. Write

I(X; Y) = H(Y) − H(Y|X)
= H(Y) − ∑_x P(x)H(Y|X = x)   (2.31)

For the first part, fix P(y|x):
• P(y) = ∑_x P(x)P(y|x) is a linear function of P(x), and H(Y) is a convex-∩ function of P(y); hence H(Y) is a convex-∩ function of P(x).
• The 2nd term in (2.31) is a linear function of P(x).
The difference is therefore a convex-∩ function of P(x).

To prove the 2nd part, we fix P(x) and consider two different conditional distributions P1(y|x) and P2(y|x). Let

Pλ(y|x) = λP1(y|x) + (1 − λ)P2(y|x)

Then

Pλ(x, y) = λP1(x, y) + (1 − λ)P2(x, y)  and  Pλ(y) = λP1(y) + (1 − λ)P2(y),

where Pi(x, y) = Pi(y|x)P(x) and Pi(y) = ∑_x Pi(x, y) for i = 1, 2. If we let Qλ(x, y) = P(x)Pλ(y) and Qi(x, y) = P(x)Pi(y), then we have

Qλ(x, y) = P(x)[λP1(y) + (1 − λ)P2(y)]
= λQ1(x, y) + (1 − λ)Q2(x, y)   (2.32)

Since I(X; Y) = D(Pλ‖Qλ) and the relative entropy D(P‖Q) is a convex function of the pair (P, Q),

D(Pλ‖Qλ) ≤ λD(P1‖Q1) + (1 − λ)D(P2‖Q2),

where D(Pi‖Qi) is the mutual information under channel Pi(y|x). It follows that I(X; Y) is a convex-∪ function of P(y|x).
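The alternative proof rests on the identity I(X; Y) = D(Pλ‖Qλ), i.e., mutual information is the relative entropy between the joint distribution and the product of its marginals. The short check below (added for illustration, with an arbitrary random joint pmf) verifies this identity by computing I(X; Y) as H(X) + H(Y) − H(X, Y) and comparing it with D(PXY‖PX PY).

import numpy as np

rng = np.random.default_rng(1)

def H(p):
    # entropy in bits of a (possibly multidimensional) probability table
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def relative_entropy(P, Q):
    # D(P||Q) in bits for pmfs defined on the same alphabet
    P, Q = np.asarray(P, float).ravel(), np.asarray(Q, float).ravel()
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

Pxy = rng.random((3, 4))
Pxy /= Pxy.sum()                       # arbitrary joint pmf P(x, y)
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

I_via_entropies = H(Px) + H(Py) - H(Pxy)
I_via_divergence = relative_entropy(Pxy, np.outer(Px, Py))
print("H(X) + H(Y) - H(X,Y) =", round(I_via_entropies, 6))
print("D(P_XY || P_X P_Y)   =", round(I_via_divergence, 6))
assert abs(I_via_entropies - I_via_divergence) < 1e-9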
2.8 Differential Entropy

Differential entropy is the entropy of a continuous random variable. Let X be a continuous random variable with cumulative distribution function F(x) = Pr{X ≤ x}. The probability density function of X (if it exists) is given by

p(x) = dF(x)/dx,  x ∈ (−∞, +∞)

The set of x where p(x) > 0 is called the support set of X.

Definition 2.8.1. The differential entropy h(X) (or Hc(X)) of a continuous random variable X with a density p(x) is defined by

h(X) = ∫_S p(x) log (1/p(x)) dx = − ∫_S p(x) log p(x) dx   (2.33)

where S is the support set of the random variable.
• Motivation for the definition: Think of quantizing the x axis into intervals of length △x; the quantized outcomes then form a discrete ensemble with

Pr(xi < X ≤ xi + △x) ≈ p(xi)△x

The self-information of the outcome "xi < X ≤ xi + △x" is log(1/(p(xi)△x)), and the entropy of the quantized r.v. is

H△x(X) = − ∑_{i=−∞}^{+∞} p(xi)△x log(p(xi)△x)
= − ∑_i p(xi)△x log p(xi) − log △x ∑_i p(xi)△x

As △x → 0, we have

lim_{△x→0} H△x(X) = − ∫_{−∞}^{∞} p(x) log p(x) dx − lim_{△x→0} log △x ∫_{−∞}^{∞} p(x) dx
= − ∫_{−∞}^{∞} p(x) log p(x) dx − lim_{△x→0} log △x   (2.34)

It is seen that the 2nd term on the RHS of (2.34) approaches ∞ as △x → 0, so H△x(X) diverges. For the purpose of calculation, the first term is used to define the entropy of a continuous ensemble.
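The sketch below (an added illustration, assuming a standard Gaussian density and a truncation of the axis to [−10, 10]) quantizes p(x) with bin width △x and shows that H△x(X) + log2 △x approaches the differential entropy (1/2) log2(2πe) ≈ 2.047 bits as △x → 0, in line with (2.34).

import numpy as np

def quantized_entropy(pdf, a, b, dx):
    # entropy in bits of the discretized r.v. with bin probabilities ~ pdf(x_i) * dx
    xs = np.arange(a, b, dx)
    p = pdf(xs) * dx
    p = p / p.sum()            # renormalize the small truncated tail mass
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0,1) density
h_exact = 0.5 * np.log2(2 * np.pi * np.e)                  # differential entropy, cf. (2.35)

for dx in [1.0, 0.5, 0.1, 0.01, 0.001]:
    Hq = quantized_entropy(gauss, -10.0, 10.0, dx)
    print(f"dx={dx:7}  H_dx(X)={Hq:8.4f}   H_dx(X)+log2(dx)={Hq + np.log2(dx):7.4f}")
print(f"differential entropy h(X) = {h_exact:.4f} bits")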
Example 2.8.1 (Uniform distribution). Consider a random variable X distributed uniformly over (a, b). Its differential entropy is

h(X) = − ∫_a^b (1/(b − a)) log(1/(b − a)) dx = log(b − a)

Notice that differential entropy is not necessarily positive (it is negative when b − a < 1) and not necessarily finite.

Example 2.8.2 (Gaussian distribution). X ∼ N(0, σ²). The p.d.f. of X is

p(x) = (1/(√(2π)σ)) exp(−x²/(2σ²))

The differential entropy (computed in nats) is then given by

h(X) = − ∫_{−∞}^{∞} p(x) ln p(x) dx = − ∫_{−∞}^{∞} p(x) [− ln(√(2π)σ) − x²/(2σ²)] dx
= (1/2) ln 2πσ² + (1/(2σ²)) ∫_{−∞}^{∞} p(x) x² dx
= (1/2) ln 2πσ² + 1/2
= (1/2) ln 2πσ² + (1/2) ln e
= (1/2) ln 2πeσ²   nats   (2.35)
= (1/2) log2 2πeσ²   bits   (since log2 Φ = ln Φ · log2 e)

From (2.35), the variance of X can be expressed as

σ² = (1/(2πe)) e^{2h(X)},

which is sometimes called the "entropy power".
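To connect Example 2.8.2 to a computation (an added sketch; the value σ = 1.7 and the integration grid are arbitrary), the code below evaluates −∫ p(x) ln p(x) dx numerically, compares it with the closed form (1/2) ln(2πeσ²), and recovers σ² from the entropy-power expression.

import numpy as np

sigma = 1.7                    # arbitrary standard deviation for the example
p = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Riemann-sum approximation of h(X) = -integral p(x) ln p(x) dx, in nats
x = np.linspace(-12 * sigma, 12 * sigma, 200001)
px = p(x)
h_nats = np.sum(-px * np.log(px)) * (x[1] - x[0])

h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # closed form (2.35), in nats
print(f"numerical   h(X) = {h_nats:.6f} nats")
print(f"closed form h(X) = {h_closed:.6f} nats")

# Entropy power (1/(2*pi*e)) * exp(2 h(X)) recovers the variance of a Gaussian
print(f"entropy power = {np.exp(2 * h_nats) / (2 * np.pi * np.e):.6f}   (sigma^2 = {sigma**2})")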
2.8.1 Joint and conditional differential entropy

Definition 2.8.2. The differential entropy of a set {X1, X2, . . . , Xn} of random variables with density p(x1, x2, . . . , xn) is defined as

h(X1, X2, . . . , Xn) = − ∫ p(x1, x2, . . . , xn) log p(x1, x2, . . . , xn) dx1 dx2 . . . dxn
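As a direct application of Definition 2.8.2 (an added illustration; the interval endpoints are arbitrary), take two independent uniform r.v.'s X1 ∼ U(a, b) and X2 ∼ U(c, d). Their joint density is the constant 1/((b − a)(d − c)) on the rectangle, so the definition gives h(X1, X2) = log(b − a) + log(d − c). The sketch below checks this with a two-dimensional Riemann sum.

import numpy as np

a, b = 0.0, 2.0      # X1 ~ Uniform(a, b), arbitrary endpoints
c, d = 0.0, 3.0      # X2 ~ Uniform(c, d)

# Joint density of two independent uniforms: constant on the rectangle
p = 1.0 / ((b - a) * (d - c))

# Two-dimensional Riemann sum of -p log2(p) over the support, per Definition 2.8.2
n = 400
dx1, dx2 = (b - a) / n, (d - c) / n
h_numeric = np.sum(np.full((n, n), -p * np.log2(p)) * dx1 * dx2)

h_closed = np.log2(b - a) + np.log2(d - c)
print(f"numerical   h(X1, X2) = {h_numeric:.6f} bits")
print(f"closed form           = {h_closed:.6f} bits   (= log(b-a) + log(d-c))")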