This allows us to express "$n$-dimensional" rectangles in $\mathbb{R}^n$ succinctly:
$$I = (a, b] = \{x \in \mathbb{R}^n : a < x \le b\} \quad \text{for any } a, b \in \bar{\mathbb{R}}^n.$$
The interior and closure of $I$ are, respectively,
$$I^{\circ} = (a, b) = \{x \in \mathbb{R}^n : a < x < b\} \quad \text{and} \quad \bar{I} = [a, b] = \{x \in \mathbb{R}^n : a \le x \le b\},$$
and the boundary of $I$ is the "$(n-1)$-dimensional" relative complement $\partial I = \bar{I} - I^{\circ}$. Finally, let the $2^n$ "corners" of $I$ (a subset of $\bar{\mathbb{R}}^n$) be denoted by the cartesian product $a \times b = \times_{i=1}^{n} \{a_i, b_i\}$.

Definition 2.1 For $x$ distributed on $\mathbb{R}^n$, the distribution function (d.f.) of $x$ is the function $F : \bar{\mathbb{R}}^n \to [0, 1]$, where $F(t) = P(x \le t)$, $\forall t \in \bar{\mathbb{R}}^n$. This is denoted $x \sim F$ or $x \sim F_x$.

A d.f. is automatically right-continuous; thus, if it is known on any dense subset $D \subset \mathbb{R}^n$, it is determined everywhere. This is because, for any $t \in \bar{\mathbb{R}}^n$, a sequence $d_n$ may be chosen in $D$ descending to $t$: $d_n \downarrow t$.

From the d.f. may be computed the probability of any rectangle,
$$P(a < x \le b) = \sum_{t \in a \times b} (-1)^{N_a(t)} F(t), \quad \forall a < b,$$
where $N_a(t) = \sum_{i=1}^{n} \delta(a_i, t_i)$ counts the number of $t_i$'s that are $a_i$'s.

The Borel subsets of $\mathbb{R}^n$ comprise the smallest $\sigma$-algebra containing the rectangles,
$$\mathcal{B}^n = \sigma\!\left(\{(a, b] : a, b \in \mathbb{R}^n\}\right).$$
The class $\mathcal{G}^n$ of all countable disjoint unions of rectangles contains all the open subsets of $\mathbb{R}^n$, and if we let $G = \bigcup_{i=1}^{\infty} (a_i, b_i]$ denote a generic element of this class, it follows that
$$P(x \in G) = \sum_{i=1}^{\infty} P(a_i < x \le b_i).$$
By the Carathéodory extension theorem (C.E.T.), the probability of a general Borel set $A \in \mathcal{B}^n$ is then uniquely determined by the formula
$$P_x(A) \equiv P(x \in A) = \inf_{A \subset G} P(x \in G).$$
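To see the corner sum above in action, here is a minimal numerical sketch (not from the text; the function names and the choice of d.f. are illustrative only). It evaluates $\sum_{t \in a \times b} (-1)^{N_a(t)} F(t)$ for a d.f. supplied as a Python function and checks it in the case of two independent Uniform$(0,1)$ coordinates, where $F(t_1, t_2) = t_1 t_2$ on $[0,1]^2$:

```python
from itertools import product

def rect_prob(F, a, b):
    """P(a < x <= b) computed as the corner sum  sum_{t in a x b} (-1)^{N_a(t)} F(t)."""
    n = len(a)
    total = 0.0
    for t in product(*[(a[i], b[i]) for i in range(n)]):   # the 2^n corners of (a, b]
        n_a = sum(t[i] == a[i] for i in range(n))           # N_a(t): coordinates equal to a_i
        total += (-1) ** n_a * F(t)
    return total

# Two independent Uniform(0,1) coordinates: F(t) = t1 * t2 on [0,1]^2.
F = lambda t: t[0] * t[1]
print(rect_prob(F, (0.2, 0.3), (0.5, 0.9)))   # (0.5 - 0.2) * (0.9 - 0.3) = 0.18
```

For $n = 2$ the sum reduces to the familiar $F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2)$.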
2.3 Equals-in-distribution

Definition 2.2 $x$ and $y$ are equidistributed (identically distributed), denoted $x \overset{d}{=} y$, iff $P_x(A) = P_y(A)$, $\forall A \in \mathcal{B}^n$.

On the basis of the previous section, it should be clear that for any dense $D \subset \mathbb{R}^n$:

Proposition 2.1 (C.E.T.) $x \overset{d}{=} y \iff F_x(t) = F_y(t)$, $\forall t \in D$.

Although at first glance $\overset{d}{=}$ looks like nothing more than a convenient shorthand symbol, there is an immediate consequence of the definition, deceptively simple to state and prove, that has powerful application in the sequel. Let $g : \mathbb{R}^n \to \Omega$, where $\Omega$ is a completely arbitrary space.

Proposition 2.2 (Invariance) $x \overset{d}{=} y \implies g(x) \overset{d}{=} g(y)$.

Proof. $P(g(x) \in B) = P\left(x \in g^{-1}(B)\right) = P\left(y \in g^{-1}(B)\right) = P(g(y) \in B)$. $\Box$

Example 2.1
$$x \overset{d}{=} y \implies x_i \overset{d}{=} y_i, \quad i = 1, \dots, n,$$
$$\implies x_i x_j \overset{d}{=} y_i y_j, \quad i, j = 1, \dots, n,$$
$$\implies \prod_{i=1}^{n} x_i^{r_i} \overset{d}{=} \prod_{i=1}^{n} y_i^{r_i}, \quad \text{for any } r_i,\ i = 1, \dots, n,$$
$$\implies \text{etc.}$$

2.4 Discrete distributions

Definition 2.3 The probability function (p.f.) of $x$ is the function $p : \bar{\mathbb{R}}^n \to [0, 1]$ where $p(t) = P(x = t)$, $\forall t \in \bar{\mathbb{R}}^n$.

The p.f. may be evaluated directly from the d.f.:
$$p(t) = \lim_{s_m \uparrow t} P(s_m < x \le t),$$
where $s_m \uparrow t$ means $s_1 < s_2 < \cdots$ and $s_m \to t$ as $m \to \infty$ (illustrated below for $n = 1$). The subset $D = p^{-1}(0)^c$ where the p.f. is nonzero may contain at most a countable number of points. $D$ is known as the discrete part of $x$, and $x$ is said to be discrete if it is "concentrated" on $D$:

Definition 2.4 $x$ is discrete iff $P(x \in D) = 1$.
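The univariate illustration promised above (added for concreteness; it is not part of the original text): when $n = 1$ we have $P(s_m < x \le t) = F(t) - F(s_m)$, so
$$p(t) = \lim_{s_m \uparrow t}\left[F(t) - F(s_m)\right] = F(t) - F(t^-),$$
the jump of $F$ at $t$; hence $p$ is nonzero exactly at the (at most countably many) discontinuity points of $F$.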
One may verify that
$$x \text{ is discrete} \iff P(x \in A) = \sum_{t \in A \cap D} p(t), \quad \forall A \in \mathcal{B}^n.$$
Thus, the distribution of $x$ is entirely determined by its p.f. if and only if it is discrete, and in this case, we may simply write $x \sim p$ or $x \sim p_x$.

2.5 Expected values

For any event $A$, we may consider the indicator function
$$I_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A. \end{cases}$$
It is clear that $I_A(x)$ is itself a discrete random variable, referred to as a Bernoulli trial, for which
$$P\left(I_A(x) = 1\right) = P_x(A) \quad \text{and} \quad P\left(I_A(x) = 0\right) = 1 - P_x(A).$$
This is denoted $I_A(x) \sim \text{Bernoulli}\left(P_x(A)\right)$ and we define $E\, I_A(x) = P_x(A)$.

For any $k$ mutually disjoint and exhaustive events $A_1, \dots, A_k$ and $k$ real numbers $a_1, \dots, a_k$, we may form the simple function
$$s(x) = a_1 I_{A_1}(x) + \cdots + a_k I_{A_k}(x).$$
Obviously, $s(x)$ is also discrete with
$$P\left(s(x) = a_i\right) = P_x(A_i), \quad i = 1, \dots, k.$$
By requiring that $E$ be linear, we (are forced to) define
$$E\, s(x) = a_1 P_x(A_1) + \cdots + a_k P_x(A_k).$$

The most general function for which we need ever compute an expected value may be directly expressed as a limit of a sequence of simple functions. Such a function $g(x)$ is said to be measurable and we may explicitly write
$$g(x) = \lim_{N \to \infty} s_N(x),$$
where convergence holds pointwise, i.e., for every fixed $x$. If $g(x)$ is nonnegative, it can be proven that we may always choose the sequence of simple functions to be themselves nonnegative and nondecreasing as a sequence, whereupon we define
$$E\, g(x) = \lim_{N \to \infty} E\, s_N(x) = \sup_N E\, s_N(x).$$
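One standard choice of such a nondecreasing sequence is the dyadic truncation $s_N = \min\left(\lfloor 2^N g \rfloor / 2^N,\, N\right)$. The sketch below (illustrative only; the distribution of $x$ and the function $g$ are arbitrary choices, not from the text) approximates $E\, g(x)$ by Monte Carlo averages of $s_N(x)$ and shows them increasing toward $E\, g(x)$:

```python
import numpy as np

def s_N(g_vals, N):
    """Dyadic simple function s_N = min(floor(2^N g)/2^N, N): nonnegative,
    takes finitely many values, is nondecreasing in N, and s_N -> g pointwise."""
    return np.minimum(np.floor(2.0 ** N * g_vals) / 2.0 ** N, N)

rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)     # draws of x ~ Uniform(0,1), chosen only for illustration
g = x ** 2                        # g(x) = x^2, so E g(x) = 1/3
for N in (1, 2, 4, 8):
    print(N, s_N(g, N).mean())    # E s_N(x) increases toward E g(x) = 1/3
```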
Then, in general, we write $g(x)$ as the difference of its positive and negative parts, $g(x) = g^+(x) - g^-(x)$, defined by
$$g^+(x) = \begin{cases} g(x), & g(x) \ge 0 \\ 0, & g(x) < 0, \end{cases} \qquad g^-(x) = \begin{cases} -g(x), & g(x) \le 0 \\ 0, & g(x) > 0, \end{cases}$$
and finish by defining
$$E\, g(x) = \begin{cases} E\, g^+(x) - E\, g^-(x), & \text{if } E\, g^+(x) < \infty \text{ or } E\, g^-(x) < \infty \\ \text{"undefined,"} & \text{otherwise.} \end{cases}$$
We may sometimes use the Leibniz notation
$$E\, g(x) = \int g(t)\, dP_x(t) = \int g(t)\, dF(t).$$
One should verify the fundamental inequality $|E\, g(x)| \le E\, |g(x)|$.

Let $\uparrow$ denote convergence of a monotonically nondecreasing sequence. Something is said to happen for almost all $x$ if it fails to happen on a set $A$ such that $P_x(A) = 0$. The two main theorems concerning "continuity" of $E$ are the following:

Proposition 2.3 (Monotone convergence theorem (M.C.T.)) Suppose $0 \le g_1(x) \le g_2(x) \le \cdots$. If $g_N(x) \uparrow g(x)$, for almost all $x$, then $E\, g_N(x) \uparrow E\, g(x)$.

Proposition 2.4 (Dominated convergence theorem (D.C.T.)) If $g_N(x) \to g(x)$, for almost all $x$, and $|g_N(x)| \le h(x)$ with $E\, h(x) < \infty$, then $E\, |g_N(x) - g(x)| \to 0$ and, thus, also $E\, g_N(x) \to E\, g(x)$.

It should be clear by the process whereby expectation is defined (in stages) that we have

Proposition 2.5 $x \overset{d}{=} y \iff E\, g(x) = E\, g(y)$, $\forall g$ measurable.

2.6 Mean and variance

Consider the "linear functional" $t'x = \sum_{i=1}^{n} t_i x_i$ for each (fixed) $t \in \mathbb{R}^n$, and the "euclidean norm" (length) $|x| = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2}$. By any of three equivalent ways, for $p > 0$ one may say that the $p$th moment of $x$ is finite:
$$E\, |t'x|^p < \infty,\ \forall t \in \mathbb{R}^n \iff E\, |x_i|^p < \infty,\ i = 1, \dots, n \iff E\, |x|^p < \infty.$$
To show this, one must realize that $|x_i| \le |x| \le \sum_{i=1}^{n} |x_i|$ and $L_p = \{x \in \mathbb{R}^n : E\, |x|^p < \infty\}$ is a linear space (v. Problem 2.14.3). From the simple inequality $a^r \le 1 + a^p$, $\forall a \ge 0$ and $0 < r \le p$, if we let $a = |x|$ and take expectations, we get $E\, |x|^r \le 1 + E\, |x|^p$. Hence, if for $p > 0$ the $p$th moment of $x$ is finite, then also the $r$th moment is finite, for any $0 < r \le p$.
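The elementary inequality invoked here follows by considering the two cases $a \le 1$ and $a > 1$ (a one-line justification added for completeness; it is not spelled out in the original):
$$a^r \le \max(1, a^r) \le \max(1, a^p) \le 1 + a^p, \quad a \ge 0,\ 0 < r \le p,$$
since $a \le 1$ gives $a^r \le 1$, while $a > 1$ and $r \le p$ give $a^r \le a^p$.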
A product-moment of order $p$ for $x = (x_1, \dots, x_n)'$ is defined by
$$E \prod_{i=1}^{n} x_i^{p_i}, \quad p_i \ge 0,\ i = 1, \dots, n, \quad \sum_{i=1}^{n} p_i = p.$$
A useful inequality to determine that a product-moment is finite is Hölder's inequality:

Proposition 2.6 (Hölder's inequality) For any univariate random variables $x$ and $y$,
$$E\, |xy| \le \left(E\, |x|^r\right)^{1/r} \cdot \left(E\, |y|^s\right)^{1/s}, \quad r > 1,\ \frac{1}{r} + \frac{1}{s} = 1.$$
From this inequality, if the $p$th moment of $x \in \mathbb{R}^n$ is finite, then all product-moments of order $p$ are also finite. This can be verified for $n = 2$, as Hölder's inequality gives
$$E\, |x_1^{p_1} x_2^{p_2}| \le \left(E\, |x_1|^p\right)^{p_1/p} \cdot \left(E\, |x_2|^p\right)^{p_2/p}, \quad p_i \ge 0,\ i = 1, 2,\ p_1 + p_2 = p.$$
The conclusion for general $n$ follows by induction.

If the first moment of $x$ is finite, we define the mean of $x$ by
$$\mu = E\, x \overset{\text{def}}{=} (E\, x_i) = (\mu_i).$$
If the second moment of $x$ is finite, we define the variance of $x$ by
$$\Sigma = \operatorname{var} x \overset{\text{def}}{=} \left(\operatorname{cov}(x_i, x_j)\right) = (\sigma_{ij}).$$
In general, we define the expected value of any multiply indexed array of univariate random variables, $\xi = (x_{ijk\cdots})$, componentwise by $E\, \xi = (E\, x_{ijk\cdots})$. Vectors and matrices are thus only special cases and it is obvious that
$$\Sigma = E\, (x - \mu)(x - \mu)' = E\, xx' - \mu\mu'.$$
It is also obvious that for any $A \in \mathbb{R}^m_n$, $E\, Ax = A\mu$ and $\operatorname{var} Ax = A \Sigma A'$. In particular, $E\, t'x = t'\mu$ and $\operatorname{var} t'x = t'\Sigma t \ge 0$, $\forall t \in \mathbb{R}^n$. Now, the reader should verify that, more generally,
$$\operatorname{cov}(s'x, t'x) = s' \Sigma t$$
and that, considered as a function of $s$ and $t$, the left-hand side defines a (pseudo) inner product. Thus, $\Sigma$ is automatically positive semidefinite, $\Sigma \ge 0$. But by this, we may immediately write $\Sigma = HDH'$ with $H$ orthogonal and $D = \operatorname{diag}(\lambda)$, where the columns of $H$ comprise an orthonormal basis of "eigenvectors" and the components of $\lambda \ge 0$ list the corresponding eigenvalues.
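As a quick numerical check (an illustrative sketch, not from the text; the sampling setup and the matrix $A$ are arbitrary choices), the identity $\operatorname{var} Ax = A \Sigma A'$ holds exactly for sample covariance matrices as well, and a symmetric eigendecomposition recovers $\Sigma = HDH'$ with nonnegative $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 3, 2, 10_000
A = rng.standard_normal((m, n))                                 # a fixed m x n matrix
x = rng.standard_normal((N, n)) @ rng.standard_normal((n, n))   # N draws of a correlated x

Sigma = np.cov(x, rowvar=False)                  # sample version of var x
lhs = np.cov(x @ A.T, rowvar=False)              # sample version of var Ax
print(np.allclose(lhs, A @ Sigma @ A.T))         # True: var Ax = A Sigma A'

lam, H = np.linalg.eigh(Sigma)                   # Sigma = H diag(lam) H', columns of H orthonormal
print(np.allclose(Sigma, H @ np.diag(lam) @ H.T), lam.min() >= 0)
```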