2 Point Estimation

(Point) estimation refers to our attempt to give a numerical value to $\theta$. Let $(S, \mathcal{F}, P(\cdot))$ be the probability space of reference, with $X$ a r.v. defined on this space. The following statistical model is postulated:

(i) $\Phi = \{f(x; \theta),\ \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}$;

(ii) $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ is a random sample from $f(x; \theta)$.

Estimation in the context of this statistical model takes the form of constructing a mapping $h(\cdot) : \mathcal{X} \rightarrow \Theta$, where $\mathcal{X}$ is the observation space and $h(\cdot)$ is a Borel function. The composite function (a statistic) $\hat{\theta} \equiv h(\mathbf{x}) : S \rightarrow \Theta$ is called an \textit{estimator}, and its value $h(\mathbf{x})$, $\mathbf{x} \in \mathcal{X}$, an \textit{estimate} of $\theta$. It is important to distinguish between the two because the former is a random variable and the latter is a real number.

Example: Let $f(x; \theta) = (1/\sqrt{2\pi}) \exp\{-\tfrac{1}{2}(x - \theta)^2\}$, $\theta \in \mathbb{R}$, and let $\mathbf{x}$ be a random sample from $f(x; \theta)$. Then $\mathcal{X} = \mathbb{R}^n$ and the following functions define estimators of $\theta$:

1. $\hat{\theta}_1 = \frac{1}{n} \sum_{i=1}^{n} X_i$;

2. $\hat{\theta}_2 = \frac{1}{k} \sum_{i=1}^{k} X_i$, $k = 1, 2, \ldots, n-1$;

3. $\hat{\theta}_3 = \frac{1}{2}(X_1 + X_n)$.

It is obvious that we can construct infinitely many such estimators. However, constructing "good" estimators is not so obvious. It is clear that we need some criteria to choose between these estimators; in other words, we need to formalize what we mean by a "good" estimator.
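To see why such criteria are needed, it helps to look at the sampling distributions of the three estimators in the example above; the following worked calculation uses only the stated assumption that the $X_i$ are independent $N(\theta, 1)$ variables. Since each estimator is a linear combination of independent normal r.v.'s,
\[
\hat{\theta}_1 \sim N\!\left(\theta, \tfrac{1}{n}\right), \qquad
\hat{\theta}_2 \sim N\!\left(\theta, \tfrac{1}{k}\right), \qquad
\hat{\theta}_3 \sim N\!\left(\theta, \tfrac{1}{2}\right).
\]
All three distributions are centred at $\theta$, but they differ in how tightly they concentrate around it, and this is exactly the kind of difference the criteria below are designed to capture.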
2.1 Finite sample properties of estimators

2.1.1 Unbiasedness

An estimator is constructed with the sole aim of providing us with the "most representative value" of $\theta$ in the parameter space $\Theta$, based on the available information in the form of the statistical model. Given that the estimator $\hat{\theta} = h(\mathbf{x})$ is a r.v. (being a Borel function of the random vector $\mathbf{x}$), any statement about what we mean by a "most representative value" must be in terms of the distribution of $\hat{\theta}$, say $f(\hat{\theta})$. The obvious property to require of a "good" estimator $\hat{\theta}$ of $\theta$ is that $f(\hat{\theta})$ be centred around $\theta$.

Definition 7: An estimator $\hat{\theta}$ of $\theta$ is said to be an \textit{unbiased} estimator of $\theta$ if
\[
E(\hat{\theta}) = \int_{-\infty}^{\infty} \hat{\theta}\, f(\hat{\theta})\, d\hat{\theta} = \theta.
\]
That is, the distribution of $\hat{\theta}$ has mean equal to the unknown parameter to be estimated.

Note that an alternative, but equivalent, way to define $E(\hat{\theta})$ is
\[
E(\hat{\theta}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(\mathbf{x})\, f(\mathbf{x}; \theta)\, d\mathbf{x},
\]
where $f(\mathbf{x}; \theta) = f(x_1, x_2, \ldots, x_n; \theta)$ is the distribution of the sample $\mathbf{x}$.

It must be remembered that unbiasedness is a property based on the distribution of $\hat{\theta}$. This distribution is often called the \textit{sampling distribution} of $\hat{\theta}$, in order to distinguish it from any other distribution of a function of r.v.'s.
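As a worked illustration of Definition 7, all three estimators in the example of Section 2 are unbiased, since $E(X_i) = \theta$ for each $i$:
\[
E(\hat{\theta}_1) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \theta, \qquad
E(\hat{\theta}_2) = \frac{1}{k}\sum_{i=1}^{k} E(X_i) = \theta, \qquad
E(\hat{\theta}_3) = \tfrac{1}{2}\big[E(X_1) + E(X_n)\big] = \theta.
\]
Unbiasedness alone therefore cannot discriminate between them, which is precisely the difficulty taken up next.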
2.1.2 Efficiency

Although unbiasedness seems at first sight to be a highly desirable property, it turns out that in most situations there are too many unbiased estimators for this property to be used as the sole criterion for judging estimators. The question which naturally arises is: "How can we choose among unbiased estimators?" Given that the variance is a measure of dispersion, intuition suggests that the estimator with the smallest variance is in a sense better, because its distribution is more "concentrated" around $\theta$.

Definition 8: An unbiased estimator $\hat{\theta}$ of $\theta$ is said to be \textit{relatively more efficient} than some other unbiased estimator $\tilde{\theta}$ if $Var(\hat{\theta}) < Var(\tilde{\theta})$.

In the case of biased estimators, relative efficiency can be defined in terms of the mean square error (MSE), which takes the form
\[
MSE(\hat{\theta}, \theta_0) = E(\hat{\theta} - \theta_0)^2
= E\big(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta_0\big)^2
= Var(\hat{\theta}) + \big[Bias(\hat{\theta}, \theta_0)\big]^2,
\]
the cross-product term being zero; $Bias(\hat{\theta}, \theta_0) = E(\hat{\theta}) - \theta_0$ is the bias of $\hat{\theta}$ relative to the value $\theta_0$. For any two estimators $\hat{\theta}$ and $\tilde{\theta}$ of $\theta$, if $MSE(\hat{\theta}, \theta) \leq MSE(\tilde{\theta}, \theta)$ for all $\theta \in \Theta$, with strict inequality holding for some $\theta \in \Theta$, then $\tilde{\theta}$ is said to be \textit{inadmissible}.
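As a worked illustration of these definitions, return to the normal example. All three estimators are unbiased, so $MSE(\hat{\theta}_j, \theta) = Var(\hat{\theta}_j)$, with
\[
Var(\hat{\theta}_1) = \frac{1}{n}, \qquad
Var(\hat{\theta}_2) = \frac{1}{k}, \qquad
Var(\hat{\theta}_3) = \frac{1}{2}.
\]
For $n > 2$, $\hat{\theta}_1$ is relatively more efficient than both $\hat{\theta}_2$ (for any $k < n$) and $\hat{\theta}_3$; moreover, since these MSEs do not depend on $\theta$, the inequality holds for every $\theta \in \Theta$, so $\hat{\theta}_2$ and $\hat{\theta}_3$ are inadmissible.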
Using the concept of relative efficiency we can compare whatever estimators we happen to consider. This is, however, not very satisfactory, since there might be much better estimators in terms of MSE of which we know nothing. In order to avoid choosing the better of two inefficient estimators, we need some absolute measure of efficiency. Such a measure is provided by the Cramér–Rao lower bound.

Definition 9: The quantity
\[
CR(\theta) = \frac{\left[1 + \frac{dB(\theta)}{d\theta}\right]^2}
{E\left[\left(\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta}\right)^2\right]}
\]
is the \textit{Cramér–Rao lower bound}, where $f(\mathbf{x}; \theta)$ is the distribution of the sample and $B(\theta)$ the bias. It can be shown that for any estimator $\theta^{*}$ of $\theta$,
\[
MSE(\theta^{*}, \theta) \geq CR(\theta),
\]
under the following regularity conditions on $\Phi$:

(a). the set $A = \{\mathbf{x} : f(\mathbf{x}; \theta) > 0\}$ does not depend on $\theta$;

(b). for each $\theta \in \Theta$ the derivatives $\partial^{i} \log f(\mathbf{x}; \theta)/\partial \theta^{i}$, $i = 1, 2, 3$, exist for all $\mathbf{x} \in \mathcal{X}$;

(c). $0 < E\left[\left(\frac{\partial}{\partial \theta} \log f(\mathbf{x}; \theta)\right)^2\right] < \infty$ for all $\theta \in \Theta$.

In the case of unbiased estimators the inequality takes the form
\[
Var(\theta^{*}) \geq \left\{ E\left[\left(\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta}\right)^2\right] \right\}^{-1};
\]
the inverse of the lower bound is called \textit{Fisher's information number} and is denoted by $I_n(\theta)$.\footnote{It must be borne in mind that the information matrix is a function of the sample size $n$.}

Definition 10 (multi-parameter Cramér–Rao theorem): An unbiased estimator $\hat{\theta}$ of $\theta$ is said to be \textit{fully efficient} if
\[
Var(\hat{\theta})
= \left\{ E\left[\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta}\,
\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta'}\right] \right\}^{-1}
= \left\{ E\left[-\frac{\partial^2 \log f(\mathbf{x}; \theta)}{\partial \theta\, \partial \theta'}\right] \right\}^{-1},
\]
where
\[
I_n(\theta)
= E\left[\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta}\,
\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta'}\right]
= E\left[-\frac{\partial^2 \log f(\mathbf{x}; \theta)}{\partial \theta\, \partial \theta'}\right]
\]
is called the \textit{sample information matrix}.
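Before turning to the proof, it may help to see the bound at work in the running normal example. There, $\log f(\mathbf{x}; \theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2$, so that
\[
\frac{\partial \log f(\mathbf{x}; \theta)}{\partial \theta} = \sum_{i=1}^{n}(X_i - \theta), \qquad
-\frac{\partial^2 \log f(\mathbf{x}; \theta)}{\partial \theta^2} = n, \qquad
I_n(\theta) = n,
\]
and the bound for unbiased estimators is $CR(\theta) = 1/n$. Since $Var(\hat{\theta}_1) = 1/n$, the sample mean attains the bound and is therefore fully efficient, while $\hat{\theta}_2$ and, for $n > 2$, $\hat{\theta}_3$ do not attain it.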
Proof (for the case that $\theta$ is $1 \times 1$): Given that $f(x_1, x_2, \ldots, x_n; \theta)$ is the joint density function of the sample, it possesses the property that
\[
\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n; \theta)\, dx_1 \cdots dx_n = 1,
\]
or, more compactly,
\[
\int_{-\infty}^{\infty} f(\mathbf{x}; \theta)\, d\mathbf{x} = 1.
\]
Assume that the domain of $\mathbf{x}$ is independent of $\theta$ (condition (a), which permits straightforward differentiation inside the integral sign) and that the derivative $\partial f(\cdot)/\partial \theta$ exists. Differentiating the above equation with respect to $\theta$ then gives
\[
\int_{-\infty}^{\infty} \frac{\partial f(\mathbf{x}; \theta)}{\partial \theta}\, d\mathbf{x} = 0. \tag{1}
\]
This equation can be re-expressed as
\[
\int_{-\infty}^{\infty} \frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\, f(\mathbf{x}; \theta)\, d\mathbf{x} = 0
\qquad \left(\text{since } \frac{d}{dt} \ln f(t) = \frac{f'(t)}{f(t)}\right).
\]
Therefore, it simply states that
\[
E\left[\frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\right] = 0,
\]
i.e., the expectation of the derivative of the natural logarithm of the likelihood function of a random sample from a regular density is zero.

Likewise, differentiating (1) w.r.t. $\theta$ again gives
\[
0 = \int_{-\infty}^{\infty} \frac{\partial^2 \ln f(\mathbf{x}; \theta)}{\partial \theta^2}\, f(\mathbf{x}; \theta)\, d\mathbf{x}
+ \int_{-\infty}^{\infty} \frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\,
\frac{\partial f(\mathbf{x}; \theta)}{\partial \theta}\, d\mathbf{x}
= \int_{-\infty}^{\infty} \frac{\partial^2 \ln f(\mathbf{x}; \theta)}{\partial \theta^2}\, f(\mathbf{x}; \theta)\, d\mathbf{x}
+ \int_{-\infty}^{\infty} \left(\frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\right)^2 f(\mathbf{x}; \theta)\, d\mathbf{x}.
\]
That is,
\[
Var\left(\frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\right)
= -E\left[\frac{\partial^2 \ln f(\mathbf{x}; \theta)}{\partial \theta^2}\right].
\]
Now consider the estimator $h(\mathbf{x})$ of $\theta$, whose expectation is
\[
E(h(\mathbf{x})) = \int h(\mathbf{x})\, f(\mathbf{x}; \theta)\, d\mathbf{x}. \tag{2}
\]
Differentiating (2) w.r.t. $\theta$, we obtain
\[
\frac{\partial E(h(\mathbf{x}))}{\partial \theta}
= \int h(\mathbf{x})\, \frac{\partial f(\mathbf{x}; \theta)}{\partial \theta}\, d\mathbf{x}
= \int h(\mathbf{x})\, \frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\, f(\mathbf{x}; \theta)\, d\mathbf{x}
= cov\left(h(\mathbf{x}), \frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\right)
\qquad \left(\text{since } E\left[\frac{\partial \ln f(\mathbf{x}; \theta)}{\partial \theta}\right] = 0\right).
\]
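These two identities are easy to verify directly in the running normal example, which may make the proof more concrete. The score is $\partial \ln f(\mathbf{x}; \theta)/\partial \theta = \sum_{i=1}^{n}(X_i - \theta)$, so
\[
E\left[\sum_{i=1}^{n}(X_i - \theta)\right] = 0,
\]
as required; and taking $h(\mathbf{x}) = \hat{\theta}_1 = \bar{X}_n$, for which $E(\bar{X}_n) = \theta$,
\[
\frac{\partial E(\bar{X}_n)}{\partial \theta} = 1
= cov\left(\bar{X}_n,\ \sum_{i=1}^{n}(X_i - \theta)\right),
\]
since $cov\big(\tfrac{1}{n}\sum_i X_i,\ \sum_j X_j\big) = \tfrac{1}{n}\, Var\big(\sum_i X_i\big) = 1$.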