Chapter 2

Estimation

2.1 Example

Let's start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are

1. X1 - the weight of the car
2. X2 - the horse power
3. X3 - the number of cylinders

X3 is discrete but that's OK. Using country of origin, say, as a predictor would not be possible within the current development (we will see how to do this later in the course). Typically the data will be available in the form of an array like this

    y1   x11   x12   x13
    y2   x21   x22   x23
    ...  ...   ...   ...
    yn   xn1   xn2   xn3

where n is the number of observations or cases in the dataset.

2.2 Linear Model

One very general form for the model would be

    Y = f(X1, X2, X3) + ε

where f is some unknown function and ε is the error in this representation, which is additive in this instance. Since we usually don't have enough data to try to estimate f directly, we usually have to assume that it has some more restricted form, perhaps linear as in

    Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε

where βi, i = 0, 1, 2, 3, are unknown parameters. β0 is called the intercept term. Thus the problem is reduced to the estimation of four values rather than the complicated infinite-dimensional f.

In a linear model the parameters enter linearly; the predictors themselves do not have to be. For example,

    Y = β0 + β1 X1 + β2 log X2 + ε

is linear, but

    Y = β0 + β1 X1^β2 + ε

is not. Some relationships can be transformed to linearity; for example, y = β0 x^β1 ε can be linearized by taking logs. Linear models seem rather restrictive, but because the predictors can be transformed and combined in any way, they are actually very flexible. Truly non-linear models are rarely absolutely necessary and most often arise from a theory about the relationships between the variables rather than from an empirical investigation.
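To make the setup concrete, here is a minimal Python/numpy sketch (not from the text) that simulates data of this form; the sample size, parameter values and error scale are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # hypothetical number of cars

# Hypothetical predictors: weight, horse power, number of cylinders
x1 = rng.uniform(1500, 4500, n)          # X1 - weight
x2 = rng.uniform(50, 250, n)             # X2 - horse power
x3 = rng.choice([4.0, 6.0, 8.0], n)      # X3 - no. of cylinders (discrete is OK)

# Made-up "true" parameters and an additive error
beta = np.array([50.0, -0.005, -0.03, -1.0])   # beta0, beta1, beta2, beta3
eps = rng.normal(0.0, 2.0, n)

# Y = beta0 + beta1*X1 + beta2*X2 + beta3*X3 + eps
y = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3 + eps

# The data array: each of the n rows is one case (y_i, x_i1, x_i2, x_i3)
data = np.column_stack([y, x1, x2, x3])
print(data.shape)                        # (50, 4)

# A transformed predictor, e.g. log(x2), could be used in the same way
# and the model would still be linear in the parameters.
```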
2.3 Matrix Representation

Given the actual data, we may write

    yi = β0 + β1 x1i + β2 x2i + β3 x3i + εi,    i = 1, ..., n

but the use of subscripts becomes inconvenient and conceptually obscure. We will find it simpler, both notationally and theoretically, to use a matrix/vector representation. The regression equation is written as

    y = Xβ + ε

where y = (y1, ..., yn)^T, ε = (ε1, ..., εn)^T, β = (β0, ..., β3)^T and

\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2} & x_{n3}
\end{pmatrix}
\]

The column of ones incorporates the intercept term. A couple of examples of using this notation are the simple no-predictor, mean-only model y = µ + ε,

\[
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \mu
+ \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
\]

We can assume that Eε = 0 since, if this were not so, we could simply absorb the non-zero expectation for the error into the mean µ to get a zero expectation. For the two-sample problem, with a treatment group having responses y1, ..., ym with mean µy and a control group having responses z1, ..., zn with mean µz, we have

\[
\begin{pmatrix} y_1 \\ \vdots \\ y_m \\ z_1 \\ \vdots \\ z_n \end{pmatrix}
= \begin{pmatrix}
1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\
0 & 1 \\ \vdots & \vdots \\ 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu_y \\ \mu_z \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{m+n} \end{pmatrix}
\]
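As a small illustration (not from the text), the following numpy sketch builds the design matrices for these two examples; the group sizes are arbitrary.

```python
import numpy as np

# Mean-only model y = mu + eps: the design matrix is a single column of ones
n = 5
X_mean = np.ones((n, 1))

# Two-sample problem: m treatment responses with mean mu_y and
# n control responses with mean mu_z, so beta = (mu_y, mu_z)^T
m, n = 3, 4
X_two = np.zeros((m + n, 2))
X_two[:m, 0] = 1.0        # rows for the treatment group
X_two[m:, 1] = 1.0        # rows for the control group
print(X_two)
```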
2.4 Estimating β

We have the regression equation y = Xβ + ε: what estimate of β would best separate the systematic component Xβ from the random component ε? Geometrically speaking, y ∈ ℝ^n while β ∈ ℝ^p, where p is the number of parameters (if we include the intercept then p is the number of predictors plus one).

[Figure 2.1: Geometric representation of the estimation of β. The data vector y, which lies in n dimensions, is projected orthogonally onto the p-dimensional model space spanned by X. The fit is represented by the projection ŷ = Xβ̂, and the difference between the fit and the data is the residual vector ε̂, which lies in n - p dimensions.]

The problem is to find β̂ such that Xβ̂ is close to y. The best choice of β̂ is apparent in the geometrical representation shown in Figure 2.1.

β̂ is in some sense the best estimate of β within the model space. The response predicted by the model is ŷ = Xβ̂ or Hy, where H is an orthogonal projection matrix. The difference between the actual response and the predicted response is denoted by ε̂: the residuals.

The conceptual purpose of the model is to represent, as accurately as possible, something complex, y, which is n-dimensional, in terms of something much simpler, the model, which is p-dimensional. Thus if our model is successful, the structure in the data should be captured in those p dimensions, leaving just random variation in the residuals, which lie in an (n - p)-dimensional space. We have

    Data          =  Systematic Structure  +  Random Variation
    n dimensions  =  p dimensions          +  (n - p) dimensions

2.5 Least squares estimation

The estimation of β can be considered from a non-geometric point of view. We might define the best estimate of β as that which minimizes the sum of the squared errors, ε^T ε. That is to say that the least squares estimate of β, called β̂, minimizes

    Σ εi² = ε^T ε = (y - Xβ)^T (y - Xβ)

Expanding this out, we get

    y^T y - 2β^T X^T y + β^T X^T X β

Differentiating with respect to β gives -2X^T y + 2X^T X β; setting this to zero, we find that β̂ satisfies

    X^T X β̂ = X^T y

These are called the normal equations. We can derive the same result using the geometric approach. Now, provided X^T X is invertible,

    β̂ = (X^T X)^{-1} X^T y
    Xβ̂ = X (X^T X)^{-1} X^T y = Hy

H = X(X^T X)^{-1}X^T is called the "hat matrix" and is the orthogonal projection of y onto the space spanned by X. H is useful for theoretical manipulations, but you usually don't want to compute it explicitly, as it is an n × n matrix.

- Predicted values: ŷ = Hy = Xβ̂.
- Residuals: ε̂ = y - Xβ̂ = y - ŷ = (I - H)y.
- Residual sum of squares: ε̂^T ε̂ = y^T(I - H)(I - H)y = y^T(I - H)y.

Later we will show that the least squares estimate is the best possible estimate of β when the errors ε are uncorrelated and have equal variance, i.e. var ε = σ²I.
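As a sketch of these formulae in code (on simulated data, with all values invented for illustration), the least squares estimate can be obtained by solving the normal equations directly or, preferably, by a least squares routine that never forms (X^T X)^{-1} or the n × n matrix H explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4                                # illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # made-up parameters
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# Solve the normal equations X^T X beta = X^T y directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically preferable): a least squares solver
beta_hat_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat          # fitted values, y_hat = H y, without forming H
resid = y - y_hat             # residuals, eps_hat = (I - H) y
rss = resid @ resid           # residual sum of squares
print(beta_hat, rss)
```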
2.6 Examples of calculating β̂

1. When y = µ + ε, X = 1 (a column of ones) and β = µ, so X^T X = 1^T 1 = n and

       β̂ = (X^T X)^{-1} X^T y = (1/n) 1^T y = ȳ

2. Simple linear regression (one predictor):

       yi = α + β xi + εi

   \[
   \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
   = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
   \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
   + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
   \]

   We can now apply the formula, but a simpler approach is to rewrite the equation as

       yi = α' + β(xi - x̄) + εi,    where α' = α + βx̄,

   so that now

   \[
   X = \begin{pmatrix} 1 & x_1 - \bar{x} \\ \vdots & \vdots \\ 1 & x_n - \bar{x} \end{pmatrix},
   \qquad
   X^T X = \begin{pmatrix} n & 0 \\ 0 & \sum_{i=1}^n (x_i - \bar{x})^2 \end{pmatrix}
   \]

   Now work through the rest of the calculation to reconstruct the familiar estimates, i.e.

       β̂ = Σ (xi - x̄) yi / Σ (xi - x̄)²

In higher dimensions it is usually not possible to find such explicit formulae for the parameter estimates unless X^T X happens to have a simple form.

2.7 Why is β̂ a good estimate?

1. It results from an orthogonal projection onto the model space. It makes sense geometrically.
2. If the errors are independent and identically normally distributed, it is the maximum likelihood estimator. Loosely put, the maximum likelihood estimate is the value of β that maximizes the probability of the data that was observed.
3. The Gauss-Markov theorem states that it is the best linear unbiased estimate (BLUE).
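The two worked examples are easy to check numerically; the sketch below (with made-up data) confirms that the general formula reproduces the sample mean in the first case and the centred-slope formula in the second.

```python
import numpy as np

rng = np.random.default_rng(2)

# Example 1: mean-only model, beta_hat is just the sample mean
y = rng.normal(10.0, 3.0, size=25)
X = np.ones((25, 1))
print(np.linalg.solve(X.T @ X, X.T @ y)[0], y.mean())    # the two agree

# Example 2: simple linear regression via the centred formula
x = rng.uniform(0.0, 10.0, size=25)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=25)        # made-up alpha, beta
slope = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)

X = np.column_stack([np.ones_like(x), x])
slope_ls = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(slope, slope_ls)                                   # the two agree
```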
2.8 Gauss-Markov Theorem

First we need to understand the concept of an estimable function. A linear combination of the parameters ψ = c^T β is estimable if and only if there exists a linear combination a^T y such that

    E a^T y = c^T β    for all β

Estimable functions include predictions of future observations, which explains why they are worth considering. If X is of full rank (which it usually is for observational data), then all linear combinations are estimable.

Gauss-Markov theorem

Suppose Eε = 0 and var ε = σ²I. Suppose also that the structural part of the model, EY = Xβ, is correct. Let ψ = c^T β be an estimable function; then in the class of all unbiased linear estimates of ψ, ψ̂ = c^T β̂ has the minimum variance and is unique.

Proof:

We start with a preliminary calculation. Suppose a^T y is some unbiased estimate of c^T β, so that

    E a^T y = c^T β    for all β
    a^T Xβ = c^T β     for all β

which means that a^T X = c^T. This implies that c must be in the range space of X^T, which in turn implies that c is also in the range space of X^T X, which means there exists a λ such that

    c = X^T X λ
    c^T β̂ = λ^T X^T X β̂ = λ^T X^T y

Now we can show that the least squares estimator has the minimum variance: pick an arbitrary estimable function a^T y and compute its variance:

    var(a^T y) = var(a^T y - c^T β̂ + c^T β̂)
               = var(a^T y - λ^T X^T y + c^T β̂)
               = var(a^T y - λ^T X^T y) + var(c^T β̂) + 2 cov(a^T y - λ^T X^T y, λ^T X^T y)

but

    cov(a^T y - λ^T X^T y, λ^T X^T y) = (a^T - λ^T X^T) σ²I Xλ
                                      = (a^T X - λ^T X^T X) σ² λ
                                      = (c^T - c^T) σ² λ = 0

so

    var(a^T y) = var(a^T y - λ^T X^T y) + var(c^T β̂)

Now, since variances cannot be negative, we see that

    var(a^T y) ≥ var(c^T β̂)

In other words, c^T β̂ has minimum variance. It now remains to show that it is unique. There will be equality in the above relation if var(a^T y - λ^T X^T y) = 0, which would require that a^T - λ^T X^T = 0, which means that a^T y = λ^T X^T y = c^T β̂. So equality occurs only if a^T y = c^T β̂, and hence the estimator is unique.
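As an illustration of the theorem (not part of the text), the sketch below compares, on a made-up fixed design, the variance of the least squares estimate of the slope with that of another linear unbiased estimate, the two-point slope. Both are linear in y and unbiased, but the least squares weights give the smaller variance, as the Gauss-Markov theorem guarantees.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = np.sort(rng.uniform(0.0, 10.0, n))
X = np.column_stack([np.ones(n), x])        # fixed, made-up design
c = np.array([0.0, 1.0])                    # estimable function: the slope
sigma2 = 1.0                                # error variance sigma^2

# Weights giving the least squares estimate: c^T beta_hat = a_ls^T y
a_ls = X @ np.linalg.solve(X.T @ X, c)

# A competing linear unbiased estimate of the slope:
# the two-point slope (y_n - y_1) / (x_n - x_1)
a_alt = np.zeros(n)
a_alt[0] = -1.0 / (x[-1] - x[0])
a_alt[-1] = 1.0 / (x[-1] - x[0])

# Both sets of weights satisfy a^T X = c^T, so both estimates are unbiased
print(a_ls @ X, a_alt @ X)                  # both are (approximately) (0, 1)

# Under var eps = sigma^2 I, var(a^T y) = sigma^2 * a^T a;
# the least squares weights give the smaller value
print(sigma2 * (a_ls @ a_ls), sigma2 * (a_alt @ a_alt))
```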