Ch. 6 The Linear Model Under Ideal Conditions

The (multiple) linear model is used to study the relationship between a dependent variable (Y) and several independent variables (X_1, X_2, ..., X_k). That is,
$$
\begin{aligned}
Y &= f(X_1, X_2, \ldots, X_k) + \varepsilon \qquad \text{(assume a linear function)} \\
  &= \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \\
  &= x'\beta + \varepsilon,
\end{aligned}
$$
where Y is the dependent or explained variable, x = [X_1 X_2 ... X_k]' is the vector of independent or explanatory variables, and β = [β_1 β_2 ... β_k]' is a vector of unknown coefficients that we are interested in learning about, either through estimation or through hypothesis testing. The term ε is an unobservable random disturbance.

Suppose we have a sample of size T (allowing for non-random) observations^1 on the scalar dependent variable Y_t and the vector of explanatory variables x_t = (X_{t1}, X_{t2}, ..., X_{tk})', i.e.
$$
Y_t = x_t'\beta + \varepsilon_t, \qquad t = 1, 2, \ldots, T.
$$
In matrix form, this relationship is written as
$$
y =
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_T \end{bmatrix}
=
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
X_{T1} & X_{T2} & \cdots & X_{Tk}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix}
=
\begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_T' \end{bmatrix}
\beta + \varepsilon
= X\beta + \varepsilon,
$$
where y is a T × 1 vector, X is a T × k matrix with rows x_t', and ε is a T × 1 vector with elements ε_t.

^1 Recall from Chapter 2 that we cannot postulate the probability model Φ if the sample is non-random. The probability model must be defined in terms of the sample joint distribution.
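As a quick numerical illustration of the matrix form, the following Python sketch builds y = Xβ + ε and checks that it reproduces the observation-by-observation equations Y_t = x_t'β + ε_t. The sizes, coefficient values, and random design below are purely illustrative assumptions, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes and values; none of these numbers come from the text
    T, k = 5, 2
    X = np.column_stack([np.ones(T), rng.normal(size=T)])  # T x k design matrix with rows x_t'
    beta = np.array([1.0, 2.0])                            # k x 1 coefficient vector
    eps = rng.normal(size=T)                               # T x 1 disturbance vector

    # Matrix form y = X beta + eps is the same as Y_t = x_t' beta + eps_t for each t
    y = X @ beta + eps
    y_rowwise = np.array([x_t @ beta + e_t for x_t, e_t in zip(X, eps)])
    print(np.allclose(y, y_rowwise))  # True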
Our goal is to regard the last equation as a parametric probability and sampling model, and to draw inferences about the unknown β_i's and the parameters of the distribution of ε.

1 The Probability Model: Gauss Linear Model

Assume that ε ∼ N(0, Σ). If X is not stochastic, then by results on "functions of random variables" (an n ⇒ n transformation) we have y ∼ N(Xβ, Σ). That is, we have specified a probability and sampling model for y:

(Probability and Sampling Model)
$$
y \sim N\left(
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
X_{T1} & X_{T2} & \cdots & X_{Tk}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix},\;
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1T} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2T} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{T1} & \sigma_{T2} & \cdots & \sigma_T^2
\end{bmatrix}
\right)
\equiv N(X\beta, \Sigma).
$$
That is, the sample joint density function is
$$
f(y; \theta) = (2\pi)^{-T/2}\,|\Sigma|^{-1/2}\,\exp\left[-\tfrac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)\right],
$$
where θ = (β_1, β_2, ..., β_k, σ_1^2, σ_{12}, ..., σ_T^2)'. It is easily seen that the number of parameters in θ is larger than the sample size T. Therefore, some restrictions must be imposed on the probability and sampling model for the purpose of estimation, as we shall see in what follows.

One kind of restriction on θ is that Σ is a scalar matrix, Σ = σ²I. Then maximizing the likelihood of the sample, f(y; θ), with respect to β is equivalent to minimizing (y − Xβ)'(y − Xβ) (= ε'ε = Σ_t ε_t², the sum of squared errors); this constitutes the foundation of ordinary least squares estimation (a numerical illustration follows the derivation of the OLS estimator below).

To summarize the discussion so far, we have made the following assumptions:

(a) The model y = Xβ + ε is correct; (no problem of model misspecification)
(b) X is nonstochastic; (therefore, regression first arose in the experimental sciences)

(c) E(ε) = 0; (this can easily be satisfied by adding a constant to the regression)

(d) Var(ε) = E(εε') = σ²·I; (the disturbances have the same variance and are not autocorrelated)

(e) Rank(X) = k; (for model identification)

(f) ε is normally distributed.

The above six assumptions are usually called the classical ordinary least squares assumptions, or the ideal conditions.

2 Estimation: Ordinary Least Squares Estimator

2.1 Estimation of β

Let us first consider the ordinary least squares (OLS) estimator, which is the value of β that minimizes the sum of squared errors (or residuals), denoted SSE (recall the principle of estimation in Ch. 3):
$$
SSE(\beta) = (y - X\beta)'(y - X\beta) = \sum_{t=1}^{T} (y_t - x_t'\beta)^2 = y'y - 2y'X\beta + \beta'X'X\beta.
$$
The first-order conditions for a minimum are
$$
\frac{\partial SSE(\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0.
$$
If X'X is nonsingular (which is guaranteed by assumption (e) of the ideal conditions and Ch. 1 Sec. 3.5), this system of k equations in k unknowns can be uniquely solved for the ordinary least squares (OLS) estimator
$$
\hat{\beta} = (X'X)^{-1}X'y = \left[\sum_{t=1}^{T} x_t x_t'\right]^{-1} \sum_{t=1}^{T} x_t y_t. \qquad (1)
$$
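The closed form (1) and its maximum likelihood interpretation under Σ = σ²I (Section 1) can be checked numerically. The Python sketch below is only illustrative: the simulated design, sample size, and parameter values are assumptions, not part of the text. It computes β̂ from the normal equations and verifies that the same vector minimizes both the sum of squared errors and the Gaussian negative log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)

    # Simulated data satisfying the ideal conditions (illustrative values only)
    T, k, sigma2 = 200, 3, 4.0
    beta_true = np.array([1.0, -2.0, 0.5])
    X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
    y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=T)

    # OLS from the normal equations X'X beta = X'y (solve() avoids forming the inverse explicitly)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent summation form of (1): (sum_t x_t x_t')^{-1} sum_t x_t y_t
    beta_sum = np.linalg.solve(sum(np.outer(x, x) for x in X),
                               sum(x * yt for x, yt in zip(X, y)))

    def sse(b):
        r = y - X @ b
        return r @ r                      # (y - Xb)'(y - Xb)

    def neg_loglik(b):
        # Gaussian negative log-likelihood with Sigma = sigma^2 I and sigma^2 held fixed
        return 0.5 * T * np.log(2 * np.pi * sigma2) + sse(b) / (2 * sigma2)

    beta_mle = minimize(neg_loglik, np.zeros(k)).x

    print(np.allclose(beta_ols, beta_sum))             # the two formulas in (1) agree
    print(np.allclose(beta_ols, beta_mle, atol=1e-3))  # ML (with scalar Sigma) coincides with OLS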
To ensure that β̂ is indeed a solution of the minimization problem, we require that
$$
\frac{\partial^2 SSE(\beta)}{\partial \beta \partial \beta'} = 2X'X
$$
be a positive definite matrix. This condition is satisfied by assumption (e) and Ch. 1 Sec. 5.6.1.

Denote by e the T × 1 vector of least squares residuals,
$$
e = y - X\hat{\beta};
$$
then it is obvious that
$$
X'e = X'(y - X\hat{\beta}) = X'y - X'X(X'X)^{-1}X'y = 0, \qquad (2)
$$
i.e., the regressors are orthogonal to the OLS residuals. Therefore, if one of the regressors is a constant term, the sum of the residuals is zero, since the first element of X'e would be
$$
\begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}
= \sum_{t=1}^{T} e_t = 0. \qquad \text{(a scalar)}
$$

2.2 Estimation of σ²

At this point we have arrived at the following notation:
$$
y = X\beta + \varepsilon = X\hat{\beta} + e.
$$
To estimate the variance of ε, σ², a simple and intuitive idea is to use information from the sample residuals e.
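Before developing the algebra of the residuals, the orthogonality condition (2), the zero-sum property when a constant is included, and the decomposition y = Xβ̂ + e can be verified on simulated data. The sketch below uses illustrative values only; nothing here comes from the text.

    import numpy as np

    rng = np.random.default_rng(2)

    T, k = 50, 3                                                      # illustrative sizes
    X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])    # first regressor is a constant
    y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=T)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                      # OLS estimator
    e = y - X @ beta_hat                                              # least squares residuals

    print(np.allclose(X.T @ e, 0.0, atol=1e-8))   # (2): regressors orthogonal to residuals
    print(np.isclose(e.sum(), 0.0, atol=1e-8))    # constant included => residuals sum to zero
    print(np.allclose(y, X @ beta_hat + e))       # decomposition y = X beta_hat + e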
Lemma: The matrix M_X = I − X(X'X)^{-1}X' is symmetric and idempotent. Furthermore, M_X X = 0.

Lemma: e = M_X y = M_X ε. That is, we can interpret M_X as a matrix that produces the vector of least squares residuals in the regression of y on X.

Proof:
$$
e = y - X\hat{\beta} = y - X(X'X)^{-1}X'y = (I - X(X'X)^{-1}X')y = M_X y = M_X X\beta + M_X\varepsilon = M_X\varepsilon.
$$

Using the fact that M_X is symmetric and idempotent, we have

Lemma: e'e = ε'M_X'M_X ε = ε'M_X ε.

Theorem 1: E(e'e) = σ²(T − k).

Proof:
$$
\begin{aligned}
E(e'e) &= E(\varepsilon' M_X \varepsilon) \\
&= E[\mathrm{trace}(\varepsilon' M_X \varepsilon)] \qquad \text{(since $\varepsilon' M_X \varepsilon$ is a scalar, it equals its trace)} \\
&= E[\mathrm{trace}(M_X \varepsilon\varepsilon')] \\
&= \mathrm{trace}[E(M_X \varepsilon\varepsilon')] \qquad \text{(why?)} \\
&= \mathrm{trace}(M_X \sigma^2 I_T) \\
&= \sigma^2\,\mathrm{trace}(M_X),
\end{aligned}
$$
and since trace(M_X) = trace(I_T) − trace(X(X'X)^{-1}X') = T − trace((X'X)^{-1}X'X) = T − k, it follows that E(e'e) = σ²(T − k).
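The algebra above can be illustrated numerically. The following sketch (illustrative sizes and parameter values, not from the text) checks the stated properties of M_X and uses a small Monte Carlo experiment to confirm that the average of e'e across simulated samples is close to σ²(T − k), as Theorem 1 asserts.

    import numpy as np

    rng = np.random.default_rng(3)

    T, k, sigma2 = 30, 4, 2.0                              # illustrative sizes and variance
    X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)      # M_X = I - X(X'X)^{-1}X'

    print(np.allclose(M, M.T))                  # symmetric
    print(np.allclose(M @ M, M))                # idempotent
    print(np.allclose(M @ X, 0.0, atol=1e-8))   # M_X X = 0
    print(np.isclose(np.trace(M), T - k))       # trace(M_X) = T - k

    # Monte Carlo check of Theorem 1: E(e'e) = sigma^2 (T - k)
    beta = np.ones(k)
    draws = 20000
    ee = np.empty(draws)
    for i in range(draws):
        eps = rng.normal(scale=np.sqrt(sigma2), size=T)
        e = M @ (X @ beta + eps)                # e = M_X y = M_X eps
        ee[i] = e @ e
    print(ee.mean(), sigma2 * (T - k))          # the sample mean of e'e should be close to 52.0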