Implications

The Gauss-Markov theorem shows that the least squares estimate β̂ is a good choice, but if the errors are correlated or have unequal variance, there will be better estimators. Even if the errors are well-behaved but non-normal, then non-linear or biased estimates may work better in some sense. So this theorem does not tell one to use least squares all the time; it just strongly suggests it unless there is some strong reason to do otherwise.

Situations where estimators other than ordinary least squares should be considered are

1. When the errors are correlated or have unequal variance, generalized least squares should be used.

2. When the error distribution is long-tailed, then robust estimates might be used. Robust estimates are typically not linear in y.

3. When the predictors are highly correlated (collinear), then biased estimators such as ridge regression might be preferable.

2.9 Mean and Variance of β̂

Now $\hat\beta = (X^TX)^{-1}X^Ty$, so

    Mean: $E\hat\beta = (X^TX)^{-1}X^TX\beta = \beta$  (unbiased)

    Variance: $\operatorname{var}\hat\beta = (X^TX)^{-1}X^T \sigma^2 I \, X (X^TX)^{-1} = (X^TX)^{-1}\sigma^2$

Note that since β̂ is a vector, $(X^TX)^{-1}\sigma^2$ is a variance-covariance matrix. Sometimes you want the standard error for a particular component, which can be picked out as in $se(\hat\beta_i) = \sqrt{\big((X^TX)^{-1}\big)_{ii}}\,\hat\sigma$.

2.10 Estimating σ²

Recall that the residual sum of squares was $\hat\varepsilon^T\hat\varepsilon = y^T(I - H)y$. Now after some calculation, one can show that $E(\hat\varepsilon^T\hat\varepsilon) = \sigma^2(n - p)$, which shows that

    $\hat\sigma^2 = \frac{\hat\varepsilon^T\hat\varepsilon}{n - p}$

is an unbiased estimate of σ². n − p is the degrees of freedom of the model. Actually, a theorem parallel to the Gauss-Markov theorem shows that it has the minimum variance among all quadratic unbiased estimators of σ².
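To make these results concrete, here is a small simulation sketch. It is not part of the text; the design matrix, true β and σ below are invented purely for illustration. It checks numerically that β̂ is unbiased with variance $(X^TX)^{-1}\sigma^2$ and that σ̂² = RSS/(n − p) is unbiased for σ²:

set.seed(1)
n <- 50; p <- 3; sigma <- 2
X <- cbind(1, runif(n), runif(n))            # invented n x p design matrix with an intercept column
beta <- c(1, 2, -1)                          # invented true coefficients
XtXi <- solve(t(X) %*% X)
nreps <- 2000
bhat <- matrix(0, nreps, p)
s2hat <- numeric(nreps)
for (i in 1:nreps) {
  y <- X %*% beta + rnorm(n, sd = sigma)     # generate a response under the model
  b <- XtXi %*% t(X) %*% y                   # least squares estimate (X'X)^{-1} X'y
  bhat[i, ] <- b
  s2hat[i] <- sum((y - X %*% b)^2)/(n - p)   # RSS/(n - p)
}
colMeans(bhat)       # close to beta = (1, 2, -1): unbiasedness of beta-hat
mean(s2hat)          # close to sigma^2 = 4: unbiasedness of the variance estimate
var(bhat)            # close to XtXi * sigma^2
XtXi * sigma^2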
2.11 Goodness of Fit

How well does the model fit the data? One measure is R², the so-called coefficient of determination or percentage of variance explained:

    $R^2 = 1 - \frac{\sum_i (\hat y_i - y_i)^2}{\sum_i (y_i - \bar y)^2} = 1 - \frac{RSS}{\text{Total SS (corrected for mean)}}$

[Figure 2.2: Variation in the response y when x is known is denoted by dotted arrows while variation in y when x is unknown is shown with the solid arrows.]

The range is 0 ≤ R² ≤ 1, with values closer to 1 indicating better fits. For simple linear regression, R² = r², where r is the correlation between x and y. An equivalent definition is

    $R^2 = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2}$

The graphical intuition behind R² is shown in Figure 2.2. Suppose you want to predict y. If you don't know x, then your best prediction is ȳ, but the variability in this prediction is high. If you do know x, then your prediction will be given by the regression fit. This prediction will be less variable provided there is some relationship between x and y. R² is one minus the ratio of the sum of squares for these two predictions. Thus for perfect predictions the ratio will be zero and R² will be one.

Warning: R² as defined here doesn't make any sense if you do not have an intercept in your model. This is because the denominator in the definition of R² has a null model with an intercept in mind when the sum of squares is calculated. Alternative definitions of R² are possible when there is no intercept, but the same graphical intuition is not available and the R²'s obtained should not be compared to those for models with an intercept. Beware of high R²'s reported from models without an intercept.

What is a good value of R²? It depends on the area of application. In the biological and social sciences, variables tend to be more weakly correlated and there is a lot of noise. We'd expect lower values for R² in these areas: a value of 0.6 might be considered good. In physics and engineering, where most data come from closely controlled experiments, we expect to get much higher R²'s, and a value of 0.6 would be considered low. Of course, I generalize excessively here, so some experience with the particular area is necessary for you to judge your R²'s well.
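The equivalence between these descriptions of R² is easy to verify numerically. The following sketch is not from the text and uses invented simulated data; it shows that, for a simple linear regression with an intercept, the R² reported by lm() equals the squared correlation between x and y and equals 1 − RSS/TSS:

x <- runif(40)                                # invented data for illustration
y <- 1 + 2*x + rnorm(40, sd=0.3)
fit <- lm(y ~ x)
summary(fit)$r.squared                        # R^2 as reported by lm()
cor(x, y)^2                                   # r^2: the same value
1 - sum(resid(fit)^2)/sum((y - mean(y))^2)    # 1 - RSS/TSS: the same value again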
An alternative measure of fit is σ̂. This quantity is directly related to the standard errors of the estimates of β and of predictions. The advantage is that σ̂ is measured in the units of the response and so may be directly interpreted in the context of the particular dataset. This may also be a disadvantage in that one must judge the practical significance of this measure for the data at hand, whereas R², being unitless, is easy to understand.

2.12 Example

Now let's look at an example concerning the number of species of tortoise on the various Galapagos Islands. There are 30 cases (islands) and 7 variables in the dataset. We start by reading the data into R and examining it:

> data(gala)
> gala
          Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra         58       23 25.09       346     0.6   0.6     1.84
Bartolome      31       21  1.24       109     0.6  26.3   572.33
--- cases deleted ---
Tortuga        16        8  1.24       186     6.8  50.9    17.95
Wolf           21       12  2.85       253    34.1 254.7     2.33

The variables are

Species    The number of species of tortoise found on the island
Endemics   The number of endemic species
Area       The area of the island (km²)
Elevation  The highest elevation of the island (m)
Nearest    The distance from the nearest island (km)
Scruz      The distance from Santa Cruz island (km)
Adjacent   The area of the adjacent island (km²)

The data were presented by Johnson and Raven (1973) and also appear in Weisberg (1985). I have filled in some missing values for simplicity (see Chapter 14 for how this can be done). Fitting a linear model in R is done using the lm() command. Notice the syntax for specifying the predictors in the model. This is the so-called Wilkinson-Rogers notation. In this case, since all the variables are in the gala data frame, we must use the data= argument:

> gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
             data=gala)
> summary(gfit)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala)

Residuals:
    Min      1Q  Median      3Q     Max
-111.68  -34.90   -7.86   33.46  182.58
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.06822   19.15420    0.37   0.7154
Area         -0.02394    0.02242   -1.07   0.2963
Elevation     0.31946    0.05366    5.95  3.8e-06
Nearest       0.00914    1.05414    0.01   0.9932
Scruz        -0.24052    0.21540   -1.12   0.2752
Adjacent     -0.07480    0.01770   -4.23   0.0003

Residual standard error: 61 on 24 degrees of freedom
Multiple R-Squared: 0.766,     Adjusted R-squared: 0.717
F-statistic: 15.7 on 5 and 24 degrees of freedom,   p-value: 6.84e-07

We can identify several useful quantities in this output. Other statistical packages tend to produce output quite similar to this. One useful feature of R is that it is possible to directly calculate quantities of interest. Of course, it is not necessary here because the lm() function does the job, but it is very useful when the statistic you want is not part of the pre-packaged functions.

First we make the X-matrix:

> x <- cbind(1, gala[,-c(1,2)])

and here's the response y:

> y <- gala$Species

Now let's construct $X^TX$: t() does transpose and %*% does matrix multiplication:

> t(x) %*% x
Error: %*% requires numeric matrix/vector arguments

This gives a somewhat cryptic error. The problem is that matrix arithmetic can only be done with numeric values, but x here derives from the data frame type. Data frames are allowed to contain character variables, which would disallow matrix arithmetic. We need to force x into the matrix form:

> x <- as.matrix(x)
> t(x) %*% x

Inverses can be taken using the solve() command:

> xtxi <- solve(t(x) %*% x)
> xtxi

A somewhat more direct way to get $(X^TX)^{-1}$ is as follows:

> gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
             data=gala)
> gs <- summary(gfit)
> gs$cov.unscaled
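As an aside (not in the text), the design matrix that lm() actually used, including the intercept column, can be extracted with the standard model.matrix() function rather than built by hand; the inverse computed from it should agree with xtxi and gs$cov.unscaled apart from row and column names:

> X <- model.matrix(gfit)     # design matrix used by lm(), with intercept column
> solve(t(X) %*% X)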
The names() command is the way to see the components of an R object; you can see that there are other useful quantities that are directly available:

> names(gs)
> names(gfit)

In particular, the fitted (or predicted) values and residuals are

> gfit$fit
> gfit$res

We can get β̂ directly:

> xtxi %*% t(x) %*% y
          [,1]
[1,]  7.068221
[2,] -0.023938
[3,]  0.319465
[4,]  0.009144
[5,] -0.240524
[6,] -0.074805

or in a computationally more efficient and stable manner:

> solve(t(x) %*% x, t(x) %*% y)
          [,1]
[1,]  7.068221
[2,] -0.023938
[3,]  0.319465
[4,]  0.009144
[5,] -0.240524
[6,] -0.074805

We can estimate σ using the estimator in the text:

> sqrt(sum(gfit$res^2)/(30-6))
[1] 60.975

Compare this to the residual standard error in the summary output above.

We may also obtain the standard errors for the coefficients (diag() returns the diagonal of a matrix):

> sqrt(diag(xtxi))*60.975
[1] 19.154139  0.022422  0.053663  1.054133  0.215402  0.017700

Finally we may compute R²:

> 1-sum(gfit$res^2)/sum((y-mean(y))^2)
[1] 0.76585
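As a cross-check (not part of the text), the same quantities computed by hand above are stored as components of the summary object, so they can be compared directly with what lm() produced:

> gs$sigma          # residual standard error; compare with 60.975
> gs$r.squared      # R^2; compare with 0.76585
> gs$coef[,2]       # standard errors of the coefficients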