be very easy to get statistically significant results, but the actual effects may be unimportant. Would we really care if test scores were 0.1% higher in one state than another? Or that some medication reduced pain by 2%? Confidence intervals on the parameter estimates are a better way of assessing the size of an effect. They are useful even when the null hypothesis is not rejected, because they tell us how confident we are that the true effect or value is close to the null.

Even so, hypothesis tests do have some value, not least because they impose a check on unreasonable conclusions which the data simply do not support.

3.4 Confidence Intervals for β

Confidence intervals provide an alternative way of expressing the uncertainty in our estimates. Even so, they are closely linked to the tests that we have already constructed. For the confidence intervals and regions that we will consider here, the following relationship holds: for a 100(1-α)% confidence region, any point that lies within the region represents a null hypothesis that would not be rejected at the 100α% level, while every point outside represents a null hypothesis that would be rejected. So, in a sense, the confidence region provides more information than a single hypothesis test in that it tells us the outcome of a whole range of hypotheses about the parameter values. Of course, by selecting a particular level of confidence for the region, we can only make tests at that level, and we cannot determine the p-value for any given test simply from the region. However, since it is dangerous to read too much into the relative size of p-values (as far as how much evidence they provide against the null), this loss is not particularly important. The confidence region tells us about plausible values for the parameters in a way that the hypothesis test cannot. This makes it more valuable.

As with testing, we must decide whether to form confidence regions for parameters individually or simultaneously. Simultaneous regions are preferable, but for more than two dimensions they are difficult to display, so there is still some value in computing the one-dimensional confidence intervals.

We start with the simultaneous regions. Some results from multivariate analysis show that

    $(\hat\beta - \beta)^T X^T X (\hat\beta - \beta) / \sigma^2 \sim \chi^2_p$   and   $(n-p)\hat\sigma^2 / \sigma^2 \sim \chi^2_{n-p}$

and these two quantities are independent. Hence

    $\dfrac{(\hat\beta - \beta)^T X^T X (\hat\beta - \beta)}{p \hat\sigma^2} \sim \dfrac{\chi^2_p / p}{\chi^2_{n-p} / (n-p)} = F_{p,\,n-p}$

So to form a 100(1-α)% confidence region for β, take the β such that

    $(\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le p \hat\sigma^2 F^{(\alpha)}_{p,\,n-p}$

These regions are ellipsoidal in shape. Because the ellipsoids live in higher dimensions, they cannot easily be visualized.
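Although the ellipsoid cannot be drawn in higher dimensions, membership in it is easy to check numerically. The sketch below is my own addition, not from the text; fit stands for any fitted lm object and beta0 for a hypothesized coefficient vector of matching length, both placeholder names:

> # sketch: does beta0 lie inside the joint 95% confidence region?
> # 'fit' is a fitted lm object; 'beta0' has the same length as coef(fit)
> X <- model.matrix(fit)
> p <- length(coef(fit)); n <- nrow(X)
> d <- coef(fit) - beta0
> fstat <- t(d) %*% crossprod(X) %*% d / (p * summary(fit)$sigma^2)
> fstat <= qf(0.95, p, n - p)    # TRUE when beta0 is inside the region

This is simply the F-test of H0: β = beta0 in disguise, which is the duality between regions and tests described above.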
Alternatively, one could consider each parameter individually, which leads to confidence intervals taking the general form

    estimate ± critical value × s.e. of estimate

or specifically in this case:

    $\hat\beta_i \pm t^{(\alpha/2)}_{n-p} \hat\sigma \sqrt{(X^T X)^{-1}_{ii}}$

It is better to consider the joint confidence intervals when possible, especially when the $\hat\beta$ are heavily correlated.

Consider the full model for the savings data. The . in the model formula stands for "every other variable in the data frame", which is a useful abbreviation.

> g <- lm(sr ~ ., savings)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.566087   7.354516    3.88  0.00033
pop15       -0.461193   0.144642   -3.19  0.00260
pop75       -1.691498   1.083599   -1.56  0.12553
dpi         -0.000337   0.000931   -0.36  0.71917
ddpi         0.409695   0.196197    2.09  0.04247

Residual standard error: 3.8 on 45 degrees of freedom
Multiple R-Squared: 0.338,     Adjusted R-squared: 0.28
F-statistic: 5.76 on 4 and 45 degrees of freedom,     p-value: 0.00079

We can construct an individual 95% confidence interval for the regression parameter of pop75:

> qt(0.975,45)
[1] 2.0141
> c(-1.69-2.01*1.08,-1.69+2.01*1.08)
[1] -3.8608  0.4808

and similarly for growth (ddpi):

> c(0.41-2.01*0.196,0.41+2.01*0.196)
[1] 0.01604 0.80396

Notice that this confidence interval is quite wide in the sense that the upper limit is about 50 times larger than the lower limit. This means that we are not really that confident about what the exact effect of growth on savings is.

Confidence intervals have a duality with two-sided hypothesis tests: a 95% confidence interval contains exactly those null hypothesized values that would not be rejected at the 5% level. Thus the interval for pop75 contains zero, which indicates that the null hypothesis $H_0: \beta_{pop75} = 0$ would not be rejected at the 5% level. We can see from the output above that the p-value is 12.5%, greater than 5%, confirming this point. In contrast, we see that the interval for ddpi does not contain zero, and so the null hypothesis is rejected for its regression parameter.
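As an aside not in the original text, base R's confint() computes these one-at-a-time t-intervals directly from the unrounded estimates, which makes a useful cross-check on the hand calculations above:

> # 95% t-based intervals for all coefficients at once
> confint(g)
> # or for a single parameter
> confint(g, "pop75")

Up to the rounding of the estimates and standard errors used above, these should reproduce the intervals computed by hand.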
Now we construct the joint 95% confidence region for these two parameters. First we load the ellipse library for drawing confidence ellipses, which is not part of base R:

> library(ellipse)

and now the plot:

> plot(ellipse(g,c(2,3)),type="l",xlim=c(-1,0))

add the origin and the point of the estimates:

> points(0,0)
> points(g$coef[2],g$coef[3],pch=18)

How does the position of the origin relate to a test for removing pop75 and pop15? (A sketch at the end of this section makes the connection explicit.) Now we mark the one-way confidence intervals on the plot for reference:

> abline(v=c(-0.461-2.01*0.145,-0.461+2.01*0.145),lty=2)
> abline(h=c(-1.69-2.01*1.08,-1.69+2.01*1.08),lty=2)

See the plot in Figure 3.3.

[Figure 3.3: Confidence ellipse and regions for β_pop75 and β_pop15. x-axis: pop15; y-axis: pop75.]

Why are these lines not tangential to the ellipse? The reason is that the confidence intervals are calculated individually. If we wanted a 95% chance that both intervals contained their true values simultaneously, then the lines would be tangential.

In some circumstances, the origin could lie within both one-way confidence intervals but outside the ellipse. In this case, neither one-at-a-time test would reject the null, whereas the joint test would. The latter test would be preferred. It is also possible for the origin to lie outside the rectangle but inside the ellipse. In this case, the joint test would not reject the null, whereas at least one of the one-at-a-time tests would. Again we prefer the joint test result.

Examine the correlation of the two predictors:

> cor(savings$pop15,savings$pop75)
[1] -0.90848

But from the plot, we see that the coefficients have a positive correlation. The correlation between predictors and the correlation between the coefficients of those predictors are often different in sign. Intuitively, this can be explained by realizing that two negatively correlated predictors are attempting to perform the same job. The more work one does, the less the other can do, and hence the positive correlation in the coefficients.
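Returning to the question posed above about the origin: this sketch is my addition, not part of the original text. The origin corresponds to the null hypothesis $H_0: \beta_{pop15} = \beta_{pop75} = 0$, which can be tested by comparing the full model against one with both predictors dropped:

> # joint F-test of H0: beta_pop15 = beta_pop75 = 0
> g2 <- lm(sr ~ dpi + ddpi, savings)
> anova(g2,g)

The origin lies inside the joint 95% ellipse exactly when this F-test fails to reject at the 5% level.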
3.5 Confidence intervals for predictions

Given a new set of predictors, $x_0$, what is the predicted response? Easy: just $\hat y_0 = x_0^T \hat\beta$. However, we need to distinguish between predictions of the future mean response and predictions of future observations. To make the distinction, suppose we have built a regression model that predicts the selling price of homes in a given area based on predictors like the number of bedrooms, closeness to a major highway, etc. There are two kinds of predictions that can be made for a given $x_0$.

1. Suppose a new house comes on the market with characteristics $x_0$. Its selling price will be $x_0^T \beta + \varepsilon$. Since $E\varepsilon = 0$, the predicted price is $x_0^T \hat\beta$, but in assessing the variance of this prediction, we must include the variance of ε.

2. Suppose we ask the question "What would the house with characteristics $x_0$ sell for on average?" This selling price is $x_0^T \beta$ and is again predicted by $x_0^T \hat\beta$, but now only the variance in $\hat\beta$ needs to be taken into account.

Most times, we will want the first case, which is called "prediction of a future value", while the second case, called "prediction of the mean response", is less common.

Now $\mathrm{var}(x_0^T \hat\beta) = x_0^T (X^T X)^{-1} x_0 \,\sigma^2$. A future observation is predicted to be $x_0^T \hat\beta + \varepsilon$ (where we do not know what the future ε will turn out to be). Since the future ε is independent of $\hat\beta$, the two variances add, giving a prediction error variance of $\sigma^2 (1 + x_0^T (X^T X)^{-1} x_0)$. So a 100(1-α)% confidence interval for a single future response is

    $\hat y_0 \pm t^{(\alpha/2)}_{n-p} \hat\sigma \sqrt{1 + x_0^T (X^T X)^{-1} x_0}$

If, on the other hand, you want a confidence interval for the average of the responses for a given $x_0$, then use

    $\hat y_0 \pm t^{(\alpha/2)}_{n-p} \hat\sigma \sqrt{x_0^T (X^T X)^{-1} x_0}$

We return to the Galapagos data for this example.

> g <- lm(Species ~ Area+Elevation+Nearest+Scruz+Adjacent,data=gala)

Suppose we want to predict the number of species (of tortoise) on an island with predictor values 0.08, 93, 6.0, 12.0, 0.34 (in the same order as in the dataset). Of course, it is difficult to see why in practice we would want to do this, because a new island is unlikely to present itself. For a dataset like this, interest would center on the structure of the model and the relative importance of the predictors, so we should regard this more as a "what if?" exercise.

Do it first directly from the formula:

> x0 <- c(1,0.08,93,6.0,12.0,0.34)
> y0 <- sum(x0*g$coef)
> y0
[1] 33.92
This is the predicted number of species, which is not a whole number even though the response is. We could round up to 34.

Now if we want a 95% confidence interval for the prediction, we must decide whether we are predicting the number of species on one new island or the mean response for all islands with the same predictors $x_0$. Possibly, an island might not have been surveyed for the original dataset, in which case the former interval would be the one we want. For this dataset, the latter interval would be more valuable for "what if?" type calculations.

First we need the t-critical value:

> qt(0.975,24)
[1] 2.0639

You may need to recalculate the $(X^T X)^{-1}$ matrix:

> x <- cbind(1,gala[,3:7])
> x <- as.matrix(x)
> xtxi <- solve(t(x) %*% x)

The half-width of the mean response CI is

> bm <- sqrt(x0 %*% xtxi %*% x0) * 2.064 * 60.98
> bm
      [,1]
[1,] 32.89

and the interval is

> c(y0-bm,y0+bm)
[1]  1.0296 66.8097

Now we compute the prediction interval for the single future response:

> bm <- sqrt(1+x0 %*% xtxi %*% x0) * 2.064 * 60.98
> c(y0-bm,y0+bm)
[1] -96.17 164.01

What physically unreasonable feature do you notice about it? In such instances, impossible values in the confidence interval can be avoided by transforming the response, say by taking logs (explained in a later chapter), or by using a probability model more appropriate to the response. The normal distribution is supported on the whole real line, so negative values are always possible. A better choice for this example might be the Poisson distribution, which is supported on the non-negative integers.

There is a more direct method for computing the CI. The function predict() requires that its second argument be a data frame with variables named in the same way as in the original dataset:

> predict(g,data.frame(Area=0.08,Elevation=93,Nearest=6.0,Scruz=12,
    Adjacent=0.34),se=T)
$fit:
33.92
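More conveniently still (this sketch is my addition, not from the original text), predict() can return either interval directly through its interval argument, so the bands need not be computed by hand:

> # newx is the same hypothetical island as above
> newx <- data.frame(Area=0.08,Elevation=93,Nearest=6.0,Scruz=12,Adjacent=0.34)
> predict(g, newx, interval="confidence")   # CI for the mean response
> predict(g, newx, interval="prediction")   # PI for a single future response

These should agree with the intervals obtained from the formulas above, up to rounding.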