Chapter 3

Inference

Up till now, we haven't found it necessary to assume any distributional form for the errors ε. However, if we want to make any confidence intervals or perform any hypothesis tests, we will need to do this. The usual assumption is that the errors are normally distributed, and in practice this is often, although not always, a reasonable assumption. We'll assume that the errors are independent and identically normally distributed with mean 0 and variance σ², i.e.

\[ \varepsilon \sim N(0, \sigma^2 I). \]

We can handle non-identity variance matrices provided we know the form; see the section on generalized least squares later. Now since y = Xβ + ε,

\[ y \sim N(X\beta, \sigma^2 I) \]

is a compact description of the regression model, and from this we find (using the fact that linear combinations of normally distributed values are also normal) that

\[ \hat\beta = (X^T X)^{-1} X^T y \sim N\left(\beta, (X^T X)^{-1}\sigma^2\right). \]
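As a quick numerical check of this result (a sketch that is not part of the original text; the design matrix, coefficients and error standard deviation below are all made up), we can simulate repeated samples in R and compare the empirical covariance of the estimates with (X^T X)^{-1}σ²:

set.seed(1)
n <- 50
X <- cbind(1, runif(n), runif(n))        # made-up design matrix with an intercept
beta <- c(2, -1, 0.5)                    # made-up true coefficients
sigma <- 1.5                             # made-up error standard deviation
betahat <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd=sigma)   # generate y = X beta + eps
  drop(solve(t(X) %*% X, t(X) %*% y))    # least squares estimate
})
cov(t(betahat))                          # empirical covariance of the 5000 estimates
solve(t(X) %*% X) * sigma^2              # theoretical covariance (X'X)^{-1} sigma^2

The two matrices should agree closely, illustrating the stated sampling distribution of the least squares estimate.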
3.1 Hypothesis tests to compare models

Given several predictors for a response, we might wonder whether all are needed. Consider a large model, Ω, and a smaller model, ω, which consists of a subset of the predictors that are in Ω. By the principle of Occam's Razor (also known as the law of parsimony), we'd prefer to use ω if the data will support it. So we'll take ω to represent the null hypothesis and Ω to represent the alternative. A geometric view of the problem may be seen in Figure 3.1.

Figure 3.1: Geometric view of the comparison between the big model, Ω, and the small model, ω. The squared length of the residual vector for the big model is RSS_Ω, while that for the small model is RSS_ω. By Pythagoras' theorem, the squared length of the vector connecting the two fits is RSS_ω − RSS_Ω. A small value for this indicates that the small model fits almost as well as the large model and thus might be preferred because of its simplicity.

If RSS_ω − RSS_Ω is small, then ω is an adequate model relative to Ω. This suggests that something like

\[ \frac{RSS_\omega - RSS_\Omega}{RSS_\Omega} \]

would be a potentially good test statistic, where the denominator is used for scaling purposes.

As it happens, the same test statistic arises from the likelihood-ratio testing approach. We give an outline of the development: if L(β, σ | y) is the likelihood function, then the likelihood ratio statistic is

\[ \frac{\max_{\beta,\sigma \in \Omega} L(\beta, \sigma \mid y)}{\max_{\beta,\sigma \in \omega} L(\beta, \sigma \mid y)}. \]

The test should reject if this ratio is too large. Working through the details, we find that

\[ L(\hat\beta, \hat\sigma \mid y) \propto \hat\sigma^{-n}, \]

which gives us a test that rejects if

\[ \frac{\hat\sigma^2_\omega}{\hat\sigma^2_\Omega} > \text{a constant}, \]

which is equivalent to

\[ \frac{RSS_\omega}{RSS_\Omega} > \text{a constant} \]

(the constants are not the same), or

\[ \frac{RSS_\omega}{RSS_\Omega} - 1 > \text{a constant} - 1, \]

which is

\[ \frac{RSS_\omega - RSS_\Omega}{RSS_\Omega} > \text{a constant}, \]

which is the same statistic suggested by the geometric view. It remains for us to discover the null distribution of this statistic.

Now suppose that the dimension (number of parameters) of Ω is q and the dimension of ω is p. By Cochran's theorem, if the null (ω) is true, then

\[ RSS_\omega - RSS_\Omega \sim \sigma^2 \chi^2_{q-p}, \qquad RSS_\Omega \sim \sigma^2 \chi^2_{n-q}, \]

and these two quantities are independent. So we find that

\[ F = \frac{(RSS_\omega - RSS_\Omega)/(q-p)}{RSS_\Omega/(n-q)} \sim F_{q-p,\, n-q}. \]
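In R, this F statistic can be computed directly from two nested fits. The helper below is a minimal sketch (it is not from the original text and the function name is ours); it uses deviance(), which returns the residual sum of squares for an lm fit, and df.residual():

nested.F <- function(small, big) {
  rss.small <- deviance(small)      # RSS for the null model (omega)
  rss.big <- deviance(big)          # RSS for the larger model (Omega)
  df.small <- df.residual(small)    # n - p
  df.big <- df.residual(big)        # n - q
  f <- ((rss.small - rss.big)/(df.small - df.big)) / (rss.big/df.big)
  c(F=f, p.value=1 - pf(f, df.small - df.big, df.big))
}

The built-in anova() function applied to two nested lm fits performs the same comparison.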
Thus we would reject the null hypothesis if $F > F^{(\alpha)}_{q-p,\,n-q}$. The degrees of freedom of a model is (usually) the number of observations minus the number of parameters, so this test statistic can be written

\[ F = \frac{(RSS_\omega - RSS_\Omega)/(df_\omega - df_\Omega)}{RSS_\Omega / df_\Omega}, \]

where df_Ω = n − q and df_ω = n − p. The same test statistic applies not just when ω is a subset of Ω but also to a subspace. This test is very widely used in regression and analysis of variance. When it is applied in different situations, the form of the test statistic may be re-expressed in various different ways. The beauty of this approach is that you only need to know the general form. In any particular case, you just need to figure out which models represent the null and alternative hypotheses, fit them and compute the test statistic. It is very versatile.

3.2 Some Examples

3.2.1 Test of all predictors

Are any of the predictors useful in predicting the response?

• Full model (Ω): y = Xβ + ε, where X is a full-rank n × p matrix.
• Reduced model (ω): y = µ + ε, i.e. predict y by its mean.

We could write the null hypothesis in this case as

\[ H_0: \beta_1 = \cdots = \beta_{p-1} = 0. \]

Now

\[ RSS_\Omega = (y - X\hat\beta)^T (y - X\hat\beta) = \hat\varepsilon^T \hat\varepsilon = RSS, \]
\[ RSS_\omega = (y - \bar{y})^T (y - \bar{y}) = SYY, \]

where SYY is sometimes known as the sum of squares corrected for the mean. So in this case

\[ F = \frac{(SYY - RSS)/(p-1)}{RSS/(n-p)}. \]

We'd now refer to $F_{p-1,\,n-p}$ for a critical value or a p-value. Large values of F would indicate rejection of the null. Traditionally, the information in the above test is presented in an analysis of variance table. Most computer packages produce a variant on this; see Table 3.1. It is not really necessary to specifically compute all the elements of the table. As Fisher, the originator of the table, said in 1931, it is "nothing but a convenient way of arranging the arithmetic". Since he had to do his calculations by hand, the table served some purpose, but it is less useful now.

Source        Deg. of Freedom   Sum of Squares   Mean Square      F
Regression    p − 1             SSreg            SSreg/(p − 1)    F
Residual      n − p             RSS              RSS/(n − p)
Total         n − 1             SYY

Table 3.1: Analysis of Variance table

A failure to reject the null hypothesis is not the end of the game; you must still investigate the possibility of non-linear transformations of the variables and of outliers which may obscure the relationship. Even then, you may just have insufficient data to demonstrate a real effect, which is why we must be careful to say "fail to reject" the null rather than "accept" the null. It would be a mistake to conclude that no real relationship exists. This issue arises when a pharmaceutical company wishes to show that a proposed generic replacement for a brand-named drug is equivalent. It would not be enough in this instance just to fail to reject the null; a higher standard would be required.
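In R, this overall test amounts to comparing the full fit against the intercept-only fit. The sketch below is not from the original text and uses hypothetical data frame and variable names; the savings example that follows carries out the same test on real data.

# Sketch of the test of all predictors (names here are hypothetical placeholders)
nullfit <- lm(y ~ 1, data=mydata)                # intercept-only model; its RSS is SYY
fullfit <- lm(y ~ x1 + x2 + x3, data=mydata)     # model with all the predictors; its RSS is RSS
anova(nullfit, fullfit)                          # F = ((SYY-RSS)/(p-1)) / (RSS/(n-p))

The F-statistic printed at the foot of summary(fullfit) is this same quantity.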
When the null is rejected, this does not imply that the alternative model is the best model. We don't know whether all the predictors are required to predict the response or just some of them. Other predictors might also be added, for example quadratic terms in the existing predictors. Either way, the overall F-test is just the beginning of an analysis and not the end.

Let's illustrate this test and others using an old economic dataset on 50 different countries. These data are averages over 1960–1970 (to remove business cycle or other short-term fluctuations). dpi is per-capita disposable income in U.S. dollars; ddpi is the percent rate of change in per-capita disposable income; sr is aggregate personal saving divided by disposable income. The percentages of the population under 15 (pop15) and over 75 (pop75) are also recorded. The data come from Belsley, Kuh, and Welsch (1980). Take a look at the data:

> data(savings)
> savings
                sr pop15 pop75     dpi ddpi
Australia    11.43 29.35  2.87 2329.68 2.87
Austria      12.07 23.32  4.41 1507.99 3.93
  --- cases deleted ---
Malaysia      4.71 47.20  0.66  242.69 5.08

First consider a model with all the predictors:

> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=savings)
> summary(g)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.566087   7.354516    3.88  0.00033
pop15       -0.461193   0.144642   -3.19  0.00260
pop75       -1.691498   1.083599   -1.56  0.12553
dpi         -0.000337   0.000931   -0.36  0.71917
ddpi         0.409695   0.196197    2.09  0.04247

Residual standard error: 3.8 on 45 degrees of freedom
Multiple R-Squared: 0.338,     Adjusted R-squared: 0.28
F-statistic: 5.76 on 4 and 45 degrees of freedom,   p-value: 0.00079

We can see directly the result of the test of whether any of the predictors have significance in the model. In other words, whether β1 = β2 = β3 = β4 = 0. Since the p-value is so small, this null hypothesis is rejected. We can also do it directly using the F-testing formula:
> sum((savings$sr-mean(savings$sr))^2)
[1] 983.63
> sum(g$res^2)
[1] 650.71
> ((983.63-650.71)/4)/(650.706/45)
[1] 5.7558
> 1-pf(5.7558,4,45)
[1] 0.00079026

Do you know where all the numbers come from? Check that they match the regression summary above.

3.2.2 Testing just one predictor

Can one particular predictor be dropped from the model? The null hypothesis would be H_0: β_i = 0. Set it up like this:

• RSS_Ω is the RSS for the model with all the predictors of interest (p parameters).
• RSS_ω is the RSS for the model with all the above predictors except predictor i.

The F-statistic may be computed using the formula from above. An alternative approach is to use a t-statistic for testing the hypothesis:

\[ t_i = \frac{\hat\beta_i}{se(\hat\beta_i)} \]

and check for significance using a t distribution with n − p degrees of freedom. However, squaring the t-statistic here, i.e. t_i², gives you the F-statistic, so the two approaches are identical.

For example, to test the null hypothesis that β1 = 0, i.e. that pop15 is not significant in the full model, we can simply observe that the p-value is 0.0026 from the table and conclude that the null should be rejected.

Let's do the same test using the general F-testing approach. We'll need the RSS and df for the full model; these are 650.71 and 45 respectively. We then fit the model that represents the null:

> g2 <- lm(sr ~ pop75 + dpi + ddpi, data=savings)

and compute the RSS and the F-statistic:

> sum(g2$res^2)
[1] 797.72
> (797.72-650.71)/(650.71/45)
[1] 10.167

The p-value is then

> 1-pf(10.167,1,45)
[1] 0.0026026

We can relate this to the t-based test and p-value by

> sqrt(10.167)
[1] 3.1886
> 2*(1-pt(3.1886,45))
[1] 0.0026024
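As a quick cross-check (not part of the original computation), R's anova() function applied to the two nested fits reproduces the same F test in one step:

anova(g2, g)   # reports F = 10.167 on 1 and 45 df with p-value 0.0026, matching the hand calculation above

In practice this is usually the most convenient way to carry out such nested-model comparisons.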