A somewhat more convenient way to compare two nested models is

> anova(g2,g)
Analysis of Variance Table
Model 1: sr ~ pop75 + dpi + ddpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        798
2     45        651  1    147    10.2 0.0026

Understand that this test of pop15 is relative to the other predictors in the model, namely pop75, dpi and ddpi. If these other predictors were changed, the result of the test may be different. This means that it is not possible to look at the effect of pop15 in isolation. Simply stating the null hypothesis as H0: βpop15 = 0 is insufficient; information about what other predictors are included in the null is necessary. The result of the test may be different if the predictors change.

3.2.3 Testing a pair of predictors

Suppose we wish to test the significance of variables Xj and Xk. We might construct a table as shown just above and find that both variables have p-values greater than 0.05, thus indicating that individually neither is significant. Does this mean that both Xj and Xk can be eliminated from the model? Not necessarily. Except in special circumstances, dropping one variable from a regression model causes the estimates of the other parameters to change, so we might find that after dropping Xj, a test of the significance of Xk shows that it should now be included in the model. If you really want to check the joint significance of Xj and Xk, you should fit a model with and then without them and use the general F-test discussed above. Remember that even the result of this test may depend on what other predictors are in the model. Can you see how to test the hypothesis that both pop75 and ddpi may be excluded from the model?

[Figure 3.2: Testing two predictors. The figure shows the lattice of models y ~ x1 + x2 + x3, y ~ x1 + x2, y ~ x1 + x3 and y ~ x1, with arrows indicating the possible comparisons between nested models.]

The testing choices are depicted in Figure 3.2.
Here we are considering two predictors, x2 and x3, in the presence of x1. Five possible tests may be considered here and the results may not always be apparently consistent. The results of each test need to be considered individually in the context of the particular example.
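As a sketch of the joint test suggested above, the following fits the full savings model with and without pop75 and ddpi and compares them with the general F-test. It uses R's built-in LifeCycleSavings data frame, which contains the same savings data (sr, pop15, pop75, dpi, ddpi) analyzed in this chapter; the object names are illustrative.

```r
# Joint test that pop75 and ddpi can both be dropped from the model.
# LifeCycleSavings is R's built-in copy of the savings data used here.
savings <- LifeCycleSavings

g  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)  # full model
g2 <- lm(sr ~ pop15 + dpi, data = savings)                 # pop75 and ddpi removed

# General F-test comparing the nested models; the difference in
# degrees of freedom is 2 because two predictors are dropped at once
anova(g2, g)
```

As with the single-predictor tests, the result of this joint test is relative to the predictors that remain in the model, here pop15 and dpi.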
3.2.4 Testing a subspace

Consider this example. Suppose that y is the miles-per-gallon for a make of car, Xj is the weight of the engine and Xk is the weight of the rest of the car. There would also be some other predictors. We might wonder whether we need two weight variables; perhaps they can be replaced by the total weight, Xj + Xk. So if the original model was

y = β0 + ... + βjXj + βkXk + ... + ε

then the reduced model is

y = β0 + ... + βl(Xj + Xk) + ... + ε

which requires that βj = βk for this reduction to be possible. So the null hypothesis is

H0: βj = βk

This defines a linear subspace to which the general F-testing procedure applies. In our example, we might hypothesize that the effect of young and old people on the savings rate was the same, or in other words that

H0: βpop15 = βpop75

In this case the null model would take the form

y = β0 + βpop15(pop15 + pop75) + βdpi dpi + βddpi ddpi + ε

We can then compare this to the full model as follows:

> g <- lm(sr ~ .,savings)
> gr <- lm(sr ~ I(pop15+pop75)+dpi+ddpi,savings)
> anova(gr,g)
Analysis of Variance Table
Model 1: sr ~ I(pop15 + pop75) + dpi + ddpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        674
2     45        651  1     23    1.58   0.21

The period in the first model formula is shorthand for all the other variables in the data frame. The function I() ensures that its argument is evaluated rather than interpreted as part of the model formula. The p-value of 0.21 indicates that the null cannot be rejected, meaning that there is no evidence here that young and old people need to be treated separately in the context of this particular model.

Suppose we want to test whether one of the coefficients can be set to a particular value. For example,

H0: βddpi = 1

Here the null model would take the form:

y = β0 + βpop15 pop15 + βpop75 pop75 + βdpi dpi + ddpi + ε

Notice that there is now no coefficient on the ddpi term.
Such a fixed term in the regression equation is called an offset. We fit this model and compare it to the full:
> gr <- lm(sr ~ pop15+pop75+dpi+offset(ddpi),savings)
> anova(gr,g)
Analysis of Variance Table
Model 1: sr ~ pop15 + pop75 + dpi + offset(ddpi)
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        782
2     45        651  1    131    9.05 0.0043

We see that the p-value is small and the null hypothesis here is soundly rejected. A simpler way to test such point hypotheses is to use a t-statistic:

t = (β̂ - c)/se(β̂)

where c is the point hypothesis. So in our example the statistic and corresponding p-value is

> tstat <- (0.409695-1)/0.196197
> tstat
[1] -3.0087
> 2*pt(tstat,45)
[1] 0.0042861

We can see the p-value is the same as before, and if we square the t-statistic

> tstat^2
[1] 9.0525

we find we get the F-value. This latter approach is preferred in practice since we don't need to fit two models, but it is important to understand that it is equivalent to the result obtained using the general F-testing approach.

Can we test a hypothesis such as

H0: βjβk = 1

using our general theory? No. This hypothesis is not linear in the parameters so we can't use our general method. We'd need to fit a non-linear model and that lies beyond the scope of this book.

3.3 Concerns about Hypothesis Testing

1. The general theory of hypothesis testing posits a population from which a sample is drawn; this is our data. We want to say something about the unknown population values β using estimated values β̂ that are obtained from the sample data. Furthermore, we require that the data be generated using a simple random sample of the population. This sample is finite in size, while the population is infinite in size, or at least so large that the sample size is a negligible proportion of the whole. For more complex sampling designs, other procedures should be applied, but of greater concern is the case when the data is not a random sample at all.
There are two cases: (a) A sample of convenience is where the data is not collected according to a sampling design. In some cases, it may be reasonable to proceed as if the data were collected using a random mechanism. For example, suppose we take the first 400 people from the phonebook whose
names begin with the letter P. Provided there is no ethnic effect, it may be reasonable to consider this a random sample from the population defined by the entries in the phonebook. Here we are assuming the selection mechanism is effectively random with respect to the objectives of the study. An assessment of exchangeability is required: are the data as good as random? Other situations are less clear cut and judgment will be required. Such judgments are easy targets for criticism. Suppose you are studying the behavior of alcoholics and advertise in the media for study subjects. It seems very likely that such a sample will be biased, perhaps in unpredictable ways. In cases such as this, a sample of convenience is clearly biased, in which case conclusions must be limited to the sample itself. This situation reduces to the next case, where the sample is the population.

Sometimes, researchers may try to select a "representative" sample by hand. Quite apart from the obvious difficulties in doing this, the logic behind the statistical inference depends on the sample being random. This is not to say that such studies are worthless, but it would be unreasonable to apply anything more than descriptive statistical techniques. Confidence in conclusions from such data is necessarily suspect.

(b) The sample is the complete population, in which case one might argue that inference is not required since the population and sample values are one and the same. For both regression datasets we have considered so far, the sample is effectively the population or a large and biased proportion thereof. In these situations, we can put a different meaning to the hypothesis tests we are making. For the Galapagos dataset, we might suppose that if the number of species had no relation to the five geographic variables, then the observed response values would be randomly distributed between the islands without relation to the predictors.
We might then ask what the chance would be, under this assumption, that an F-statistic would be observed as large or larger than the one we actually observed. We could compute this exactly by computing the F-statistic for all possible (30!) permutations of the response variable and seeing what proportion exceed the observed F-statistic. This is a permutation test. If the observed proportion is small, then we must reject the contention that the response is unrelated to the predictors. Curiously, this proportion is estimated by the p-value calculated in the usual way based on the assumption of normal errors, thus saving us from the massive task of actually computing the regression on all those permutations.

Let us see how we can apply the permutation test to the savings data. I chose a model with just pop75 and dpi so as to get a p-value for the F-statistic that is not too small.

> g <- lm(sr ~ pop75+dpi,data=savings)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.056619   1.290435    5.47  1.7e-06
pop75        1.304965   0.777533    1.68     0.10
dpi         -0.000341   0.001013   -0.34     0.74

Residual standard error: 4.33 on 47 degrees of freedom
Multiple R-Squared: 0.102,  Adjusted R-squared: 0.0642
F-statistic: 2.68 on 2 and 47 degrees of freedom, p-value: 0.0791

We can extract the F-statistic as

> gs <- summary(g)
> gs$fstat
  value numdf dendf
 2.6796 2.0000 47.0000

The function sample() generates random permutations. We compute the F-statistic for 1000 randomly selected permutations and see what proportion exceed the F-statistic for the original data:

> fstats <- numeric(1000)
> for(i in 1:1000){
+   ge <- lm(sample(sr) ~ pop75+dpi,data=savings)
+   fstats[i] <- summary(ge)$fstat[1]
+ }
> length(fstats[fstats > 2.6796])/1000
[1] 0.092

So our estimated p-value using the permutation test is 0.092, which is close to the normal-theory-based value of 0.0791. We could reduce variability in the estimation of the p-value simply by computing more random permutations. Since the permutation test does not depend on the assumption of normality, we might regard it as superior to the normal-theory-based value. Thus it is possible to give some meaning to the p-value when the sample is the population, or for samples of convenience, although one has to be clear that one's conclusions apply only to the particular sample.

Tests involving just one predictor also fall within the permutation test framework. We permute that predictor rather than the response.

Another approach that gives meaning to the p-value when the sample is the population involves the imaginative concept of "alternative worlds", where the sample/population at hand is supposed to have been randomly selected from parallel universes. This argument is definitely more tenuous.

2. A model is usually only an approximation of underlying reality, which makes the meaning of the parameters debatable at the very least. We will say more on the interpretation of parameter estimates later, but the precision of the statement that β1 = 0 exactly is at odds with the acknowledged approximate nature of the model. Furthermore, it is highly unlikely that a predictor that one has taken the trouble to measure and analyze has exactly zero effect on the response. It may be small but it won't be zero.
This means that in many cases, we know that the point null hypothesis is false without even looking at the data. Furthermore, we know that the more data we have, the greater the power of our tests. Even small differences from zero will be detected with a large sample. Now if we fail to reject the null hypothesis, we might simply conclude that we didn’t have enough data to get a significant result. According to this view, the hypothesis test just becomes a test of sample size. For this reason, I prefer confidence intervals. 3. The inference depends on the correctness of the model we use. We can partially check the assumptions about the model but there will always be some element of doubt. Sometimes the data may suggest more than one possible model which may lead to contradictory results. 4. Statistical significance is not equivalent to practical significance. The larger the sample, the smaller your p-values will be so don’t confuse p-values with a big predictor effect. With large datasets it will