Implementing Statistical Criteria to Select Return Forecasting Models: What Do We Learn?

Peter Bossaerts, California Institute of Technology
Pierre Hillion, INSEAD

Statistical model selection criteria provide an informed choice of the model with best external (i.e., out-of-sample) validity. Therefore they guard against overfitting ("data snooping"). We implement several model selection criteria in order to verify recent evidence of predictability in excess stock returns and to determine which variables are valuable predictors. We confirm the presence of in-sample predictability in an international stock market dataset, but discover that even the best prediction models have no out-of-sample forecasting power. The failure to detect out-of-sample predictability is not due to lack of power.

Address correspondence to Peter Bossaerts, HSS 228-77, California Institute of Technology, Pasadena, CA 91125, or e-mail: pbs@rioja.caltech.edu. P. Bossaerts thanks First Quadrant for financial support through a grant to the California Institute of Technology. First Quadrant also provided the data that were used in this study. The article was revised in part when the first author was at the Center for Economic Research, Tilburg University. P. Hillion thanks the Hong Kong University of Science and Technology for their hospitality while doing part of the research. Comments from Michel Dacorogna, Rob Engle, Joel Hasbrouck, Andy Lo, P.C.B. Phillips, Richard Roll, Mark Taylor, and Ken West, from two anonymous referees, and the editor (Ravi Jagannathan), as well as from seminar participants at the Hong Kong University of Science and Technology, the University of California, San Diego, the University of California, Santa Barbara, the 1994 NBER Spring Conference on Asset Pricing, the 1994 Western Finance Association Meetings, and the 1995 CEPR/LIFE Conference on International Finance are gratefully acknowledged.

The Review of Financial Studies, Summer 1999, Vol. 12, No. 2, pp. 405-428. © 1999 The Society for Financial Studies. 0893-9454/99/$1.50

1. Introduction

Almost all validation of financial theory is based on historical datasets. Take, for instance, the theory of efficient markets. Loosely speaking, it asserts that securities returns must not be predictable from past information. Numerous studies have attempted to verify this theory, and ample evidence of predictability has been uncovered. This has led many to question the validity of the theory.

Quite reasonably, some have recently questioned the conclusiveness of such findings, pointing to the fact that they are based on repeated reevaluation of the same dataset, or, if not the same, at least datasets that cover similar time periods. For instance, Lo and MacKinlay (1990) argue that the "size effect" in tests of the capital asset pricing model (CAPM) may very well be the result of an unconscious, exhaustive search for a portfolio formation criterion with the aim of rejecting the theory.
The Review of Financial Smdies /v 12 n 2 1999 the "size effect"in tests of the capital asset pricing model (CAPM)may very well be the result of an unconscious,exhaustive search for a portfolio formation criterion with the aim of rejecting the theory. Repeated visits of the same dataset indeed lead to a problem that statis- ticians refer to as model overfitting [Lo and MacKinlay (1990)called it "data snooping"],that is,the tendency to discover spurious relationships when applying tests that are inspired by evidence from prior visits to the same dataset.There are several ways to address model overfitting.The finance literature has emphasized two approaches.First,one can attempt to collect new data,covering different time periods and/or markets [e.g., Solnik (1993)].Second,standard test sizes can be adjusted for overfitting tendencies.These adjustments are either based on theoretical approxima- tions such as Bonferroni bounds [Foster,Smith,and Whaley (1997)],or on bootstrapping stationary time series [Sullivan,Timmermann,and White (1997)1. The two routes that the finance literature has taken to deal with model overfitting,however,do present some limitations.New,independent data are available only to a certain extent.And adjustment of standard test sizes merely help in correctly rejecting the simple null hypothesis of no rela- tionship.It will provide little information,however,when,in addition,the empiricist is asked to discriminate between competing models under the alternative of the existence of some relationship. In contrast,the statistics literature has long promoted model selection criteria to guard against overfitting.Of these,Akaike's criterion [Akaike (1974)]is probably the best known.There are many others,however,in- spired by different criteria about what constitutes an optimal model (one distinguishes Bayesian and information-theoretic criteria),and with varying degrees of robustness to unit-root nonstationarities in the data. The purpose of this article is to implement several selection criteria from the statistics literature (including our own,meant to correct some well- known small-sample biases in one of these criteria),based on popularity and on robustness to unit roots in the independent variables.The aim is to verify whether stock index returns in excess of the riskfree rate are indeed predictable,as many have recently concluded [e.g.,Fama (1991),Keim and Stambaugh (1986),Campbell (1987),Breen,Glosten,and Jagannathan (1990),Brock,Lakonishok,and LeBaron (1992),Sullivan,Timmermann, and White (1997)]. Our insistence on model selection criteria that are robust to unit-root nonstationarities is motivated by the time-series properties of some candi- date predictors,such as price-earnings ratios,dividend yields,lagged index levels,or even short-term interest rates.These variables are either mani- Foster,Smith,and Whaley (1997)also present simulation-based adjustments. 406
We study an international sample of excess stock returns and candidate predictors which First Quadrant was kind enough to release to us. The time period nests that of another international study, Solnik (1993). Therefore, we also provide diagnostic tests that compare the two datasets (which are based on different sources).

We discover ample evidence of predictability, confirming the conclusion of studies that were not based on formal model selection criteria. Usually only a few standard predictors are retained, however. Some of these are unit-root nonstationary (e.g., dividend yield). Multiple lagged bond or stock returns are at times included, effectively generating the moving-average predictors that have become popular in professional circles lately [see also Brock, Lakonishok, and LeBaron (1992) and Sullivan, Timmermann, and White (1997)].

Formal model selection criteria guard against overfitting. The ultimate purpose is to obtain the model with the best external validity. In the context of prediction, this means that the retained model should provide good out-of-sample predictability. We test this on our dataset of international stock returns.

Overall, we find no out-of-sample predictability. More specifically, none of the models that the selection criteria chose generates significant predictive power in the 5-year period beyond the initial ("training") sample. This conclusion is based on an SUR test of the slope coefficients in out-of-sample regressions of outcomes onto predictions across the different stock markets. The failure to detect out-of-sample predictability cannot be attributed to lack of power. Schwarz's Bayesian criterion, for instance, discovers predictability in 9 of 14 markets, with an average R² of the retained models of 6%. Out of sample, however, none of the retained models generates significant forecasting power. Even with only nine samples of 60 months each, chances that this would occur if 6% were indeed the true R² are less than 1 in 333.

The poor external validity of the prediction models that formal model selection criteria chose indicates model nonstationarity: the parameters of the "best" prediction model change over time. It is an open question why this is. One potential explanation is that the "correct" prediction model is actually nonlinear, while our selection criteria chose exclusively among linear models. Still, these criteria pick the best linear prediction model: it is surprising that even this best forecaster does not work out of sample.

As an explanation for the findings, however, model nonstationarity lacks economic content. It begs the question as to what generates this nonstationarity. Pesaran and Timmermann (1995) also noticed that prediction performance improves if one switches models over time. They suggest that it reflects learning in the marketplace. Bossaerts (1997) investigates this possibility theoretically. He proves that evidence of predictability will disappear entirely out of sample if the market learns on the basis of Bayesian updating rules. In other words, Bayesian learning could explain our findings.
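For concreteness, the out-of-sample check described above can be sketched for a single market as follows; the article itself pools all markets in an SUR system and uses its own sample split. The data, the 240/60 month split, and all variable names below are hypothetical, so this is only an illustration of the idea of regressing realized excess returns on out-of-sample forecasts and testing the slope.

```python
import numpy as np
from scipy import stats

def oos_slope_test(r, X, n_train):
    """Estimate the selected model on the training sample, then regress
    out-of-sample realizations on the resulting forecasts and test the slope."""
    beta, *_ = np.linalg.lstsq(X[:n_train], r[:n_train], rcond=None)
    forecasts = X[n_train:] @ beta                  # out-of-sample predictions
    realized = r[n_train:]

    Z = np.column_stack([np.ones(len(forecasts)), forecasts])
    coef, *_ = np.linalg.lstsq(Z, realized, rcond=None)
    resid = realized - Z @ coef
    dof = len(realized) - 2
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    t_stat = coef[1] / np.sqrt(cov[1, 1])           # t-statistic on the slope
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), dof))
    return coef[1], t_stat, p_value

# Hypothetical split: 240 training months, 60 out-of-sample months.
rng = np.random.default_rng(1)
T, n_train = 300, 240
x = rng.standard_normal(T)                          # simulated predictor
r = 0.05 * x + rng.standard_normal(T)               # simulated excess returns
X = np.column_stack([np.ones(T), x])
slope, t_stat, p = oos_slope_test(r, X, n_train)
print(f"out-of-sample slope = {slope:.3f}, t = {t_stat:.2f}, p = {p:.3f}")
# Genuine forecasting power would show up as a significantly positive slope;
# the article finds no such evidence in any market.
```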
The remainder of this article is organized as follows. The next section introduces model selection criteria. Section 3 describes the dataset. Section 4 presents the results. Section 5 discusses the power of the out-of-sample prediction tests. Section 6 concludes. There are three appendixes. They discuss technical issues and list the data sources.

2. Model Selection Criteria

Formal model selection criteria have long been considered in the statistics literature in order to select the "best" model among a set of candidate models. Statisticians realized that there is a tendency to overfit, and hence that the model that has the highest in-sample explanatory power usually does not have the highest external validity (i.e., out-of-sample fit). Several criteria were developed, starting from particular decision criteria, Bayesian or information theoretic.

We decided to pick several model selection criteria in our study of the predictability of excess stock returns. Each has its merit, and many are robust to the presence of unit roots in the candidate predictors. It is not appropriate to discuss here the advantages and shortcomings of the retained selection criteria. Suffice it to mention that all selection criteria contributed uniformly to the main conclusions of this article.

Formally, we use statistical criteria and T observations to select among K linear models that predict the market's excess return, $r_t$ ($t = 1, \ldots, T$). The models differ in terms of the content and dimension of the prediction vector. Let $p^k$ denote the dimension for model $k$ ($k = 1, \ldots, K$). The prediction vector of this model for the $t$th return $r_t$ is obtained by dropping all but $p^k$ elements from the vector of all possible predictors, $x_{t-1}$. $x_{t-1}$ includes an intercept as one of the predictors, as well as variables such as the short-term Treasury bill yield, etc. (We will be explicit later on.) Letting $\theta^k$ denote its coefficient vector, model $k$ can be written as

$$r_t = \theta^{k\prime} x^k_{t-1} + \epsilon^k_t, \qquad (1)$$

with $E[\epsilon^k_t] = 0$, $E[\epsilon^k_t x^k_{t-1}] = 0$.

In the first model, with $k = 1$, we included only the intercept. (Hence, $p^1 = 1$.) This way, selection criteria are allowed to decide in favor of no predictability, beyond a constant. The latter is usually interpreted as a (fixed) risk premium. This option is important. Indeed, the original goal of this study was to verify whether the evidence of return predictability would still emerge if examined with formal selection criteria.

Each selection criterion chooses among the K possible model specifications. We will use the notation $k^*$ to denote the preferred model. Seven