The Review of Financial Studies / v 3 n 3 1990

where $X_i$ and $Z_i$ are independent. Therefore, the induced order statistics may be represented as

$$\hat\alpha_{[i:N]} = \mu_\alpha + \rho\,\frac{\sigma_\alpha}{\sigma_x}\,[X_{i:N} - \mu_x] + Z_{[i]}, \qquad Z_{[i]} \ \text{i.i.d.}\ N\bigl(0,\ \sigma_\alpha^2(1-\rho^2)\bigr), \tag{13}$$

where the $Z_{[i]}$ are independent of the (order) statistics $X_{i:N}$. But since $X_{i:N}$ is an order statistic, and since the sequence $i/N$ converges to $\xi$, $X_{i:N}$ converges to the $\xi$th quantile, $F^{-1}(\xi)$. Using (13) then shows that $\hat\alpha_{[i:N]}$ is Gaussian, with mean and variance given by (8) and (9), and independent of the other induced order statistics.$^{12}$

To evaluate the size of a 5 percent test based on the statistic $\tilde\theta$, we need only evaluate the cumulative distribution function of the noncentral $\chi^2_n(\lambda)$ at the point $C_{.05}/(1-\rho^2)$, where $C_{.05}$ is given in (6). Observe that the noncentrality parameter $\lambda$ is an increasing function of $\rho^2$. If $\rho^2 = 0$ then the distribution of $\tilde\theta$ reduces to a central $\chi^2_n$, which is identical to the distribution of $\theta$ in (5): sorting on a characteristic that is statistically independent of the $\hat\alpha_i$'s cannot affect the null distribution of $\theta$. As $\hat\alpha_i$ and $X_i$ become more highly correlated, the noncentral $\chi^2$ distribution shifts to the right. However, this does not imply that the actual size of a 5 percent test necessarily increases, since the relevant critical value for $\tilde\theta$, $C_{.05}/(1-\rho^2)$, also grows with $\rho^2$.$^{13}$

Numerical values for the size of a 5 percent test based on $\tilde\theta$ may be obtained by first specifying choices for the relative ranks $\{\xi_k\}$ of the $n$ securities. We choose three sets of $\{\xi_k\}$ yielding three distinct test statistics $\tilde\theta_1$, $\tilde\theta_2$, and $\tilde\theta_3$:

$$\tilde\theta_1:\quad \xi_k = \frac{k}{n+1}, \qquad k = 1, 2, \ldots, n; \tag{14}$$

12. In fact, this shows how our parametric specification may be relaxed. If we replace normality by the assumption that $\hat\alpha_i$ and $X_i$ satisfy the linear regression equation,

$$\hat\alpha_i = \mu_\alpha + \beta(X_i - \mu_x) + Z_i,$$

where $Z_i$ is independent of $X_i$, then our results remain unchanged. Moreover, this specification may allow us to relax the rather strong i.i.d. assumption, since David (1981, chapters 2.8 and 5.6) does present some results for order statistics in the nonidentically distributed and the dependent cases separately. However, combining and
applying them to the above linear regression relation is a formidable task which we leave to the more industrious.

13. In fact, if $\rho^2 = 1$, the limiting distribution of $\tilde\theta$ is degenerate since the test statistic converges in probability to the following limit:

$$\sum_{k=1}^{n} \bigl[\Phi^{-1}(\xi_k)\bigr]^2.$$

This limit may be greater or less than $C_{.05}$ depending on the values of $\xi_k$; hence, the size of the test in this case may be either zero or unity.
Data-Snooping Biases

$$\tilde\theta_2:\quad \xi_k = \begin{cases} \dfrac{k}{(m+1)(n_0+1)}, & \text{for } k = 1, 2, \ldots, n_0, \\[1.5ex] \dfrac{k + m(n_0+1) - n_0}{(m+1)(n_0+1)}, & \text{for } k = n_0+1, \ldots, 2n_0; \end{cases} \tag{15}$$

$$\tilde\theta_3:\quad \xi_k = \begin{cases} \dfrac{k + n_0 + 1}{(m+1)(n_0+1)}, & \text{for } k = 1, 2, \ldots, n_0, \\[1.5ex] \dfrac{k + (m-1)(n_0+1) - n_0}{(m+1)(n_0+1)}, & \text{for } k = n_0+1, \ldots, 2n_0, \end{cases} \tag{16}$$

where $n \equiv 2n_0$ and $n_0$ is an arbitrary positive integer. The first method (14) simply sets the $\xi_k$'s so that they divide the unit interval into $n$ equally spaced increments. The second procedure (15) first divides the unit interval into $m+1$ equally spaced increments, sets the first half of the $\xi_k$'s to divide the first such increment into equally spaced intervals each of width $1/[(m+1)(n_0+1)]$, and then sets the remaining half so as to divide the last increment into equally spaced intervals also of width $1/[(m+1)(n_0+1)]$ each. The third procedure is similar to the second, except that the $\xi_k$'s are chosen to divide the second smallest and second largest of the $m+1$ increments into equally spaced intervals of width $1/[(m+1)(n_0+1)]$.

These three ways of choosing $n$ securities allow us to see how an attempt to create (or remove) dispersion, as measured by the characteristic $X_i$, affects the null distribution of the statistics. The first choice for the relative ranks is the most disperse, being evenly distributed on (0, 1). The second yields the opposite extreme: the $\hat\alpha_{[i:N]}$'s selected are those with characteristics in the lowest and highest $100/(m+1)$-percentiles. As the parameter $m$ is increased, more extreme outliers are used to compute $\tilde\theta_2$. This is also true for $\tilde\theta_3$, but to a lesser extent since the statistic is based on $\hat\alpha_{[i:N]}$'s in the second lowest and second highest $100/(m+1)$-percentiles.
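The three rank schemes in (14) through (16) can be sketched in a few lines. This is an illustrative helper rather than anything taken from the paper: the function names are hypothetical, and exact fractions are used only so that the endpoints are easy to check by eye.

```python
from fractions import Fraction


def ranks_even(n):
    """Scheme (14): xi_k = k / (n + 1), evenly spaced on (0, 1)."""
    return [Fraction(k, n + 1) for k in range(1, n + 1)]


def ranks_extreme(n0, m):
    """Scheme (15): n0 ranks in the lowest and n0 in the highest 1/(m+1) increment."""
    d = (m + 1) * (n0 + 1)
    low = [Fraction(k, d) for k in range(1, n0 + 1)]
    high = [Fraction(k + m * (n0 + 1) - n0, d) for k in range(n0 + 1, 2 * n0 + 1)]
    return low + high


def ranks_second_extreme(n0, m):
    """Scheme (16): n0 ranks in the second lowest and n0 in the second highest increment."""
    d = (m + 1) * (n0 + 1)
    low = [Fraction(k + n0 + 1, d) for k in range(1, n0 + 1)]
    high = [Fraction(k + (m - 1) * (n0 + 1) - n0, d) for k in range(n0 + 1, 2 * n0 + 1)]
    return low + high
```

For $n_0 = 5$ and $m = 9$, the second scheme returns the ranks 1/60, ..., 5/60 and 55/60, ..., 59/60, all inside the lowest and highest deciles, while the third returns 7/60, ..., 11/60 and 49/60, ..., 53/60.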
Table 1 shows the size of the 5 percent test using $\tilde\theta_1$, $\tilde\theta_2$, and $\tilde\theta_3$ for various values of $n$, $\rho^2$, and $m$. For concreteness, observe that $\rho^2$ is simply the $R^2$ of the cross-sectional regression of $\hat\alpha_i$ on $X_i$, so that $\rho = \pm .10$ implies that only 1 percent of the variation in $\hat\alpha_i$ is explained by $X_i$. For this value of $R^2$, the entries in the second panel of Table 1 show that the size of a 5 percent test using $\tilde\theta_1$ is 4.9 percent for samples of 10 to 100 securities. However, using securities with extreme characteristics does affect the size, as the entries in the "$\tilde\theta_2$-test" and "$\tilde\theta_3$-test" columns indicate. Nevertheless, the largest size in this panel is only 8.1 percent. As expected, the size is larger for the test based on $\tilde\theta_2$ than for that based on $\tilde\theta_3$, since the former statistic is based on more extreme induced order statistics than the latter.
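To see where sizes of this kind come from, the following worked derivation may help. It is a reconstruction from (13) rather than a quotation: it assumes that the statistic is the sum of squared induced order statistics scaled by $\sigma_\alpha^2$, that $\mu_\alpha = 0$ under the null, and that $X_i$ is normal so that $F^{-1}(\xi) = \mu_x + \sigma_x \Phi^{-1}(\xi)$. Equations (5), (8), (9), and the definition of $\lambda$ lie outside this section, so the expressions below should be read as assumptions consistent with the text.

```latex
\begin{align*}
\frac{\hat\alpha_{[k]}}{\sigma_\alpha}
  &\;\overset{a}{\sim}\; N\!\bigl(\rho\,\Phi^{-1}(\xi_k),\; 1-\rho^2\bigr),
  \qquad k = 1,\ldots,n, \text{ independently},\\[1ex]
\tilde\theta
  &= \sum_{k=1}^{n}\Bigl(\frac{\hat\alpha_{[k]}}{\sigma_\alpha}\Bigr)^{2}
  \;\overset{a}{\sim}\; (1-\rho^2)\,\chi^2_n(\lambda),
  \qquad
  \lambda = \frac{\rho^2}{1-\rho^2}\sum_{k=1}^{n}\bigl[\Phi^{-1}(\xi_k)\bigr]^2,\\[1ex]
\text{size}
  &= \Pr\bigl(\tilde\theta > C_{.05}\bigr)
   = \Pr\Bigl(\chi^2_n(\lambda) > \frac{C_{.05}}{1-\rho^2}\Bigr).
\end{align*}
```

With $\rho^2 = 0$ the noncentrality vanishes and the size is exactly 5 percent. As $\rho^2$ grows, both $\lambda$ and the effective critical value $C_{.05}/(1-\rho^2)$ rise, which is why the size can move in either direction.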
Table 1
Theoretical sizes of nominal 5 percent $\chi^2_n$-tests of $H\colon \alpha_i = 0$ ($i = 1, \ldots, n$) using the test statistics $\tilde\theta_j$, $j = 1, 2, 3$, for various sample sizes $n$

n     $\tilde\theta_1$  $\tilde\theta_2$  $\tilde\theta_3$  $\tilde\theta_2$  $\tilde\theta_3$  $\tilde\theta_2$  $\tilde\theta_3$
                        (m = 4)           (m = 4)           (m = 9)           (m = 9)           (m = 19)          (m = 19)

$R^2 = .005$
10    0.049             0.051             0.049             0.053             0.050             0.054             0.052
20    0.050             0.052             0.049             0.054             0.050             0.056             0.052
50    0.050             0.053             0.048             0.056             0.050             0.060             0.053
100   0.050             0.054             0.047             0.059             0.050             0.064             0.054

$R^2 = .01$
10    0.049             0.053             0.048             0.056             0.050             0.059             0.053
20    0.049             0.054             0.047             0.058             0.050             0.063             0.054
50    0.049             0.056             0.046             0.063             0.051             0.071             0.057
100   0.049             0.059             0.045             0.069             0.051             0.081             0.059

$R^2 = .05$
10    0.045             0.063             0.041             0.080             0.051             0.101             0.066
20    0.045             0.070             0.038             0.096             0.052             0.130             0.073
50    0.046             0.086             0.033             0.135             0.053             0.201             0.087
100   0.047             0.107             0.028             0.190             0.054             0.304             0.106

$R^2 = .10$
10    0.040             0.076             0.032             0.116             0.052             0.166             0.083
20    0.041             0.093             0.028             0.158             0.053             0.244             0.099
50    0.042             0.133             0.020             0.267             0.055             0.442             0.137
100   0.043             0.192             0.014             0.423             0.058             0.680             0.191

$R^2 = .20$
10    0.030             0.104             0.019             0.202             0.052             0.330             0.121
20    0.032             0.146             0.013             0.318             0.054             0.528             0.163
50    0.034             0.262             0.006             0.599             0.059             0.862             0.272
100   0.036             0.432             0.002             0.857             0.064             0.987             0.429

The statistic $\tilde\theta_1$ is based on induced order statistics with relative ranks evenly spaced in (0, 1); $\tilde\theta_2$ is constructed from induced order statistics ranked in the lowest and highest $100/(m+1)$-percent fractiles; and $\tilde\theta_3$ is constructed from those ranked in the second lowest and second highest $100/(m+1)$-percent fractiles. The $R^2$ is the square of the correlation between $\hat\alpha_i$ and the sorting characteristic.
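Entries of this kind can be approximated numerically. The sketch below is an assumption-laden reconstruction, not the authors' code: it takes the size to be the upper tail of a noncentral chi-square with $n$ degrees of freedom evaluated at $C_{.05}/(1-\rho^2)$, and it takes the noncentrality parameter to be $\lambda = \frac{\rho^2}{1-\rho^2}\sum_k [\Phi^{-1}(\xi_k)]^2$, which is inferred from (13) under normality because the defining equation for $\lambda$ is not part of this section. The helper names are hypothetical, and only the Python standard library is used.

```python
import math
from statistics import NormalDist


def chi2_sf(x, df):
    """P(chi-square_df > x), using the power series for the lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= h / (a + k)
        total += term
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a + 1.0))
    return max(0.0, 1.0 - lower)


def chi2_quantile(p, df):
    """Find x with P(chi-square_df <= x) = p by bisection."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - chi2_sf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ncx2_sf(x, df, lam):
    """P(noncentral chi-square_df(lam) > x) as a Poisson(lam/2) mixture of central tails."""
    h = lam / 2.0
    if h == 0.0:
        return chi2_sf(x, df)
    total, mass, j = 0.0, 0.0, 0
    while mass < 1.0 - 1e-12 and j < 5000:
        w = math.exp(-h + j * math.log(h) - math.lgamma(j + 1.0))
        total += w * chi2_sf(x, df + 2 * j)
        mass += w
        j += 1
    return total


def size_of_test(xis, rho2, level=0.05):
    """Assumed size formula: P(chi2_n(lam) > C / (1 - rho2)), with rho2 < 1."""
    n = len(xis)
    crit = chi2_quantile(1.0 - level, n)
    inv = NormalDist().inv_cdf
    lam = rho2 / (1.0 - rho2) * sum(inv(x) ** 2 for x in xis)
    return ncx2_sf(crit / (1.0 - rho2), n, lam)
```

For evenly spaced ranks with $n = 10$ and $R^2 = .10$, that is `size_of_test([k / 11 for k in range(1, 11)], 0.10)`, the result should land close to the 0.040 reported in the first column of the table.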
When the $R^2$ increases to 10 percent the bias becomes more important. Although tests based on a set of securities with evenly spaced characteristics still have sizes approximately equal to their nominal 5 percent value, the size deviates more substantially when securities with extreme characteristics are used. For example, the size of the $\tilde\theta_2$ test that uses the 100 securities in the lowest and highest characteristic decile is 42.3 percent! In comparison, the 5 percent test based on the second lowest and second highest deciles exhibits only a 5.8 percent rejection rate. These patterns become even more pronounced for $R^2$'s higher than 10 percent.

The intuition for these results may be found in (8): the more extreme induced order statistics have means farther away from zero; hence, a statistic based on evenly distributed $\hat\alpha_{[i:N]}$'s will not provide evidence against the null hypothesis $\alpha_i = 0$. If the relative ranks are