Statistical Science 1996, Vol. 11, No. 3, 189–228

Bootstrap Confidence Intervals

Thomas J. DiCiccio and Bradley Efron

Abstract. This article surveys bootstrap methods for producing good approximate confidence intervals. The goal is to improve by an order of magnitude upon the accuracy of the standard intervals θ̂ ± z^(α)σ̂, in a way that allows routine application even to very complicated problems. Both theory and examples are used to show how this is done. The first seven sections provide a heuristic overview of four bootstrap confidence interval procedures: BCa, bootstrap-t, ABC and calibration. Sections 8 and 9 describe the theory behind these methods, and their close connection with the likelihood-based confidence interval theory developed by Barndorff-Nielsen, Cox and Reid and others.

Key words and phrases: Bootstrap-t, BCa and ABC methods, calibration, second-order accuracy.

Thomas J. DiCiccio is Associate Professor, Department of Social Statistics, 358 Ives Hall, Cornell University, Ithaca, New York 14853-3901 (e-mail: tjd9@cornell.edu). Bradley Efron is Professor, Department of Statistics and Department of Health Research and Policy, Stanford University, Stanford, California 94305-4065 (e-mail: brad@playfair.stanford.edu).

1. INTRODUCTION

Confidence intervals have become familiar friends in the applied statistician's collection of data-analytic tools. They combine point estimation and hypothesis testing into a single inferential statement of great intuitive appeal. Recent advances in statistical methodology allow the construction of highly accurate approximate confidence intervals, even for very complicated probability models and elaborate data structures. This article discusses bootstrap methods for constructing such intervals in a routine, automatic way.

Two distinct approaches have guided confidence interval construction since the 1930's. A small catalogue of exact intervals has been built up for special situations, like the ratio of normal means or a single binomial parameter. However, most confidence intervals are approximate, with by far the favorite approximation being the standard interval

(1.1)  θ̂ ± z^(α) σ̂.

Here θ̂ is a point estimate of the parameter of interest θ, σ̂ is an estimate of θ̂'s standard deviation, and z^(α) is the 100αth percentile of a normal deviate, z^(0.95) = 1.645 and so on. Often, and always in this paper, θ̂ and σ̂ are obtained by maximum likelihood theory.

The standard intervals, as implemented by maximum likelihood theory, are a remarkably useful tool. The method is completely automatic: the statistician inputs the data, the class of possible probability models and the parameter of interest; a computer algorithm outputs the intervals (1.1), with no further intervention required. This is in notable contrast to the construction of an exact interval, which requires clever thought on a problem-by-problem basis when it is possible at all.

The trouble with standard intervals is that they are based on an asymptotic approximation that can be quite inaccurate in practice. The example below illustrates what every applied statistician knows, that (1.1) can differ considerably from exact intervals in those cases where exact intervals exist. Over the years statisticians have developed tricks for improving (1.1), involving bias-corrections and parameter transformations. The bootstrap confidence intervals that we will discuss here can be thought of as automatic algorithms for carrying out these improvements without human intervention. Of course they apply as well to situations so complicated that they lie beyond the power of traditional analysis.

We begin with a simple example, where we can compare the bootstrap methods with an exact interval. Figure 1 shows the cd4 data: 20 HIV-positive subjects received an experimental antiviral drug; cd4 counts in hundreds were recorded for each subject at baseline and after one year of treatment, giving data, say, x_i = (B_i, A_i) for i = 1, 2, ..., 20. The data are listed in Table 1. The two measurements are highly correlated, having sample correlation coefficient θ̂ = 0.723.

Fig. 1. The cd4 data; cd4 counts in hundreds for 20 subjects, at baseline and after one year of treatment with an experimental anti-viral drug; numerical values appear in Table 1.

What if we wish to construct a confidence interval for the true correlation θ? We can find an exact interval for θ if we are willing to assume bivariate normality for the (B_i, A_i) pairs,

(1.2)  (B_i, A_i) ~ i.i.d. N_2(λ, Γ)  for i = 1, 2, ..., 20,

where λ and Γ are the unknown expectation vector and covariance matrix. The exact central 90% interval is

(1.3)  (θ̂_EXACT[0.05], θ̂_EXACT[0.95]) = (0.47, 0.86).

This notation emphasizes that a two-sided interval is intended to give correct coverage at both endpoints, two 0.05 noncoverage probabilities in this case, not just an overall 0.10 noncoverage probability.

The left panel of Table 2 shows the exact and standard intervals for the correlation coefficient of the cd4 data, assuming the normal model (1.2). Also shown are approximate confidence intervals based on three different (but closely related) bootstrap methods: ABC, BCa and bootstrap-t. The ABC and BCa methods match the exact interval to two decimal places, and all of the bootstrap intervals are more accurate than the standard. The examples and theory that follow are intended to show that this is no accident.
The bootstrap methods make computer-based adjustments to the standard interval endpoints that are guaranteed to improve the coverage accuracy by an order of magnitude, at least asymptotically.

Table 1
The cd4 data, as plotted in Figure 1

Subject  Baseline  One year    Subject  Baseline  One year
   1       2.12      2.47        11       4.15      4.74
   2       4.35      4.61        12       3.56      3.29
   3       3.39      5.26        13       3.39      5.55
   4       2.51      3.02        14       1.88      2.82
   5       4.04      6.36        15       2.56      4.23
   6       5.10      5.93        16       2.96      3.23
   7       3.77      3.93        17       2.49      2.56
   8       3.35      4.09        18       3.03      4.31
   9       4.10      4.88        19       2.66      4.37
  10       3.35      3.81        20       3.00      2.40

The exact interval endpoints [0.47, 0.86] are defined by the fact that they "cover" the observed value θ̂ = 0.723 with the appropriate probabilities,

(1.4)  Prob_{θ=0.47}{θ̂ > 0.723} = 0.05

and

(1.5)  Prob_{θ=0.86}{θ̂ > 0.723} = 0.95.

Table 2 shows that the corresponding probabilities for the standard endpoints [0.55, 0.90] are 0.12 and 0.99. The standard interval is far too liberal at its lower endpoint and far too cautious at its upper endpoint. This kind of error is particularly pernicious if the confidence interval is used to test a parameter value of interest like θ = 0.

Table 2 describes the various confidence intervals in terms of their length and right–left asymmetry around the point estimate θ̂,

(1.6)  length = θ̂[0.95] − θ̂[0.05],  shape = (θ̂[0.95] − θ̂) / (θ̂ − θ̂[0.05]).

The standard intervals always have shape equal to 1.00. It is in this way that they err most seriously. For example, the exact normal-theory interval for Corr has shape equal to 0.52, extending twice as far to the left of θ̂ = 0.723 as to the right. The standard interval is much too optimistic about ruling out values of θ below θ̂, and much too pessimistic about ruling out values above θ̂. This kind of error is automatically identified and corrected by all the bootstrap confidence interval methods.

There is no compelling reason to assume bivariate normality for the data in Figure 1. A nonparametric version of (1.2) assumes that the pairs (B_i, A_i)
are a random sample ("i.i.d.") from some unknown bivariate distribution F,

(1.7)  (B_i, A_i) ~ i.i.d. F,  i = 1, 2, ..., n,

n = 20, without assuming that F belongs to any particular parametric family. Bootstrap-based confidence intervals such as ABC are available for nonparametric situations, as discussed in Section 6. In theory they enjoy the same second-order accuracy as in parametric problems. However, in some nonparametric confidence interval problems that have been examined carefully, the small-sample advantages of the bootstrap methods have been less striking than in parametric situations. Methods that give third-order accuracy, like the bootstrap calibration of an ABC interval, seem to be more worthwhile in the nonparametric framework (see Section 6).

In most problems and for most parameters there will not exist exact confidence intervals. This great gray area has been the province of the standard intervals for at least 70 years. Bootstrap confidence intervals provide a better approximation to exactness in most situations.

Table 2
Exact and approximate confidence intervals for the correlation coefficient, cd4 data; θ̂ = 0.723. The bootstrap methods ABC, BCa, bootstrap-t and calibrated ABC are explained in Sections 2–7; the ABC and BCa intervals are close to exact in the normal theory situation (left panel); the standard interval errs badly at both endpoints, as can be seen from the coverage probabilities in the bottom rows.

                     Normal theory                            Nonparametric
         Exact  ABC   BCa   Bootstrap-t  Standard   ABC   BCa   Bootstrap-t  Calibrated  Standard
0.05     0.47   0.47  0.47  0.45         0.55       0.56  0.55  0.51         0.56        0.59
0.95     0.86   0.86  0.86  0.87         0.90       0.83  0.85  0.86         0.83        0.85
Length   0.39   0.39  0.39  0.42         0.35       0.27  0.30  0.35         0.27        0.26
Shape    0.52   0.52  0.54  0.52         1.00       0.67  0.70  0.63         0.67        1.00
Cov 05   0.05   0.05  0.05  0.04         0.12
Cov 95   0.95   0.95  0.95  0.97         0.99
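As a check on the standard column of Table 2, the sketch below recomputes θ̂ and the standard interval (1.1) for the correlation coefficient from the Table 1 data. The standard error formula σ̂ = (1 − θ̂²)/√n is an assumption on our part (a familiar asymptotic approximation for a sample correlation; the paper obtains σ̂ from maximum likelihood theory), but it reproduces the standard endpoints [0.55, 0.90] quoted in the text.

```python
import math

# cd4 data from Table 1: counts in hundreds at baseline and after one year.
baseline = [2.12, 4.35, 3.39, 2.51, 4.04, 5.10, 3.77, 3.35, 4.10, 3.35,
            4.15, 3.56, 3.39, 1.88, 2.56, 2.96, 2.49, 3.03, 2.66, 3.00]
oneyear  = [2.47, 4.61, 5.26, 3.02, 6.36, 5.93, 3.93, 4.09, 4.88, 3.81,
            4.74, 3.29, 5.55, 2.82, 4.23, 3.23, 2.56, 4.31, 4.37, 2.40]
n = len(baseline)

mb, ma = sum(baseline) / n, sum(oneyear) / n
sbb = sum((b - mb) ** 2 for b in baseline) / n
saa = sum((a - ma) ** 2 for a in oneyear) / n
sba = sum((b - mb) * (a - ma) for b, a in zip(baseline, oneyear)) / n
theta_hat = sba / math.sqrt(sbb * saa)      # sample correlation, about 0.723

# Assumed asymptotic standard error of a correlation coefficient.
sigma_hat = (1 - theta_hat ** 2) / math.sqrt(n)
z = 1.645                                   # z^(0.95), central 90% interval
lo, hi = theta_hat - z * sigma_hat, theta_hat + z * sigma_hat
print(round(theta_hat, 3), round(lo, 2), round(hi, 2))
```

The point of the exercise is only that the standard interval is symmetric around θ̂ by construction; nothing in the formula can produce the right–left asymmetry that the exact interval (1.3) shows.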
Table 3 refers to the parameter θ defined as the maximum eigenvalue of the covariance matrix of (B, A) in the cd4 experiment,

(1.8)  θ = maximum eigenvalue {cov(B, A)}.

The maximum likelihood estimate (MLE) of θ, assuming either model (1.2) or (1.7), is θ̂ = 1.68. The bootstrap intervals extend further to the right than to the left of θ̂ in this case, more than 2.5 times as far under the normal model. Even though we have no exact endpoint to serve as a "gold standard" here, the theory that follows strongly suggests the superiority of the bootstrap intervals. Bootstrapping involves much more computation than the standard intervals, on the order of 1,000 times more, but the algorithms are completely automatic, requiring no more thought for the maximum eigenvalue than for the correlation coefficient, or for any other parameter.

One of the achievements of the theory discussed in Section 8 is to provide a reasonable theoretical gold standard for approximate confidence intervals. Comparison with this gold standard shows that the bootstrap intervals are not only asymptotically more accurate than the standard intervals, they are also more correct. "Accuracy" refers to the coverage errors: a one-sided bootstrap interval of intended coverage α actually covers θ with probability α + O(1/n), where n is the sample size. This is second-order accuracy, compared to the slower first-order accuracy of the standard intervals, with coverage probabilities α + O(1/√n). However, confidence intervals are supposed to be inferentially correct as well as accurate. Correctness is a harder property to pin down, but it is easy to give examples of incorrectness: if x_1, x_2, ..., x_n is a random sample from a normal distribution N(θ, 1), then (min(x_i), max(x_i)) is an exactly accurate two-sided confidence interval for θ of coverage probability 1 − 1/2^(n−1), but it is incorrect. The theory of Section 8 shows that all of our better confidence intervals are second-order correct as well as second-order accurate.
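The eigenvalue estimate θ̂ = 1.68 is easy to verify numerically; the sketch below forms the MLE covariance matrix from the Table 1 data (dividing by n, not n − 1) and takes the larger root of the 2 × 2 characteristic polynomial.

```python
import math

# cd4 data from Table 1: counts in hundreds at baseline and after one year.
baseline = [2.12, 4.35, 3.39, 2.51, 4.04, 5.10, 3.77, 3.35, 4.10, 3.35,
            4.15, 3.56, 3.39, 1.88, 2.56, 2.96, 2.49, 3.03, 2.66, 3.00]
oneyear  = [2.47, 4.61, 5.26, 3.02, 6.36, 5.93, 3.93, 4.09, 4.88, 3.81,
            4.74, 3.29, 5.55, 2.82, 4.23, 3.23, 2.56, 4.31, 4.37, 2.40]
n = len(baseline)

mb, ma = sum(baseline) / n, sum(oneyear) / n
# MLE covariance entries divide by n; this is the source of the small
# downward bias in the bootstrap replications noted in Section 2.
sbb = sum((b - mb) ** 2 for b in baseline) / n
saa = sum((a - ma) ** 2 for a in oneyear) / n
sba = sum((b - mb) * (a - ma) for b, a in zip(baseline, oneyear)) / n

# Largest eigenvalue of the symmetric 2x2 matrix [[sbb, sba], [sba, saa]].
tr, det = sbb + saa, sbb * saa - sba ** 2
theta_hat = tr / 2 + math.sqrt(tr ** 2 / 4 - det)
print(theta_hat)    # the paper reports theta_hat = 1.68
```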
We can see this improvement over the standard intervals on the left side of Table 2. The theory says that this improvement exists also in those cases like Table 3 where we cannot see it directly.

Table 3
Approximate 90% central confidence intervals for the maximum eigenvalue parameter (1.8), cd4 data; the bootstrap intervals extend much further to the right of the MLE θ̂ = 1.68 than to the left.

              Normal theory              Nonparametric
         ABC    BCa    Standard    ABC    BCa    Calibrated  Standard
0.05     1.11   1.10   0.80        1.15   1.14   1.16        1.01
0.95     3.25   3.18   2.55        2.56   2.55   3.08        2.35
Length   2.13   2.08   1.74        1.42   1.41   1.92        1.34
Shape    2.80   2.62   1.00        1.70   1.64   2.73        1.00

2. THE BCa INTERVALS

The next six sections give a heuristic overview of bootstrap confidence intervals. More examples are presented, showing how bootstrap intervals can be routinely constructed even in very complicated and messy situations. Section 8 derives the second-order properties of the bootstrap intervals in terms of asymptotic expansions. Comparisons with likelihood-based methods are made in Section 9. The bootstrap can be thought of as a convenient way of executing the likelihood calculations in parametric exponential family situations and even in nonparametric problems.

The bootstrap was introduced as a nonparametric device for estimating standard errors and biases. Confidence intervals are inherently more delicate inference tools. A considerable amount of effort has gone into upgrading bootstrap methods to the level of precision required for confidence intervals.

The BCa method is an automatic algorithm for producing highly accurate confidence limits from a bootstrap distribution. Its effectiveness was demonstrated in Table 2. References include Efron (1987), Hall (1988), DiCiccio (1984), DiCiccio and Romano (1995) and Efron and Tibshirani (1993). A program written in the language S is available [see the note in the second paragraph following (4.14)].

The goal of bootstrap confidence interval theory is to calculate dependable confidence limits for a parameter of interest θ from the bootstrap distribution of θ̂. Figure 2 shows two such bootstrap distributions relating to the maximum eigenvalue parameter θ for the cd4 data, (1.8). The nonparametric bootstrap distribution (on the right) will be discussed in Section 6.

The left panel is the histogram of 2,000 normal-theory bootstrap replications of θ̂. Each replication was obtained by drawing a bootstrap data set analogous to (1.2),

(2.1)  (B*_i, A*_i) ~ i.i.d. N_2(λ̂, Γ̂),  i = 1, 2, ..., 20,

and then computing θ̂*, the maximum likelihood estimate (MLE) of θ based on the bootstrap data.
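The resampling step (2.1) can be sketched in a few lines. The values plugged in for λ̂ and Γ̂ below are rough approximations read off the Table 1 data, not the paper's own stored estimates, and random.gauss with a hand-rolled 2 × 2 Cholesky factor stands in for a multivariate normal sampler; so the replication mean and standard deviation will only approximately match the 1.61 and 0.52 reported for the left panel of Figure 2.

```python
import math
import random

random.seed(0)

# Approximate MLE inputs read off Table 1 (assumed values, not the paper's).
mu_b, mu_a = 3.29, 4.09
sbb, saa, sba = 0.62, 1.28, 0.65   # entries of Gamma_hat (dividing by n)

def max_eigenvalue(sxx, syy, sxy):
    """Larger eigenvalue of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]."""
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    return tr / 2 + math.sqrt(tr ** 2 / 4 - det)

# Cholesky factor of Gamma_hat, so we can sample from N2(lambda_hat, Gamma_hat).
l11 = math.sqrt(sbb)
l21 = sba / l11
l22 = math.sqrt(saa - l21 ** 2)

B, n = 2000, 20
reps = []
for _ in range(B):
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        xs.append(mu_b + l11 * z1)
        ys.append(mu_a + l21 * z1 + l22 * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / n
    cyy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    reps.append(max_eigenvalue(cxx, cyy, cxy))   # one replication theta_hat_star

mean = sum(reps) / B
sd = math.sqrt(sum((r - mean) ** 2 for r in reps) / (B - 1))
print(round(mean, 2), round(sd, 2))
```

The downward drift of the replication mean below the plugged-in maximum eigenvalue (about 1.68) is the divide-by-n bias discussed next.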
In other words, θ̂* was the maximum eigenvalue of the empirical covariance matrix of the 20 pairs (B*_i, A*_i). The mean vector λ̂ and covariance matrix Γ̂ in (2.1) were the usual maximum likelihood estimates for λ and Γ, based on the original data in Figure 1. Relation (2.1) is a parametric bootstrap sample, obtained by sampling from a parametric MLE for the unknown distribution F. Section 6 discusses nonparametric bootstrap samples and confidence intervals.

The 2,000 bootstrap replications θ̂* had standard deviation 0.52. This is the bootstrap estimate of standard error for θ̂, generally a more dependable standard error estimate than the usual parametric delta-method value (see Efron, 1981). The mean of the 2,000 values was 1.61, compared to θ̂ = 1.68, indicating a small downward bias in the Maxeig statistic. In this case it is easy to see that the downward bias comes from dividing by n instead of n − 1 in obtaining the MLE Γ̂ of the covariance matrix.

Two thousand bootstrap replications is 10 times too many for estimating a standard error, but not too many for the more delicate task of setting confidence intervals. These bootstrap sample size calculations appear in Efron (1987, Section 9).

The BCa procedure is a method of setting approximate confidence intervals for θ from the percentiles of the bootstrap histogram. Suppose θ is a parameter of interest; θ̂(x) is an estimate of θ based on the observed data x; and θ̂* = θ̂(x*) is a bootstrap replication of θ̂ obtained by resampling x* from an estimate of the distribution governing x. Let Ĝ(c) be the cumulative distribution function (c.d.f.) of B bootstrap replications θ̂*(b),

(2.2)  Ĝ(c) = #{θ̂*(b) < c}/B.

In our case B = 2,000. The upper endpoint θ̂_BCa[α] of a one-sided level-α BCa interval, θ ∈ (−∞, θ̂_BCa[α]], is defined in terms of Ĝ and two numerical parameters discussed below: the bias-correction z0 and the acceleration a (BCa stands for "bias-corrected and accelerated").
By definition the BCa endpoint is

(2.3)  θ̂_BCa[α] = Ĝ^(−1)( Φ( z0 + (z0 + z^(α)) / (1 − a(z0 + z^(α))) ) ).

Here Φ is the standard normal c.d.f., with z^(α) = Φ^(−1)(α) as before. The central 0.90 BCa interval is given by (θ̂_BCa[0.05], θ̂_BCa[0.95]). Formula (2.3) looks strange, but it is well motivated by the transformation and asymptotic arguments that follow.
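A minimal implementation of (2.3) is straightforward once the replications and the constants (z0, a) are in hand. In the sketch below Ĝ^(−1) is taken as a crude empirical quantile of the sorted replications (an assumption of convenience; interpolation schemes vary), and the adjusted percentile levels for the cd4 eigenvalue problem's estimates (z0, a) = (0.226, 0.105) are checked against the values 0.157 and 0.995 used in (2.4).

```python
from statistics import NormalDist

nd = NormalDist()

def bca_endpoint(reps, alpha, z0, a):
    """Upper endpoint of a one-sided level-alpha BCa interval, formula (2.3):
    G^{-1}( Phi( z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha)) ) )."""
    w = z0 + nd.inv_cdf(alpha)               # z0 + z^(alpha)
    level = nd.cdf(z0 + w / (1 - a * w))     # adjusted percentile level
    srt = sorted(reps)
    k = min(len(srt) - 1, max(0, int(level * len(srt))))  # crude G^{-1}
    return srt[k]

# With z0 = a = 0, the endpoint is just the 100*alpha-th bootstrap percentile.
# With (z0, a) = (0.226, 0.105), the adjusted levels at alpha = 0.05 and 0.95
# are those appearing in (2.4).
z0, a = 0.226, 0.105
levels = []
for alpha in (0.05, 0.95):
    w = z0 + nd.inv_cdf(alpha)
    levels.append(nd.cdf(z0 + w / (1 - a * w)))
print([round(v, 3) for v in levels])         # [0.157, 0.995]
```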
Fig. 2. Bootstrap distributions for the maximum eigenvalue of the covariance matrix, cd4 data: (left) 2,000 parametric bootstrap replications assuming a bivariate normal distribution; (right) 2,000 nonparametric bootstrap replications, discussed in Section 6. The solid lines indicate the limits of the BCa 0.90 central confidence intervals, compared to the standard intervals (dashed lines).

If a and z_0 are zero, then θ̂_BCa[α] = Ĝ^{-1}(α), the 100αth percentile of the bootstrap replications. In this case the 0.90 BCa interval is the interval between the 5th and 95th percentiles of the bootstrap replications. If in addition Ĝ is perfectly normal, then θ̂_BCa[α] = θ̂ + z^(α)σ̂, the standard interval endpoint. In general, (2.3) makes three distinct corrections to the standard intervals, improving their coverage accuracy from first to second order.

The c.d.f. Ĝ is markedly long-tailed to the right, on the normal-theory side of Figure 2. Also a and z_0 are both estimated to be positive, (â, ẑ_0) = (0.105, 0.226), further shifting θ̂_BCa[α] to the right of θ̂_STAN[α] = θ̂ + z^(α)σ̂. The 0.90 BCa interval for θ is

(2.4)  (Ĝ^{-1}(0.157), Ĝ^{-1}(0.995)) = (1.10, 3.18),

compared to the standard interval (0.80, 2.55).

The following argument motivates the BCa definition (2.3), as well as the parameters a and z_0. Suppose that there exists a monotone increasing transformation φ = m(θ) such that φ̂ = m(θ̂) is normally distributed for every choice of θ, but possibly with a bias and a nonconstant variance,

(2.5)  φ̂ ~ N(φ − z_0 σ_φ, σ_φ²),  σ_φ = 1 + aφ.

Then (2.3) gives exactly accurate and correct confidence limits for θ having observed θ̂.

The argument in Section 3 of Efron (1987) shows that in situation (2.5) there is another monotone transformation, say ξ = M(θ) and ξ̂ = M(θ̂), such that ξ̂ = ξ + W for all values of ξ, with W always having the same distribution.
This is a translation problem, so we know how to set confidence limits ξ̂[α] for ξ,

(2.6)  ξ̂[α] = ξ̂ − W^(1−α),

where W^(1−α) is the 100(1 − α)th percentile of W. The BCa interval (2.3) is exactly equivalent to the translation interval (2.6), and in this sense it is correct as well as accurate.

The bias-correction constant z_0 is easy to interpret in (2.5) since

(2.7)  Prob{φ̂ < φ} = Φ(z_0).

Then Prob{θ̂ < θ} = Φ(z_0) because of monotonicity. The BCa algorithm, in its simplest form, estimates z_0 by

(2.8)  ẑ_0 = Φ^{-1}( #{θ̂*(b) < θ̂} / B ),

Φ^{-1} of the proportion of the bootstrap replications less than θ̂. Of the 2,000 normal-theory bootstrap replications θ̂* shown in the left panel of Figure 2, 1,179 were less than θ̂ = 1.68. This gave ẑ_0 = Φ^{-1}(0.5895) = 0.226, a positive bias correction since θ̂* is biased downward relative to θ̂. An often more accurate method of estimating z_0 is described in Section 4.

The acceleration a in (2.5) measures how quickly the standard error is changing on the normalized scale. The value â = 0.105 in (2.4), obtained from