A Probabilistic Proof of the Lindeberg-Feller Central Limit Theorem

Larry Goldstein

1 INTRODUCTION.

The Central Limit Theorem, one of the most striking and useful results in probability and statistics, explains why the normal distribution appears in areas as diverse as gambling, measurement error, sampling, and statistical mechanics. In essence, the Central Limit Theorem states that the normal distribution applies whenever one is approximating probabilities for a quantity which is a sum of many independent contributions, all of which are roughly the same size. It is the Lindeberg-Feller Theorem which makes this statement precise by providing the sufficient, and in some sense necessary, Lindeberg condition whose satisfaction accounts for the ubiquitous appearance of the bell shaped normal.

Generally the Lindeberg condition is handled using Fourier methods and is somewhat hard to interpret from the classical point of view. Here we provide a simpler, equivalent, and more easily interpretable probabilistic formulation of the Lindeberg condition, and demonstrate its sufficiency and partial necessity in the Central Limit Theorem using more elementary means.

The seeds of the Central Limit Theorem, or CLT, lie in the work of Abraham de Moivre, who, in 1733, not being able to secure himself an academic appointment, supported himself consulting on problems of probability and gambling. He approximated the limiting probabilities of the binomial distribution, the one which governs the behavior of the number $S_n$ of successes in an experiment which consists of $n$ independent trials, each one having the same probability $p \in (0,1)$ of success.

Each individual trial of the experiment can be modelled by $X$, a (Bernoulli) random variable which records one for each success, and zero for each failure,
$$P(X=1) = p \quad \mbox{and} \quad P(X=0) = 1-p,$$
and has mean $EX = p$ and variance $\mbox{Var}(X) = p(1-p)$.
The record of successes and failures in $n$ independent trials is then given by an independent sequence $X_1, X_2, \ldots, X_n$ of these Bernoulli variables, and the total number of successes $S_n$ by their sum
$$S_n = X_1 + \cdots + X_n. \quad (1)$$
Exactly, $S_n$ has the binomial distribution, which specifies that
$$P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \mbox{for } k = 0, 1, \ldots, n.$$
For even moderate values of $n$, managing the binomial coefficients $\binom{n}{k}$ becomes unwieldy, to say nothing of computing the sum which yields the cumulative probability
$$P(S_n \le m) = \sum_{k \le m} \binom{n}{k} p^k (1-p)^{n-k}$$
that there will be $m$ or fewer successes. The great utility of the CLT is in providing an easily computable approximation to such probabilities that can be quite accurate even for moderate values of $n$.

Standardizing the binomial $S_n$ by subtracting its mean and dividing by its standard deviation to obtain the mean zero, variance one random variable $W_n = (S_n - np)/\sqrt{np(1-p)}$, the CLT yields that
$$\forall x \quad \lim_{n \rightarrow \infty} P(W_n \le x) = P(Z \le x) \quad (2)$$
where $Z$ is ${\cal N}(0,1)$, a standard, mean zero variance one, normal random variable, that is, the one with distribution function
$$\Phi(x) = \int_{-\infty}^x \varphi(u)\,du \quad \mbox{where} \quad \varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}u^2\right). \quad (3)$$
We may therefore, for instance, approximate the cumbersome cumulative binomial probability $P(S_n \le m)$ by the simpler $\Phi((m-np)/\sqrt{np(1-p)})$.

It was only for the special case of the binomial that the normal approximation was first considered. Only many years later, with the work of Laplace around 1820, did it begin to be systematically realized that the same normal limit is obtained when the underlying Bernoulli variables are replaced by any variables with a finite variance. The result was the classical Central Limit Theorem, which states that (2) holds whenever
$$W_n = (S_n - n\mu)/\sqrt{n\sigma^2}$$
is the standardization of a sum $S_n$, as in (1), of independent and identically distributed random variables, each with mean $\mu$ and variance $\sigma^2$.
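To see the quality of this approximation in practice, here is a minimal sketch in Python (the parameter choices $n = 100$, $p = 0.3$, and $m = 35$ are illustrative, not from the text), comparing the exact cumulative binomial probability with $\Phi((m-np)/\sqrt{np(1-p)})$.

```python
from math import comb, erf, sqrt

def binom_cdf(m, n, p):
    """Exact P(S_n <= m), summing the binomial probabilities directly."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1))

def Phi(x):
    """Standard normal distribution function, written via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Illustrative values only: n trials, success probability p, threshold m.
n, p, m = 100, 0.3, 35
exact = binom_cdf(m, n, p)
approx = Phi((m - n * p) / sqrt(n * p * (1 - p)))
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Replacing $m$ by $m + \frac{1}{2}$ inside $\Phi$, the usual continuity correction, typically improves the agreement further.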
From this generalization it now becomes somewhat clearer why various distributions observed in nature which may not be at all related to the binomial, such as the errors of measurement averages, or the heights of individuals in a sample, take on the bell shaped form: each observation is the result of summing many small independent factors.

A further extension of the classical CLT could yet come. In situations where the summands do not have identical distributions, can the normal curve still govern? For an example, consider the symmetric group $S_n$, the set of all permutations $\pi$ on the set $\{1, 2, \ldots, n\}$. We can represent $\pi \in S_7$, for example, by the two line notation
$$\pi = \left( \begin{array}{ccccccc} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 3 & 7 & 6 & 5 & 1 & 2 \end{array} \right),$$
from which one can read that $\pi(1) = 4$ and $\pi(4) = 6$. This permutation can also be represented in the cycle notation
$$\pi = (1,4,6)(2,3,7)(5),$$
with the meaning that $\pi$ maps 1 to 4, 4 to 6, 6 to 1, and so forth. From the cycle representation we see that $\pi$ has two cycles of length 3 and one of length 1, for a total of three cycles.

In general, let $K_n(\pi)$ denote the total number of cycles in a permutation $\pi \in S_n$. If $\pi$ is chosen uniformly from all the $n!$ permutations in $S_n$, does the Central Limit Theorem imply that $K_n(\pi)$ is approximately normally distributed for large $n$? To answer this question we will employ the Feller coupling [3], which constructs a random permutation $\pi$ uniformly from $S_n$ with the help of $n$ independent Bernoulli variables $X_1, \ldots, X_n$ with distributions
$$P(X_i = 0) = 1 - \frac{1}{i} \quad \mbox{and} \quad P(X_i = 1) = \frac{1}{i}, \quad i = 1, \ldots, n. \quad (4)$$
Begin the first cycle at stage 1 with the element 1. At stage $i$, $i = 1, \ldots, n$, if $X_{n-i+1} = 1$ close the current cycle and begin a new one starting with the smallest number not yet in any cycle, and otherwise choose an element uniformly from those yet unused and place it to the right of the last element in the current cycle. In this way at stage $i$ we complete a cycle with probability $1/(n-i+1)$, upon mapping the last element of the current cycle to the one which begins it.
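This construction is short enough to implement directly. Below is a minimal Python sketch (the function name feller_permutation is ours, not from the text); each call returns the cycles of a uniform random permutation of $\{1, \ldots, n\}$, one cycle being completed at each stage $i$ where the closing event of probability $1/(n-i+1)$ occurs.

```python
import random

def feller_permutation(n, rng=random):
    """Uniform random permutation of {1, ..., n} in cycle notation, built by
    the Feller coupling: stage i closes the current cycle w.p. 1/(n - i + 1)."""
    unused = list(range(2, n + 1))      # element 1 begins the first cycle
    current = [1]
    cycles = []
    for i in range(1, n + 1):
        if rng.randrange(n - i + 1) == 0:   # the event X_{n-i+1} = 1
            cycles.append(current)          # close the current cycle
            if unused:
                current = [unused.pop(0)]   # begin anew at the smallest unused
        else:
            j = rng.randrange(len(unused))  # otherwise extend the cycle
            current.append(unused.pop(j))   # by a uniformly chosen element
    return cycles

# For instance, len(feller_permutation(7)) is one realization of K_7(pi).
```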
As the total number $K_n(\pi)$ of cycles of $\pi$ is exactly the number of times an element closes the loop upon completing its own cycle,
$$K_n(\pi) = X_1 + \cdots + X_n, \quad (5)$$
a sum of independent, but not identically distributed, random variables. Hence, despite the similarity of (5) to (1), the hypotheses of the classical central limit theorem do not hold. Nevertheless, in 1922 Lindeberg [7] provided a general condition which can be applied in this case to show that $K_n(\pi)$ is asymptotically normal.

To explore Lindeberg's condition, first consider the proper standardization of $K_n(\pi)$ in our example. As any Bernoulli random variable with success probability $p$ has mean $p$ and variance $p(1-p)$, the Bernoulli variable $X_i$ in (4) has mean $i^{-1}$ and variance $i^{-1}(1-i^{-1})$ for $i = 1, \ldots, n$. Thus,
$$h_n = \sum_{i=1}^n \frac{1}{i} \quad \mbox{and} \quad \sigma_n^2 = \sum_{i=1}^n \left( \frac{1}{i} - \frac{1}{i^2} \right) \quad (6)$$
are the mean and variance of $K_n(\pi)$, respectively; the mean $h_n$ is known as the $n^{\rm th}$ harmonic number. In particular, standardizing $K_n(\pi)$ to have mean zero and variance 1 results in
$$W_n = \frac{K_n(\pi) - h_n}{\sigma_n},$$
which, absorbing the scaling inside the sum, can be written as
$$W_n = \sum_{i=1}^n X_{i,n} \quad \mbox{where} \quad X_{i,n} = \frac{X_i - i^{-1}}{\sigma_n}. \quad (7)$$

In general, it is both more convenient and more encompassing to deal not with a sequence of variables but rather with a triangular array as in (7) which satisfies the following condition.

Condition 1.1 For every $n = 1, 2, \ldots$, the random variables making up the collection ${\cal X}_n = \{X_{i,n} : 1 \le i \le n\}$ are independent with mean zero and finite variances $\sigma_{i,n}^2 = \mbox{Var}(X_{i,n})$, standardized so that
$$W_n = \sum_{i=1}^n X_{i,n} \quad \mbox{has variance} \quad \mbox{Var}(W_n) = \sum_{i=1}^n \sigma_{i,n}^2 = 1.$$
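Because (5) expresses $K_n(\pi)$ as a sum of the independent Bernoulli variables of (4), the standardization (7) can be checked numerically without constructing any permutations. The sketch below (Python; the values $n = 1000$ and 20000 replications are arbitrary choices of ours) estimates $P(W_n \le 1)$, which for large $n$ should be close to $\Phi(1) \approx 0.8413$.

```python
import math, random

def harmonic_mean_var(n):
    """The mean h_n and variance sigma_n^2 of K_n(pi), as in display (6)."""
    h = sum(1.0 / i for i in range(1, n + 1))
    v = sum(1.0 / i - 1.0 / i**2 for i in range(1, n + 1))
    return h, v

def sample_Kn(n):
    """One draw of K_n(pi) = X_1 + ... + X_n, with P(X_i = 1) = 1/i as in (4)."""
    return sum(random.random() < 1.0 / i for i in range(1, n + 1))

n, reps = 1000, 20000
h, v = harmonic_mean_var(n)
sd = math.sqrt(v)
freq = sum((sample_Kn(n) - h) / sd <= 1.0 for _ in range(reps)) / reps
print(f"empirical P(W_n <= 1) = {freq:.3f}")   # compare with Phi(1) ~ 0.8413
```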
Of course, even under Condition 1.1, some further assumptions must be satisfied by the summand variables for the normal convergence (2) to take place. For instance, if the first variable accounts for some non-vanishing fraction of the total variability, it will strongly influence the limiting distribution, possibly resulting in non-normal convergence. The Lindeberg-Feller central limit theorem, see [4], says that normal convergence (2) holds upon ruling out such situations by imposing the Lindeberg Condition
$$\forall \epsilon > 0 \quad \lim_{n \rightarrow \infty} L_{n,\epsilon} = 0 \quad \mbox{where} \quad L_{n,\epsilon} = \sum_{i=1}^n E\{X_{i,n}^2 {\bf 1}(|X_{i,n}| \ge \epsilon)\}, \quad (8)$$
where for an event $A$, the ‘indicator’ random variable ${\bf 1}(A)$ takes on the value 1 if $A$ occurs, and the value 0 otherwise. Once known to be sufficient, the Lindeberg condition was proved to be partially necessary by Feller and Lévy, independently; see [8] for history. The appearance of the Lindeberg Condition is justified by explanations such as the one given by Feller [4], who roughly says that it requires the individual variances be due mainly to masses in an interval whose length is small in comparison to the overall variance. We present a probabilistic condition which is seemingly simpler, yet equivalent.

Our probabilistic approach to the CLT is through the so called zero bias transformation introduced in [6]. For every distribution with mean zero and finite non-zero variance $\sigma^2$ on a random variable $X$, the zero bias transformation returns the unique ‘$X$-zero biased distribution’ on $X^*$ which satisfies
$$\sigma^2 E f'(X^*) = E[X f(X)] \quad (9)$$
for all absolutely continuous functions $f$ for which these expectations exist.

The existence of a strong connection between the zero bias transformation and the normal distribution is made clear by the characterization of Stein [9], which implies that $X^*$ and $X$ have the same distribution if and only if $X$ has the ${\cal N}(0, \sigma^2)$ distribution, that is, that the normal distribution is the zero bias transformation's unique fixed point.

One way to see the ‘if’ direction of Stein's characterization, that is, why the zero bias transformation fixes the normal, is to note that the density function $\varphi_{\sigma^2}(x) = \sigma^{-1} \varphi(\sigma^{-1} x)$ of a ${\cal N}(0, \sigma^2)$ variable, with $\varphi(x)$ given by (3), satisfies the differential equation with a form ‘conjugate’ to (9),
$$\sigma^2 \varphi_{\sigma^2}'(x) = -x \varphi_{\sigma^2}(x),$$