range δy. Then he defined (p.156)

\[
\ln P' = \sum_{i=1}^{n} \ln f(y_i;\theta)\,\delta y_i \qquad (21)
\]

and interpreted P' to be "proportional to the chance of a given set of observations occurring." Since the factors $\delta y_i$ are independent of $f(y;\theta)$, he stated that the "probability of any particular set of θ's is proportional to P," where

\[
\ln P = \sum_{i=1}^{n} \ln f(y_i;\theta) \qquad (22)
\]

"and the most probable set of values for the θ's will make P a maximum" (p.157). This is in essence Fisher's idea regarding maximum likelihood estimation.[9]

After outlining his method for fitting curves, Fisher applied his criterion to estimate the parameters of a normal density of the form

\[
f(y;\mu,h) = \frac{h}{\sqrt{\pi}}\exp\left[-h^{2}(y-\mu)^{2}\right], \qquad (23)
\]

where $h = 1/(\sigma\sqrt{2})$ in the standard notation of $N(\mu,\sigma^{2})$. He obtained the "most probable values" as[10]

\[
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (24)
\]

and

\[
\hat{h}^{2} = \frac{n}{2\sum_{i=1}^{n}(y_i-\bar{y})^{2}}. \qquad (25)
\]

Fisher's most probable value of h did not match the conventional value, which used $(n-1)$ rather than the $n$ appearing in (25) [see Bennett (1907-1908)]. A brief numerical check of (24)-(25) is sketched below.

[9] We should note that nowhere in Fisher (1912) does he use the word "likelihood." It came much later in Fisher (1921, p.24), and the phrase "method of maximum likelihood" was first used in Fisher (1922, p.323) [also see Edwards (1997a, p.36)]. Fisher (1912) did not refer to the Edgeworth (1908, 1909) inverse probability method, which gives the same estimates, or for that matter to most of the early literature (the paper contained only two references). As Aldrich (1997, p.162) indicated, "nobody" noticed Edgeworth's work until "Fisher had redone it." Le Cam (1990, p.153) settled the debate on who first proposed the maximum likelihood method in the following way: "Opinions on who was the first to propose the method differ. However Fisher is usually credited with the invention of the name 'maximum likelihood', with a major effort intended to spread its use and with the derivation of the optimality properties of the resulting estimates." We can safely say that although the method of maximum likelihood was prefigured in earlier works, it was first presented in its own right, and with a full view of its significance, by Fisher (1912) and later by Fisher (1922).

[10] In fact Fisher did not use the notations $\hat{\mu}$ and $\hat{h}$. Like Karl Pearson, he did not distinguish between the parameter and its estimator. That came much later in Fisher (1922, p.313), when he introduced the concept of a "statistic."
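The following short sketch is not from Fisher or the surrounding literature; the simulated sample, its parameter values, and all variable names are assumptions made purely for illustration. It numerically maximizes the log-likelihood implied by (22)-(23) and compares the maximizer with the closed-form "most probable values" (24)-(25).

```python
# A minimal sketch, assuming a simulated sample: numerically maximize the
# log-likelihood of Fisher's normal density (23) and compare with his
# closed-form "most probable values" (24)-(25).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical sample
n = len(y)

def neg_log_lik(params):
    # ln L = n ln h - (n/2) ln(pi) - h^2 * sum((y - mu)^2), from (22)-(23)
    mu, log_h = params
    h = np.exp(log_h)                          # optimize log h to keep h > 0
    return -(n * np.log(h) - 0.5 * n * np.log(np.pi)
             - h ** 2 * np.sum((y - mu) ** 2))

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_num, h_num = res.x[0], np.exp(res.x[1])

mu_hat = y.mean()                                       # equation (24)
h_hat = np.sqrt(n / (2 * np.sum((y - mu_hat) ** 2)))    # equation (25)

print(mu_num, h_num)   # numerical maximizer of the log-likelihood
print(mu_hat, h_hat)   # Fisher's closed forms; h should be near 1/(1.5*sqrt(2))
```

The two sets of values should agree closely, confirming that (24) and (25) are the stationary points of (22) for the density (23).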
the "variation of h,"and then maximizing the resulting marginal density with respect to h,he found the conventional estimator 2 h= n-1 2∑1(h- (26) Fisher (p.160)interpreted P=Πf) (27) =1 as the "relative probability of the set of values"0,02,...,p.Implicitly,he was basing his arguments on inverse probability (posterior distribution)with noninformative prior.But at the same time he criticized the process of obtaining(26)saying "integration"with respect to u is "illegitimate and has no meaning with respect to inverse probability."Here Fisher's message is very confusing and hard to decipher.11 In spite of these misgivings Fisher(1912)is a remarkable paper given that it was written when Fisher was still an undergraduate.In fact,his idea of the likelihood function (27)played a central role in introducing and crystallizing some of the fundamental concepts in statistics. The history of the minimum chi-squared (x2)method of estimation is even more blurred. Karl Pearson and his associates routinely used the MM to estimate parameters and the x2 statistic (10)to test the adequacy of the fitted model.This state of affairs prompted Hald (1998,p.712)to comment:"One may wonder why he Karl Pearson]did not take further step to minimizing x2 for estimating the parameters."In fact,for a while,nobody took a concrete step in that direction.As discussed in Edwards (1997a)several early papers that advocated this method of estimation could be mentioned:Harris (1912),Engledow and Yule(1914),Smith (1916)and Haldane (1919a,b).Ironically,it was the Fisher (1928)book and its subsequent editions that brought to prominence this estimation procedure.Smith (1916)was probably the first to state explicitly how to obtain parameter estimates using the minimum x2 method.She started with a mild criticism of Pearson's MM(p.11):"It is an undoubtedly utile and accurate method;but the question of whether it gives the best'values of the constant has not been very fully studied."12 Then,without much fanfare she stated(p.12):"From another standpoint, 1In Fisher(1922,p.326)he went further and confessed:"I must indeed plead guilty in my original statement in the Method of Maximum Likelihood Fisher(1912)]to having based my argument upon the principle of inverse probability;in the same paper,it is true,I emphasized the fact that such inverse probabilities were relative only." Aldrich(1997),Edwards(1997b)and Hald(1999)examined Fisher's paradoxical views to "inverse probability" in detail. 12Kirstine Smith was a graduate student in Karl Pearson's laboratory since 1915.In fact her paper ends with the following acknowledgement:"The present paper was worked out in the Biometric Laboratory and I have to thank Professor Pearson for his aid throughout the work."It is quite understandable that she could not be too critical of Pearson's MM. 10
By integrating out μ from (23), Fisher obtained the "variation of h," and then, maximizing the resulting marginal density with respect to h, he found the conventional estimator

\[
\hat{\hat{h}}^{2} = \frac{n-1}{2\sum_{i=1}^{n}(y_i-\bar{y})^{2}}. \qquad (26)
\]

(Integrating the joint density of the observations over μ leaves a marginal density proportional to $h^{n-1}\exp[-h^{2}\sum_{i=1}^{n}(y_i-\bar{y})^{2}]$, whose maximum over h is (26); this is where the factor $n-1$ arises.) Fisher (p.160) interpreted

\[
P = \prod_{i=1}^{n} f(y_i;\theta) \qquad (27)
\]

as the "relative probability of the set of values" $\theta_1,\theta_2,\ldots,\theta_p$. Implicitly, he was basing his arguments on inverse probability (the posterior distribution) with a noninformative prior. But at the same time he criticized the process of obtaining (26), saying that "integration" with respect to μ is "illegitimate and has no meaning with respect to inverse probability." Here Fisher's message is very confusing and hard to decipher.[11] In spite of these misgivings, Fisher (1912) is a remarkable paper, given that it was written when Fisher was still an undergraduate. In fact, his idea of the likelihood function (27) played a central role in introducing and crystallizing some of the fundamental concepts in statistics.

The history of the minimum chi-squared (χ²) method of estimation is even more blurred. Karl Pearson and his associates routinely used the MM to estimate parameters and the χ² statistic (10) to test the adequacy of the fitted model. This state of affairs prompted Hald (1998, p.712) to comment: "One may wonder why he [Karl Pearson] did not take further step to minimizing χ2 for estimating the parameters." In fact, for a while, nobody took a concrete step in that direction. As discussed in Edwards (1997a), several early papers advocating this method of estimation could be mentioned: Harris (1912), Engledow and Yule (1914), Smith (1916) and Haldane (1919a, b). Ironically, it was the Fisher (1928) book and its subsequent editions that brought this estimation procedure to prominence. Smith (1916) was probably the first to state explicitly how to obtain parameter estimates using the minimum χ² method. She started with a mild criticism of Pearson's MM (p.11): "It is an undoubtedly utile and accurate method; but the question of whether it gives the 'best' values of the constant has not been very fully studied."[12]

[11] In Fisher (1922, p.326) he went further and confessed: "I must indeed plead guilty in my original statement in the Method of Maximum Likelihood [Fisher (1912)] to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasized the fact that such inverse probabilities were relative only." Aldrich (1997), Edwards (1997b) and Hald (1999) examined Fisher's paradoxical views on "inverse probability" in detail.

[12] Kirstine Smith had been a graduate student in Karl Pearson's laboratory since 1915. In fact her paper ends with the following acknowledgement: "The present paper was worked out in the Biometric Laboratory and I have to thank Professor Pearson for his aid throughout the work." It is quite understandable that she could not be too critical of Pearson's MM.
Then, without much fanfare, she stated (p.12): "From another standpoint, however, the 'best values' of the frequency constants may be said to be those for which" the quantity in (10) "is a minimum." She argued that when χ² is a minimum, "the probability of occurrence of a result as divergent as or more divergent than the observed, will be maximum." In other words, using the minimum χ² method, the "goodness-of-fit" might be better than that obtained from the MM. Using a slightly different notation, let us express (10) as

\[
\chi^{2}(\theta) = \sum_{j=1}^{k} \frac{[n_j - Nq_j(\theta)]^{2}}{Nq_j(\theta)}, \qquad (28)
\]

where $Nq_j(\theta)$ is the expected frequency of the j-th class and $\theta = (\theta_1,\theta_2,\ldots,\theta_p)'$ is the unknown parameter vector. We can write

\[
\chi^{2}(\theta) = \sum_{j=1}^{k} \frac{n_j^{2}}{Nq_j(\theta)} - N. \qquad (29)
\]

Therefore, the minimum χ² estimates will be obtained by solving $\partial\chi^{2}(\theta)/\partial\theta = 0$, i.e., from

\[
\sum_{j=1}^{k} \frac{n_j^{2}}{[Nq_j(\theta)]^{2}} \frac{\partial q_j(\theta)}{\partial\theta_l} = 0, \qquad l = 1,2,\ldots,p. \qquad (30)
\]

This is Smith's (1916, p.264) system of equations (1). Since "these equations will generally be far too involved to be directly solved," she approximated them around the MM estimates. Without going in that direction, let us connect these equations to Fisher's (1912) ML equations. Since $\sum_{j=1}^{k} q_j(\theta) = 1$, we have $\sum_{j=1}^{k} \partial q_j(\theta)/\partial\theta_l = 0$, and hence from (30) the minimum χ² estimating equations are

\[
\sum_{j=1}^{k} \frac{n_j^{2} - [Nq_j(\theta)]^{2}}{[Nq_j(\theta)]^{2}} \frac{\partial q_j(\theta)}{\partial\theta_l} = 0, \qquad l = 1,2,\ldots,p. \qquad (31)
\]

Under the multinomial framework, Fisher's likelihood function (27), denoted by $L(\theta)$, is

\[
L(\theta) = N! \prod_{j=1}^{k} (n_j!)^{-1} \prod_{j=1}^{k} [q_j(\theta)]^{n_j}. \qquad (32)
\]

Therefore, the log-likelihood function (22), denoted by $\ell(\theta)$, can be written as

\[
\ln L(\theta) = \ell(\theta) = \text{constant} + \sum_{j=1}^{k} n_j \ln q_j(\theta). \qquad (33)
\]

The corresponding ML estimating equations are $\partial\ell(\theta)/\partial\theta = 0$, i.e.,

\[
\sum_{j=1}^{k} \frac{n_j}{q_j(\theta)} \frac{\partial q_j(\theta)}{\partial\theta_l} = 0, \qquad (34)
\]
or, equivalently,

\[
\sum_{j=1}^{k} \frac{n_j - Nq_j(\theta)}{Nq_j(\theta)} \cdot \frac{\partial q_j(\theta)}{\partial\theta_l} = 0, \qquad l = 1,2,\ldots,p. \qquad (35)
\]

Fisher (1924a) argued that the difference between (31) and (35) is the factor $[n_j + Nq_j(\theta)]/Nq_j(\theta)$, which tends to the value 2 for large N, and that the two methods are therefore asymptotically equivalent.[13] (A hypothetical numerical illustration of this equivalence is sketched at the end of this section.) Some of Smith's (1916) numerical illustrations showed an improvement over the MM in terms of goodness-of-fit values (in her notation P) when the minimum χ² method was used. However, in her conclusion Smith (1916) provided only lukewarm support for the minimum χ² method.[14] It is, therefore, not surprising that this method remained dormant for a while, even after Neyman and Pearson (1928, pp.265-267) provided further theoretical justification. Neyman (1949) provided a comprehensive treatment of the χ² method of estimation and testing. Berkson (1980) revived the old debate, questioned the sovereignty of the MLE and argued that minimum χ² is the primary principle of estimation. However, the MLE procedure still remains one of the most important principles of estimation, and Fisher's idea of the likelihood plays the fundamental role in it. It can be said that, based on his 1912 paper, Ronald Fisher was able to contemplate much broader problems later in his research, which eventually culminated in his monumental paper of 1922. Because of the enormous importance of Fisher (1922) in the history of estimation, in the next section we provide a critical and detailed analysis of this paper.

[13] Note that to compare estimates from two different methods Fisher (1924a) used the "estimating equations" rather than the estimates. Using the estimating equations (31), Fisher (1924a) also showed that $\chi^2(\hat{\theta})$ has $k-p-1$ degrees of freedom instead of $k-1$ when the $p \times 1$ parameter vector θ is replaced by its estimator $\hat{\theta}$. In Section 4 we will discuss the important role estimating equations play.

[14] Part of her concluding remarks reads: ". . . the present numerical illustrations appear to indicate that but little practical advantage is gained by a great deal of additional labour, the values of P are only slightly raised–probably always within their range of probable error. In other words the investigation justifies the method of moments as giving excellent values of the constants with nearly the maximum value of P or it justifies the use of the method of moments, if the definition of 'best' by which that method is reached must at least be considered somewhat arbitrary." Given that the MM was then at the height of its popularity, and given Smith's position in Pearson's laboratory, it was difficult for her to make a strong recommendation for the minimum χ² method [see also footnote 12].
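The sketch below is a hypothetical illustration of the asymptotic equivalence, not anything computed by Fisher or Smith: the Binomial(4, θ) class-probability model, the sample size, and all names are assumptions made for the example. It minimizes (28) and maximizes (33) for the same set of multinomial counts and compares the two estimates.

```python
# A hypothetical check of Fisher's (1924a) equivalence claim: the minimum
# chi-squared estimate from (28) versus the maximum likelihood estimate from
# (33), for multinomial counts whose class probabilities q_j(theta) come from
# an assumed Binomial(4, theta) model.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

theta_true, k, N = 0.3, 5, 2000

def q(theta):
    # class probabilities q_j(theta), j = 1, ..., k
    return binom.pmf(np.arange(k), k - 1, theta)

rng = np.random.default_rng(1)
n = rng.multinomial(N, q(theta_true))        # observed class counts n_j

def chi2(theta):
    # equation (28)
    return np.sum((n - N * q(theta)) ** 2 / (N * q(theta)))

def neg_loglik(theta):
    # minus the variable part of equation (33)
    return -np.sum(n * np.log(q(theta)))

theta_chi2 = minimize_scalar(chi2, bounds=(0.01, 0.99), method="bounded").x
theta_ml = minimize_scalar(neg_loglik, bounds=(0.01, 0.99), method="bounded").x
print(theta_chi2, theta_ml)   # the two estimates are nearly identical here
```

For N as large as this the two estimates should essentially coincide; shrinking N (say, to 50) makes the difference between the estimating equations (31) and (35) visible.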
3 Fisher's (1922) mathematical foundations of theoretical statistics and further analysis of MM and ML estimation

If we had to name the single most important paper on the theoretical foundation of statistical estimation theory, we could safely mention Fisher (1922).[15] The ideas of this paper are simply revolutionary. It introduced many of the fundamental concepts in estimation, such as consistency, efficiency, sufficiency, information, likelihood and even the term "parameter" with its present meaning [see Stephen Stigler's comment on Savage (1976)].[16] Hald (1998, p.713) succinctly summarized the paper by saying: "For the first time in the history of statistics a framework for a frequency-based general theory of parametric statistical inference was clearly formulated."

In this paper (p.313) Fisher divided statistical problems into three clear types:

"(1) Problems of Specification. These arise in the choice of the mathematical form of the population.

(2) Problems of Estimation. These involve the choice of methods of calculating from a sample statistical derivates, or as we shall call them statistics, which are designed to estimate the values of the parameters of the hypothetical population.

(3) Problems of Distribution. These include discussions of the distribution of statistics derived from samples, or in general any functions of quantities whose distribution is known."

The formulation of general statistical problems into these three broad categories was not entirely new. Pearson (1902, p.266) mentioned the problems of (a) the "choice of a suitable curve", (b) the "determination of the constants" of the curve "when the form of the curve has

[15] The intervening period between Fisher's 1912 and 1922 papers represented years in the wilderness for Ronald Fisher. He contemplated and tried many different things, including joining the army and farming, but failed. However, by 1922 Fisher had attained the position of Chief Statistician at the Rothamsted Experimental Station. For more on this see Box (1978, ch.2) and Bera and Bilias (2000).

[16] Stigler (1976, p.498) commented that, "The point is that it is to Fisher that we owe the introduction of parametric statistical inference (and thus nonparametric inference). While there are other interpretations under which this statement can be defended, I mean it literally–Fisher was principally responsible for the introduction of the word "parameter" into present statistical terminology!" Stigler (1976, p.499) concluded his comment by saying ". . . for a measure of Fisher's influence on our field we need look no further than the latest issue of any statistical journal, and notice the ubiquitous "parameter." Fisher's concepts so permeate modern statistics, that we tend to overlook one of the most fundamental!"