To estimate $g(x,\theta)$, it suffices to estimate two unknown parameters $\mu$ and $\sigma^2$. Based on the random sample $\{X_t\}_{t=1}^T$, we can obtain the maximum likelihood estimators (MLE)
$$
\hat\mu = \frac{1}{T}\sum_{t=1}^T X_t, \qquad \hat\sigma^2 = \frac{1}{T}\sum_{t=1}^T (X_t - \hat\mu)^2.
$$
The approach taken here is called a parametric approach, that is, assuming that the unknown PDF is a known functional form up to some unknown parameters. It can be shown that the parameter estimator $\hat\theta$ converges to the unknown parameter value $\theta_0$ at a root-$T$ convergence rate in the sense that $\sqrt{T}(\hat\theta - \theta_0) = O_P(1)$, or $\hat\theta - \theta_0 = O_P(T^{-1/2})$, where $\hat\theta = (\hat\mu, \hat\sigma^2)'$, $\theta_0 = (\mu_0, \sigma_0^2)'$, and $O_P(1)$ denotes boundedness in probability. The root-$T$ convergence rate is called the parametric convergence rate for $\hat\theta$ and $g(x,\hat\theta)$. As we will see below, nonparametric density estimators will have a slower convergence rate.

Question: What is the definition of $O_P(\delta_T)$?

Let $\{\delta_T, T \geq 1\}$ be a sequence of positive numbers. A random variable $Y_T$ is said to be at most of order $\delta_T$ in probability, written $Y_T = O_P(\delta_T)$, if the sequence $\{Y_T/\delta_T, T \geq 1\}$ is tight, that is, if
$$
\lim_{\lambda \to \infty} \limsup_{T \to \infty} P\left(|Y_T/\delta_T| > \lambda\right) = 0.
$$
Tightness is usually indicated by writing $Y_T/\delta_T = O_P(1)$.

Question: What is the advantage of the parametric approach?

By the mean-value theorem, we obtain
$$
\begin{aligned}
g(x,\hat\theta) - g(x) &= g(x,\theta_0) - g(x) + \frac{\partial}{\partial\theta} g(x,\bar\theta)(\hat\theta - \theta_0) \\
&= 0 + \frac{1}{\sqrt{T}}\frac{\partial}{\partial\theta} g(x,\bar\theta)\,\sqrt{T}(\hat\theta - \theta_0) \\
&= 0 + O_P(T^{-1/2}) \\
&= O_P(T^{-1/2}),
\end{aligned}
$$
where $\bar\theta$ lies between $\hat\theta$ and $\theta_0$. Intuitively, the first term, $g(x,\theta_0) - g(x)$, is the bias of the density estimator $g(x,\hat\theta)$, which is zero if the assumption of correct model specification holds. The second term, $\frac{\partial}{\partial\theta} g(x,\bar\theta)(\hat\theta - \theta_0)$, is due to the sampling error of the estimator $\hat\theta$, which is unavoidable no matter whether the density model $g(x,\theta)$ is correctly specified. This term converges to zero in probability at the parametric root-$T$ rate.
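To make the parametric plug-in approach concrete, here is a minimal sketch in Python (the Gaussian model follows the example above; the simulated data, sample size and function names are illustrative only): it computes the MLE $(\hat\mu,\hat\sigma^2)$ and evaluates the plug-in density estimate $g(x,\hat\theta)$.

```python
import numpy as np

def gaussian_mle(x):
    """MLE of (mu, sigma^2) for a Gaussian sample: sample mean and variance with divisor T."""
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()
    return mu_hat, sigma2_hat

def plug_in_density(x_grid, mu_hat, sigma2_hat):
    """Parametric plug-in estimate g(x, theta_hat) under the N(mu, sigma^2) model."""
    return np.exp(-(x_grid - mu_hat) ** 2 / (2 * sigma2_hat)) / np.sqrt(2 * np.pi * sigma2_hat)

# Illustration with simulated data (T = 500 is arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=500)
mu_hat, sigma2_hat = gaussian_mle(X)
print(mu_hat, sigma2_hat)                                  # close to (1.0, 4.0)
print(plug_in_density(np.array([0.0, 1.0, 2.0]), mu_hat, sigma2_hat))
```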
Question: What happens if the correct model specification assumption fails? That is, what happens if $g(x,\theta) \neq g(x)$ for all $\theta$?

When the density model $g(x,\theta)$ is not correctly specified for the unknown PDF $g(x)$, the estimator $g(x,\hat\theta)$ will not be consistent for $g(x)$, because the bias $g(x,\theta^*) - g(x)$ never vanishes no matter how large the sample size $T$ is, where $\theta^* = \operatorname{plim}\hat\theta$.

We now introduce a nonparametric estimation method for $g(x)$ which will not assume any restrictive functional form for $g(x)$. Instead, it lets the data speak for the correct functional form of $g(x)$.

2.1.1 Kernel Density Estimator

Kernel smoothing is a kind of local smoothing. The purpose of nonparametric probability density estimation is to construct an estimate of a PDF without imposing restrictive functional form assumptions. Typically the only condition imposed on the unknown PDF is that its first two derivatives exist and are bounded. In this circumstance, we may use only local information about the value of the PDF at any given point in the support. That is, the value of the PDF at a point $x$ must be calculated from data values that lie in a neighborhood of $x$, and to ensure consistency the neighborhood must shrink to zero as the sample size $T$ increases. In the case of kernel density estimation, the radius of the effective neighborhood is roughly equal to the so-called "bandwidth" of a kernel density estimator, which is essentially a smoothing parameter. Under the assumption that the PDF is univariate with two bounded derivatives, and using a nonnegative kernel function, the size of the bandwidth that optimizes the performance of the estimator in terms of the mean squared error (MSE) criterion is proportional to $T^{-1/5}$. The number of "parameters" needed to model the unknown PDF within a given interval is approximately equal to the number of bandwidths that can be fitted into that interval, and so is roughly of size $T^{1/5}$. Thus, nonparametric density estimation involves the adaptive fitting of approximately $T^{1/5}$ parameters, with this number growing with the sample size $T$.
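The $T^{-1/5}$ rate only pins down how the bandwidth should shrink with the sample size; a concrete proportionality constant still has to be chosen. One commonly used choice consistent with this rate (a standard rule of thumb, not a prescription of this text) is Silverman's rule for a Gaussian kernel, sketched below.

```python
import numpy as np

def rule_of_thumb_bandwidth(x):
    """Rule-of-thumb bandwidth h = 1.06 * sigma_hat * T^(-1/5) (Silverman, Gaussian kernel)."""
    T = x.shape[0]
    return 1.06 * x.std(ddof=0) * T ** (-1 / 5)

# The bandwidth shrinks at the rate T^(-1/5) as the sample size grows.
rng = np.random.default_rng(0)
for T in (100, 1_000, 10_000):
    h = rule_of_thumb_bandwidth(rng.normal(size=T))
    print(T, round(h, 4), round(h * T ** (1 / 5), 4))   # h * T^(1/5) stays roughly constant
```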
Suppose we are interested in estimating the value of the PDF $g(x)$ at a given point $x$ in the support of $X_t$. There are two basic instruments in kernel estimation: the kernel function $K(\cdot)$ and the bandwidth $h$. Intuitively, the former assigns weights to the observations in an interval containing the point $x$, and the latter controls the size of the interval containing observations.

We first introduce an important instrument for local smoothing. This is called a kernel function.

Definition [Second Order Kernel $K(\cdot)$]: A second order or positive kernel function $K(\cdot)$ is a pre-specified symmetric PDF such that
(1) $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(2) $\int_{-\infty}^{\infty} u\,K(u)\,du = 0$;
(3) $\int_{-\infty}^{\infty} u^2 K(u)\,du = C_K < \infty$;
(4) $\int_{-\infty}^{\infty} K^2(u)\,du = D_K < \infty$.

Intuitively, the kernel function $K(\cdot)$ is a weighting function that will "discount" the observations whose values are farther away from the point $x$ of interest.

Kernel functions satisfying the above conditions are called second order or positive kernels. It should be emphasized that the kernel $K(\cdot)$ has nothing to do with the unknown PDF $g(x)$ of $\{X_t\}$; it is just a weighting function for observations when constructing a kernel density estimator. More generally, we can define a $q$-th order kernel $K(\cdot)$, where $q \geq 2$.

Definition [$q$th Order Kernel]: $K(\cdot)$ satisfies the conditions that
(1) $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(2) $\int_{-\infty}^{\infty} u^j K(u)\,du = 0$ for $1 \leq j \leq q-1$;
(3) $\int_{-\infty}^{\infty} u^q K(u)\,du < \infty$;
(4) $\int_{-\infty}^{\infty} K^2(u)\,du < \infty$.

For a higher order kernel (i.e., $q > 2$), $K(\cdot)$ will take negative values at some points.

Question: Why is a higher order kernel useful? Can you give an example of a third order kernel? And an example of a fourth order kernel?

Higher order kernels can reduce the bias of a kernel estimator to a higher order. An example of higher order kernels is given in Robinson (1991).
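As a quick numerical sanity check of these definitions (an illustration only, not part of the text), the moment conditions can be verified by quadrature. The standard normal density satisfies the second order conditions, while $K(u)=\tfrac{1}{2}(3-u^2)\phi(u)$, a well-known Gaussian-based kernel that is not taken from this text, satisfies the fourth order conditions because its second and third moments vanish.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kernel_moments(K, j_max=4):
    """Integrals of u^j * K(u) for j = 0, ..., j_max, plus the integral of K(u)^2."""
    moments = [quad(lambda u, j=j: u ** j * K(u), -np.inf, np.inf)[0] for j in range(j_max + 1)]
    roughness = quad(lambda u: K(u) ** 2, -np.inf, np.inf)[0]
    return moments, roughness

gaussian = norm.pdf                       # a second order kernel

def fourth_order(u):
    # A standard fourth order kernel (illustrative, not from the text): 0.5 * (3 - u^2) * phi(u).
    return 0.5 * (3.0 - u ** 2) * norm.pdf(u)

print(kernel_moments(gaussian))       # moments approx [1, 0, 1, 0, 3]; roughness approx 0.2821
print(kernel_moments(fourth_order))   # moments approx [1, 0, 0, 0, -3]: conditions hold with q = 4
```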
We now consider some examples of second order kernels:

• Uniform kernel
$$
K(u) = \frac{1}{2}\,\mathbf{1}(|u| \leq 1);
$$
• Gaussian kernel
$$
K(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}u^2\right), \quad -\infty < u < \infty;
$$
• Epanechnikov kernel
$$
K(u) = \frac{3}{4}(1-u^2)\,\mathbf{1}(|u| \leq 1);
$$
• Quartic kernel
$$
K(u) = \frac{15}{16}(1-u^2)^2\,\mathbf{1}(|u| \leq 1).
$$

Among these kernels, the Gaussian kernel has unbounded support, while all the other kernels have the bounded support $[-1,1]$. Also, the uniform kernel assigns an equal weight within its support; in contrast, all the other kernels assign weights that decrease with the distance from the center.

Question: How does the kernel method work?

Let $x$ be a fixed point in the support of $X_t$. Given a pre-chosen second order kernel $K(u)$, we define a kernel density estimator for $g(x)$ based on the random sample $\{X_t\}_{t=1}^T$:
$$
\begin{aligned}
\hat g(x) &= T^{-1}\sum_{t=1}^T K_h(x - X_t) \\
&= \frac{1}{T}\sum_{t=1}^T \frac{1}{h}K\left(\frac{x - X_t}{h}\right) \\
&= \frac{1}{h}\int_{-\infty}^{\infty} K\left(\frac{x-y}{h}\right) d\hat F(y),
\end{aligned}
$$
where $K_h(u) = \frac{1}{h}K\left(\frac{u}{h}\right)$, $h = h(T) > 0$ is called a bandwidth or a window size, and $\hat F(y) = T^{-1}\sum_{t=1}^T \mathbf{1}(X_t \leq y)$ is the marginal empirical distribution function of the random sample $\{X_t\}_{t=1}^T$. This is exactly the same as the estimator introduced in Chapter 3, and it was first proposed by Rosenblatt (1956) and Parzen (1962), and so is also called the Rosenblatt-Parzen kernel density estimator.
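The following minimal sketch implements $\hat g(x) = T^{-1}\sum_{t=1}^T K_h(x - X_t)$ directly from this definition, with the second order kernels listed above (the simulated data and the bandwidth value are arbitrary illustrations; bandwidth choice is discussed later).

```python
import numpy as np

# Second order kernels from the text.
def uniform(u):
    return 0.5 * (np.abs(u) <= 1)

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

def quartic(u):
    return (15 / 16) * (1 - u ** 2) ** 2 * (np.abs(u) <= 1)

def kde(x_grid, X, h, K=gaussian):
    """Rosenblatt-Parzen estimator: g_hat(x) = (1/T) * sum_t K((x - X_t)/h) / h."""
    u = (np.asarray(x_grid)[:, None] - np.asarray(X)[None, :]) / h   # shape (n_grid, T)
    return K(u).mean(axis=1) / h

# Illustration: estimate the density of a simulated N(0, 1) sample at a few points.
rng = np.random.default_rng(0)
X = rng.normal(size=1_000)
h = 1.06 * X.std() * X.shape[0] ** (-1 / 5)            # illustrative rule-of-thumb bandwidth
print(kde([-1.0, 0.0, 1.0], X, h))                      # roughly (0.24, 0.40, 0.24)
print(kde([-1.0, 0.0, 1.0], X, h, K=epanechnikov))
```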
We see immediately that the well-known histogram is a special case of the kernel density estimator $\hat g(x)$ with the choice of a uniform kernel.

Example 1 [Histogram]: If $K(u) = \frac{1}{2}\mathbf{1}(|u| \leq 1)$, then
$$
\hat g(x) = \frac{1}{2hT}\sum_{t=1}^T \mathbf{1}(|x - X_t| \leq h).
$$
Intuitively, with the choice of a uniform kernel, the kernel density estimator $\hat g(x)$ is the relative sample frequency of the observations on the interval $[x-h, x+h]$, which centers at the point $x$ and has a size of $2h$ (see also the sketch at the end of this subsection). Here, $2hT$ is approximately the sample size of the small interval $[x-h, x+h]$ when the size $2h$ is small enough. Alternatively, $T^{-1}\sum_{t=1}^T \mathbf{1}(|x - X_t| \leq h)$ is the relative sample frequency of the observations falling into the small interval $[x-h, x+h]$, which, by the law of large numbers, is approximately equal to the probability
$$
E\left[\mathbf{1}(|x - X_t| \leq h)\right] = P(x - h \leq X_t \leq x + h) = \int_{x-h}^{x+h} g(y)\,dy \approx 2h\,g(x)
$$
if $h$ is small enough and $g(x)$ is continuous around the point $x$. Thus, the histogram is a reasonable estimator for $g(x)$, and indeed it is a consistent estimator of $g(x)$ if $h$ vanishes to zero, but more slowly than $T^{-1}$, as the sample size $T$ goes to infinity.

Question: Under what conditions will the density estimator $\hat g(x)$ be consistent for the unknown density function $g(x)$?

We impose an assumption on the data generating process and the unknown PDF $g(x)$.

Assumption 3.1 [Smoothness of PDF]: (i) $\{X_t\}$ is a strictly stationary process with marginal PDF $g(x)$; (ii) $g(x)$ has a bounded support $[a,b]$, and is continuously twice differentiable on $[a,b]$, with $g''(\cdot)$ being Lipschitz-continuous in the sense that $|g''(x_1) - g''(x_2)| \leq C|x_1 - x_2|$ for all $x_1, x_2 \in [a,b]$, where $a$, $b$ and $C$ are finite constants.

Question: How to define the derivatives at the boundary points?
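Referring back to Example 1, the following sketch checks numerically that the kernel estimator with the uniform kernel coincides with the relative-frequency formula $(2hT)^{-1}\sum_{t=1}^T \mathbf{1}(|x - X_t| \leq h)$ (the data, evaluation point and bandwidth are arbitrary illustrations).

```python
import numpy as np

def uniform_kernel_kde(x, X, h):
    """Kernel estimator with the uniform kernel K(u) = 0.5 * 1(|u| <= 1)."""
    u = (x - X) / h
    return np.mean(0.5 * (np.abs(u) <= 1)) / h

rng = np.random.default_rng(0)
X = rng.normal(size=1_000)
h, x = 0.3, 0.5

histogram_value = np.sum(np.abs(x - X) <= h) / (2 * h * X.shape[0])
print(np.isclose(histogram_value, uniform_kernel_kde(x, X, h)))   # True: the two formulas agree
```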