To estimate $g(x;\theta)$, it suffices to estimate two unknown parameters $\mu$ and $\sigma^2$. Based on the random sample $\{X_t\}_{t=1}^T$, we can obtain the maximum likelihood estimators (MLE)
$$
\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} X_t, \qquad \hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T} (X_t - \hat{\mu})^2 .
$$
The approach taken here is called a parametric approach; that is, it assumes that the unknown PDF has a known functional form up to some unknown parameters. It can be shown that the parameter estimator $\hat{\theta}$ converges to the unknown parameter value $\theta_0$ at a root-$T$ convergence rate in the sense that
$$
\sqrt{T}(\hat{\theta} - \theta_0) = O_P(1), \quad \text{or} \quad \hat{\theta} - \theta_0 = O_P(T^{-1/2}),
$$
where $\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2)'$, $\theta_0 = (\mu_0, \sigma_0^2)'$, and $O_P(1)$ denotes boundedness in probability. The root-$T$ convergence rate is called the parametric convergence rate for $\hat{\theta}$ and $g(x;\hat{\theta})$. As we will see below, nonparametric density estimators will have a slower convergence rate.
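As a concrete illustration of this plug-in idea, the following minimal Python sketch (the function names and simulated data are our own, not part of the text, and we take $g(x;\theta)$ to be the $N(\mu,\sigma^2)$ density, which is consistent with the MLE formulas above) computes $\hat{\mu}$ and $\hat{\sigma}^2$ from a sample and evaluates the parametric density estimate $g(x;\hat{\theta})$.

```python
import numpy as np

def gaussian_mle_density(X):
    """Return the MLEs and the plug-in N(mu_hat, sigma2_hat) density."""
    mu_hat = X.mean()                          # mu_hat = T^{-1} sum_t X_t
    sigma2_hat = ((X - mu_hat) ** 2).mean()    # sigma2_hat = T^{-1} sum_t (X_t - mu_hat)^2
    def g_hat(x):
        return np.exp(-(x - mu_hat) ** 2 / (2 * sigma2_hat)) / np.sqrt(2 * np.pi * sigma2_hat)
    return mu_hat, sigma2_hat, g_hat

# Example: a simulated i.i.d. N(0, 1) sample of size T = 500
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=500)
mu_hat, sigma2_hat, g_hat = gaussian_mle_density(X)
print(mu_hat, sigma2_hat, g_hat(0.0))
```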
Question: What is the definition of $O_P(\delta_T)$?

Let $\{\delta_T, T \geq 1\}$ be a sequence of positive numbers. A random variable $Y_T$ is said to be at most of order $\delta_T$ in probability, written $Y_T = O_P(\delta_T)$, if the sequence $\{Y_T/\delta_T, T \geq 1\}$ is tight, that is, if
$$
\lim_{\lambda \to \infty} \limsup_{T \to \infty} P\left(|Y_T/\delta_T| > \lambda\right) = 0 .
$$
Tightness is usually indicated by writing $Y_T/\delta_T = O_P(1)$.

Question: What is the advantage of the parametric approach?

By the mean-value theorem, we obtain
$$
g(x;\hat{\theta}) - g(x) = g(x;\theta_0) - g(x) + \frac{\partial}{\partial \theta} g(x;\bar{\theta})\,(\hat{\theta} - \theta_0)
= 0 + \frac{1}{\sqrt{T}}\,\frac{\partial}{\partial \theta} g(x;\bar{\theta})\,\sqrt{T}(\hat{\theta} - \theta_0)
= 0 + O_P(T^{-1/2}) = O_P(T^{-1/2}),
$$
where $\bar{\theta}$ lies between $\hat{\theta}$ and $\theta_0$. Intuitively, the first term, $g(x;\theta_0) - g(x)$, is the bias of the density estimator $g(x;\hat{\theta})$, which is zero if the assumption of correct model specification holds. The second term, $\frac{\partial}{\partial \theta} g(x;\bar{\theta})(\hat{\theta} - \theta_0)$, is due to the sampling error of the estimator $\hat{\theta}$, which is unavoidable no matter whether the density model $g(x;\theta)$ is correctly specified. This term converges to zero in probability at the parametric root-$T$ rate.
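The root-$T$ rate can be seen in a small simulation. The sketch below is our own illustration, assuming an i.i.d. $N(\mu_0, \sigma_0^2)$ sample: as $T$ grows, $\hat{\mu} - \mu_0$ shrinks while the scaled error $\sqrt{T}(\hat{\mu} - \mu_0)$ stays within a stable range, consistent with $\hat{\mu} - \mu_0 = O_P(T^{-1/2})$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma0 = 0.0, 1.0

# For each sample size, the raw error mu_hat - mu0 shrinks toward zero,
# while sqrt(T) * (mu_hat - mu0) fluctuates within a bounded range.
for T in [100, 1_000, 10_000, 100_000]:
    X = rng.normal(mu0, sigma0, size=T)
    mu_hat = X.mean()
    print(T, mu_hat - mu0, np.sqrt(T) * (mu_hat - mu0))
```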
Question: What happens if the correct model specification assumption fails? That is, what happens if $g(x;\theta) \neq g(x)$ for all $\theta$?

When the density model $g(x;\theta)$ is not correctly specified for the unknown PDF $g(x)$, the estimator $g(x;\hat{\theta})$ will not be consistent for $g(x)$, because the bias $g(x;\theta^*) - g(x)$ never vanishes no matter how large the sample size $T$ is, where $\theta^* = \operatorname{plim} \hat{\theta}$.
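To illustrate this point, the sketch below (our own example; the mixture data-generating process is hypothetical and not from the text) fits a normal density by MLE to data drawn from a two-component normal mixture. The bias $g(x;\theta^*) - g(x)$ at a fixed point settles near a nonzero constant rather than vanishing as $T$ grows.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

rng = np.random.default_rng(2)
x = 0.0  # evaluation point

# True DGP: an equal-weight mixture of N(-2, 1) and N(2, 1); its density at x = 0
g_true = 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)

for T in [1_000, 10_000, 100_000]:
    comp = rng.integers(0, 2, size=T)
    X = rng.normal(np.where(comp == 0, -2.0, 2.0), 1.0)
    mu_hat, sigma2_hat = X.mean(), X.var()   # converge to the pseudo-true values 0 and 5
    bias = normal_pdf(x, mu_hat, sigma2_hat) - g_true
    print(T, bias)  # the bias settles near a nonzero constant instead of shrinking to zero
```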
We now introduce a nonparametric estimation method for $g(x)$ which does not assume any restrictive functional form for $g(x)$. Instead, it lets the data speak for the correct functional form of $g(x)$.

2.1.1 Kernel Density Estimator

Kernel smoothing is a kind of local smoothing. The purpose of nonparametric probability density estimation is to construct an estimate of a PDF without imposing restrictive functional form assumptions. Typically the only condition imposed on the unknown PDF is that its first two derivatives exist and are bounded. In this circumstance, we may use only local information about the value of the PDF at any given point in the support. That is, the value of the PDF at a point $x$ must be calculated from data values that lie in a neighborhood of $x$, and to ensure consistency the neighborhood must shrink to zero as the sample size $T$ increases. In the case of kernel density estimation, the radius of the effective neighborhood is roughly equal to the so-called "bandwidth" of a kernel density estimator, which is essentially a smoothing parameter. Under the assumption that the PDF is univariate with its first two derivatives bounded, and using a nonnegative kernel function, the size of the bandwidth that optimizes the performance of the estimator in terms of the mean squared error (MSE) criterion is proportional to $T^{-1/5}$. The number of "parameters" needed to model the unknown PDF within a given interval is approximately equal to the number of bandwidths that can be fitted into that interval, and so is roughly of size $T^{1/5}$. Thus, nonparametric density estimation involves the adaptive fitting of approximately $T^{1/5}$ parameters, with this number growing with the sample size $T$.
Suppose we are interested in estimating the value of the PDF $g(x)$ at a given point $x$ in the support of $X_t$. There are two basic instruments in kernel estimation: the kernel function $K(\cdot)$ and the bandwidth $h$. Intuitively, the former assigns weights to the observations in an interval containing the point $x$, and the latter controls the size of that interval.

We first introduce an important instrument for local smoothing. This is called a kernel function.

Definition [Second Order Kernel $K(\cdot)$]: A second order or positive kernel function $K(\cdot)$ is a pre-specified symmetric PDF such that
(1) $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(2) $\int_{-\infty}^{\infty} u K(u)\,du = 0$;
(3) $\int_{-\infty}^{\infty} u^2 K(u)\,du = C_K < \infty$;
(4) $\int_{-\infty}^{\infty} K^2(u)\,du = D_K < \infty$.

Intuitively, the kernel function $K(\cdot)$ is a weighting function that "discounts" the observations whose values are farther away from the point $x$ of interest.

Kernel functions satisfying the above conditions are called second order or positive kernels. It should be emphasized that the kernel $K(\cdot)$ has nothing to do with the unknown PDF $g(x)$ of $\{X_t\}$; it is just a weighting function for the observations when constructing a kernel density estimator. More generally, we can define a $q$-th order kernel $K(\cdot)$, where $q \geq 2$.
Definition [$q$th Order Kernel]: $K(\cdot)$ satisfies the conditions that
(1) $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(2) $\int_{-\infty}^{\infty} u^j K(u)\,du = 0$ for $1 \leq j \leq q-1$;
(3) $\int_{-\infty}^{\infty} u^q K(u)\,du < \infty$;
(4) $\int_{-\infty}^{\infty} K^2(u)\,du < \infty$.

For a higher order kernel (i.e., $q > 2$), $K(\cdot)$ will take negative values at some points.

Question: Why is a higher order kernel useful? Can you give an example of a third order kernel? And an example of a fourth order kernel?

Higher order kernels can reduce the bias of a kernel estimator to a higher order. An example of higher order kernels is given in Robinson (1991).
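As a tentative answer to the last part of the question, one commonly used fourth-order kernel is built from the standard normal PDF $\phi(u)$ as $K(u) = \frac{1}{2}(3 - u^2)\phi(u)$; note that it is negative for $|u| > \sqrt{3}$, in line with the remark above. The sketch below is our own numerical check (not from the text) of the $q$th order conditions for $q = 4$.

```python
import numpy as np
from scipy.integrate import quad

phi = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal PDF
K4 = lambda u: 0.5 * (3.0 - u ** 2) * phi(u)                # candidate fourth-order kernel

# Moments for j = 0, ..., 4 should be 1, 0, 0, 0 and a nonzero number (here -3).
for j in range(5):
    moment, _ = quad(lambda u: u ** j * K4(u), -np.inf, np.inf)
    print(j, round(moment, 6))

# Condition (4): the integral of K^2 is finite.
print(quad(lambda u: K4(u) ** 2, -np.inf, np.inf)[0])
```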
We now consider some examples of second order kernels:

• Uniform kernel
$$ K(u) = \frac{1}{2}\, 1(|u| \leq 1); $$

• Gaussian kernel
$$ K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}u^2\right), \quad -\infty < u < \infty; $$

• Epanechnikov kernel
$$ K(u) = \frac{3}{4}(1 - u^2)\, 1(|u| \leq 1); $$

• Quartic kernel
$$ K(u) = \frac{15}{16}(1 - u^2)^2\, 1(|u| \leq 1). $$

Among these kernels, the Gaussian kernel has unbounded support, while all the other kernels have the bounded support $[-1, 1]$. Also, the uniform kernel assigns an equal weight to all observations within its support; in contrast, all the other kernels have a downward weighting scheme.
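The following sketch (our own illustration) codes the four kernels above and numerically checks conditions (1)-(4) of the second order kernel definition, reporting $C_K$ and $D_K$ for each.

```python
import numpy as np
from scipy.integrate import quad

kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "gaussian":     lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi),
    "epanechnikov": lambda u: 0.75 * (1 - u ** 2) * (np.abs(u) <= 1),
    "quartic":      lambda u: (15 / 16) * (1 - u ** 2) ** 2 * (np.abs(u) <= 1),
}

# Conditions (1)-(4): integrals of K, u*K, u^2*K (= C_K) and K^2 (= D_K).
# The range [-5, 5] covers the bounded supports and essentially all Gaussian mass;
# points=[-1, 1] helps quad handle the discontinuities of the indicator kernels.
for name, K in kernels.items():
    total = quad(K, -5, 5, points=[-1, 1])[0]
    mean  = quad(lambda u: u * K(u), -5, 5, points=[-1, 1])[0]
    C_K   = quad(lambda u: u ** 2 * K(u), -5, 5, points=[-1, 1])[0]
    D_K   = quad(lambda u: K(u) ** 2, -5, 5, points=[-1, 1])[0]
    print(f"{name:13s} total={total:.3f} mean={mean:+.3f} C_K={C_K:.3f} D_K={D_K:.3f}")
```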
Question: How does the kernel method work?

Let $x$ be a fixed point in the support of $X_t$. Given a pre-chosen second order kernel $K(u)$, we define a kernel density estimator for $g(x)$ based on the random sample $\{X_t\}_{t=1}^T$:
$$
\hat{g}(x) = T^{-1}\sum_{t=1}^{T} K_h(x - X_t)
= \frac{1}{T}\sum_{t=1}^{T} \frac{1}{h} K\!\left(\frac{x - X_t}{h}\right)
= \frac{1}{h}\int_{-\infty}^{\infty} K\!\left(\frac{x - y}{h}\right) d\hat{F}(y),
$$
where $K_h(u) = \frac{1}{h}K\!\left(\frac{u}{h}\right)$, $h = h(T) > 0$ is called a bandwidth or a window size, and $\hat{F}(y) = T^{-1}\sum_{t=1}^{T} 1(X_t \leq y)$ is the marginal empirical distribution function of the random sample $\{X_t\}_{t=1}^T$. This is exactly the same as the estimator introduced in Chapter 3; it was first proposed by Rosenblatt (1956) and Parzen (1962) and so is also called the Rosenblatt-Parzen kernel density estimator.
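A minimal implementation of the Rosenblatt-Parzen estimator is sketched below (our own code, not from the text); the rule-of-thumb bandwidth proportional to $T^{-1/5}$ is used only for illustration.

```python
import numpy as np

def kernel_density_estimate(x, X, h, K=None):
    """Rosenblatt-Parzen estimator: g_hat(x) = T^{-1} sum_t K_h(x - X_t), K_h(u) = K(u/h)/h."""
    if K is None:
        K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)   # default: Gaussian kernel
    x = np.atleast_1d(x)
    u = (x[:, None] - X[None, :]) / h      # (x - X_t) / h for every evaluation point
    return K(u).mean(axis=1) / h           # average of K_h(x - X_t) over t

# Example: estimate the density of a simulated N(0, 1) sample on a grid,
# with a rough rule-of-thumb bandwidth proportional to T^{-1/5}.
rng = np.random.default_rng(3)
X = rng.normal(size=1_000)
h = 1.06 * X.std() * len(X) ** (-1 / 5)
grid = np.linspace(-3, 3, 7)
print(kernel_density_estimate(grid, X, h))
```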
We see immediately that the well-known histogram is a special case of the kernel density estimator $\hat{g}(x)$ with the choice of a uniform kernel.

Example 1 [Histogram]: If $K(u) = \frac{1}{2}1(|u| \leq 1)$, then
$$
\hat{g}(x) = \frac{1}{2hT}\sum_{t=1}^{T} 1(|x - X_t| \leq h).
$$
Intuitively, with the choice of a uniform kernel, the kernel density estimator $\hat{g}(x)$ is the relative sample frequency of the observations on the interval $[x-h, x+h]$, which centers at the point $x$ and has length $2h$. Here, $2hT$ is approximately the sample size of the small interval $[x-h, x+h]$ when the length $2h$ is small enough. Alternatively, $T^{-1}\sum_{t=1}^{T} 1(|x - X_t| \leq h)$ is the relative sample frequency of the observations falling into the small interval $[x-h, x+h]$, which, by the law of large numbers, is approximately equal to the probability
$$
E\left[1(|x - X_t| \leq h)\right] = P(x - h \leq X_t \leq x + h) = \int_{x-h}^{x+h} g(y)\,dy \approx 2h\, g(x)
$$
if $h$ is small enough and $g(x)$ is continuous around the point $x$. Thus, the histogram is a reasonable estimator for $g(x)$, and indeed it is a consistent estimator of $g(x)$ if $h$ vanishes to zero more slowly than $T^{-1}$ as the sample size $T$ goes to infinity, so that $Th \to \infty$.
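The sketch below (our own illustration) computes this uniform-kernel estimator directly from the relative-frequency formula, confirms that it coincides with the general kernel formula evaluated with the uniform kernel, and compares both with the true $N(0,1)$ density at $x = 0$ for simulated data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=5_000)
x, h = 0.0, 0.2
T = len(X)

# Histogram-type estimator: count of observations in [x-h, x+h], divided by 2hT.
g_hist = np.sum(np.abs(x - X) <= h) / (2 * h * T)

# The same number obtained from the general kernel formula with the uniform kernel.
K = lambda u: 0.5 * (np.abs(u) <= 1)
g_kernel = np.mean(K((x - X) / h)) / h

print(g_hist, g_kernel, 1 / np.sqrt(2 * np.pi))  # both match; last value is the true density at 0
```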
Question: Under what conditions will the density estimator $\hat{g}(x)$ be consistent for the unknown density function $g(x)$?

We impose an assumption on the data generating process and the unknown PDF $g(x)$.

Assumption 3.1 [Smoothness of PDF]: (i) $\{X_t\}$ is a strictly stationary process with marginal PDF $g(x)$; (ii) $g(x)$ has a bounded support $[a, b]$ and is continuously twice differentiable on $[a, b]$, with $g''(\cdot)$ being Lipschitz-continuous in the sense that $|g''(x_1) - g''(x_2)| \leq C|x_1 - x_2|$ for all $x_1, x_2 \in [a, b]$, where $a$, $b$ and $C$ are finite constants.

Question: How to define the derivatives at the boundary points?