In general, we do not assume that the functional form of $r_1(x)$ is known, except that we still maintain the assumption that $r_1(x)$ is a square-integrable function. Because $r_1(x)$ is square-integrable, we have

$$
\begin{aligned}
\int_{-\infty}^{\infty} r_1^2(x)\,dx
&= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} \alpha_j \alpha_k \int_{-\infty}^{\infty} \psi_j(x)\psi_k(x)\,dx \\
&= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} \alpha_j \alpha_k \delta_{j,k} \quad \text{by orthonormality} \\
&= \sum_{j=0}^{\infty} \alpha_j^2 < \infty,
\end{aligned}
$$

where $\delta_{j,k}$ is the Kronecker delta function: $\delta_{j,k} = 1$ if $j = k$ and $0$ otherwise.

Square summability implies $\alpha_j \to 0$ as $j \to \infty$; that is, $\alpha_j$ becomes less important as the order $j \to \infty$. This suggests that a truncated sum

$$
r_{1p}(x) = \sum_{j=0}^{p} \alpha_j \psi_j(x)
$$

can be used to approximate $r_1(x)$ arbitrarily well if $p$ is sufficiently large. The approximation error, or the bias,

$$
b_p(x) \equiv r_1(x) - r_{1p}(x) = \sum_{j=p+1}^{\infty} \alpha_j \psi_j(x) \to 0 \quad \text{as } p \to \infty.
$$

However, the coefficient $\alpha_j$ is unknown. To obtain a feasible estimator for $r_1(x)$, we consider the following sequence of truncated regression models

$$
X_t = \sum_{j=0}^{p} \beta_j \psi_j(X_{t-1}) + \varepsilon_{pt},
$$

where $p \equiv p(T) \to \infty$ is the number of series terms, which depends on the sample size $T$. We need $p/T \to 0$ as $T \to \infty$, i.e., the number of series terms $p$ is much smaller than the sample size $T$. Note that the regression error $\varepsilon_{pt}$ is not the same as the true innovation $\varepsilon_t$ for each given $p$. Instead, it contains the true innovation $\varepsilon_t$ and the bias $b_p(X_{t-1})$.
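The behavior of the truncation bias can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a bounded support $[0,1]$, the orthonormal cosine basis $\psi_0(x)=1$, $\psi_j(x)=\sqrt{2}\cos(j\pi x)$, and a hypothetical target function $r_1(x)=e^x$. It computes the coefficients $\alpha_j$ by numerical integration, compares $\sum_j \alpha_j^2$ with $\int r_1^2(x)\,dx$, and evaluates the integrated squared bias for several truncation orders $p$.

```python
import numpy as np

def psi(j, x):
    """Orthonormal cosine basis on [0, 1] (an assumed choice of basis)."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(j * np.pi * x)

def r1(x):
    """Hypothetical target function, chosen only for illustration."""
    return np.exp(x)

# Midpoint grid for numerical integration over [0, 1]
n_grid = 20000
x = (np.arange(n_grid) + 0.5) / n_grid
dx = 1.0 / n_grid

# Series coefficients alpha_j = integral of r1(x) * psi_j(x) dx, j = 0, ..., J
J = 200
alpha = np.array([np.sum(r1(x) * psi(j, x)) * dx for j in range(J + 1)])

def integrated_squared_bias(p):
    """Integral of b_p(x)^2, where b_p = r1 - sum_{j<=p} alpha_j psi_j."""
    r1p = sum(alpha[j] * psi(j, x) for j in range(p + 1))
    return np.sum((r1(x) - r1p) ** 2) * dx

energy = np.sum(r1(x) ** 2) * dx            # integral of r1(x)^2
bias = [integrated_squared_bias(p) for p in (1, 5, 20)]

print("sum of alpha_j^2:", np.sum(alpha ** 2), " integral of r1^2:", energy)
print("integrated squared bias for p = 1, 5, 20:", bias)
```

In this sketch the coefficients shrink as $j$ grows and the integrated squared bias falls as $p$ increases, which is the property the truncation argument relies on.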
The ordinary least squares estimator

$$
\hat{\beta} = (\Psi'\Psi)^{-1}\Psi'X = \left(\sum_{t=2}^{T} \psi_t \psi_t'\right)^{-1} \sum_{t=2}^{T} \psi_t X_t,
$$

where $\Psi = (\psi_2', \ldots, \psi_T')'$ is a $(T-1) \times (p+1)$ matrix, $X = (X_2, \ldots, X_T)'$, and $\psi_t = [\psi_0(X_{t-1}), \psi_1(X_{t-1}), \ldots, \psi_p(X_{t-1})]'$ is a $(p+1) \times 1$ vector. The series-based regression estimator is

$$
\hat{r}_{1p}(x) = \sum_{j=0}^{p} \hat{\beta}_j \psi_j(x).
$$

To ensure that $\hat{r}_{1p}(x)$ is asymptotically unbiased, we must let $p = p(T) \to \infty$ as $T \to \infty$ (e.g., $p = \sqrt{T}$). However, if $p$ is too large, the number of estimated parameters will be too large, and as a consequence, the sampling variation of $\hat{\beta}$ will be large (i.e., the estimator $\hat{\beta}$ is imprecise). We must choose an appropriate $p = p(T)$ so as to balance the bias and the sampling variation.

The truncation order $p$ is called a smoothing parameter because it controls the smoothness of the estimated function $\hat{r}_{1p}(x)$. In general, for any given sample, a small $p$ will give a smooth estimated curve whereas a large $p$ will give a wiggly estimated curve. If $p$ is so large that the variance of $\hat{r}_{1p}(x)$ is larger than its squared bias, we say that there is undersmoothing. In contrast, if $p$ is so small that the variance of $\hat{r}_{1p}(x)$ is smaller than its squared bias, we say that there is oversmoothing. Optimal smoothing is achieved when the variance of $\hat{r}_{1p}(x)$ balances its squared bias.

The series estimator $\hat{r}_{1p}(x)$ is called a global smoothing method, because once $p$ is given, the estimated function $\hat{r}_{1p}(x)$ is determined over the entire domain of $X_t$.

Under suitable regularity conditions, $\hat{r}_{1p}(x)$ will consistently estimate the unknown function $r_1(x)$ as the sample size $T$ increases. This is called nonparametric estimation because no parametric functional form is imposed on $r_1(x)$.

The basis functions $\{\psi_j(\cdot)\}$ can be the Fourier series (i.e., the sine and cosine functions), or B-spline functions if $X_t$ has a bounded support. See, e.g., Andrews (1991, Econometrica) and Hong and White (1995, Econometrica) for applications.
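The series regression estimator can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the procedure of the cited papers: the data are simulated from a hypothetical model $X_t = 0.9\sin(X_{t-1}) + \varepsilon_t$, the lagged regressor is rescaled to $[0,1]$ using the sample minimum and maximum, and a cosine basis is used for $\psi_j$. The helper names `design` and `fit_series` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def r1(x):
    """Hypothetical true regression function used to simulate the data."""
    return 0.9 * np.sin(x)

# Simulate X_t = r1(X_{t-1}) + eps_t
T = 2000
X = np.zeros(T)
for t in range(1, T):
    X[t] = r1(X[t - 1]) + 0.3 * rng.standard_normal()

y, lag = X[1:], X[:-1]            # X_t and X_{t-1} for t = 2, ..., T
lo, hi = lag.min(), lag.max()     # used to rescale the regressor to [0, 1]

def design(x, p):
    """Matrix with columns psi_0(x), ..., psi_p(x) (cosine basis, assumed)."""
    u = (np.asarray(x, dtype=float) - lo) / (hi - lo)
    cols = [np.ones_like(u)]
    cols += [np.sqrt(2.0) * np.cos(j * np.pi * u) for j in range(1, p + 1)]
    return np.column_stack(cols)

def fit_series(p):
    """OLS coefficients of X_t on psi_0(X_{t-1}), ..., psi_p(X_{t-1})."""
    Psi = design(lag, p)
    beta_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    return beta_hat

# Compare the fitted curve with the true function on an interior grid
grid = np.linspace(np.quantile(lag, 0.05), np.quantile(lag, 0.95), 200)
mse = {}
for p in (0, 2, 6, 40):
    r_hat = design(grid, p) @ fit_series(p)
    mse[p] = np.mean((r_hat - r1(grid)) ** 2)
    print(f"p = {p:2d}: mean squared error on grid = {mse[p]:.5f}")
```

Running the loop for different values of $p$ gives a rough feel for the bias and variance trade-off: very small $p$ cannot follow the shape of the true function, while very large $p$ fits more sampling noise.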
Example 2 [Probability Density Function]: Suppose the PDF $g(x)$ of $X_t$ is a smooth function with unbounded support. We can expand

$$
g(x) = \phi(x) \sum_{j=0}^{\infty} \beta_j H_j(x),
$$

where the function

$$
\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)
$$

is the $N(0,1)$ density function, and $\{H_j(x)\}$ is the sequence of Hermite polynomials, defined as

$$
(-1)^j \frac{d^j}{dx^j}\Phi(x) = -H_{j-1}(x)\phi(x) \quad \text{for } j > 0,
$$

where $\Phi(\cdot)$ is the $N(0,1)$ CDF. For example,

$$
\begin{aligned}
H_0(x) &= 1, \\
H_1(x) &= x, \\
H_2(x) &= x^2 - 1, \\
H_3(x) &= x(x^2 - 3), \\
H_4(x) &= x^4 - 6x^2 + 3.
\end{aligned}
$$

See, for example, Magnus, Oberhettinger and Soni (1966, Section 5.6) and Abramowitz and Stegun (1972, Ch. 22).

Here, the Fourier coefficient

$$
\beta_j = \int_{-\infty}^{\infty} g(x) H_j(x) \phi(x)\,dx.
$$

Again, $\beta_j \to 0$ as $j \to \infty$ given $\sum_{j=0}^{\infty} \beta_j^2 < \infty$.

The $N(0,1)$ PDF $\phi(x)$ is the leading term to approximate the unknown density $g(x)$, and the Hermite polynomial series will capture departures from normality (e.g., skewness and heavy tails).

To estimate $g(x)$, we can consider the sequence of truncated probability densities

$$
g_p(x) = C_p^{-1} \phi(x) \sum_{j=0}^{p} \beta_j H_j(x),
$$

where the constant

$$
C_p = \sum_{j=0}^{p} \beta_j \int H_j(x)\phi(x)\,dx
$$

is a normalization factor to ensure that $g_p(x)$ is a PDF for each $p$. The unknown parameters $\{\beta_j\}$ can be estimated from the sample $\{X_t\}_{t=1}^{T}$ via the maximum likelihood estimation (MLE) method. For example, suppose $\{X_t\}$ is an IID sample. Then

$$
\hat{\beta} = \arg\max \sum_{t=1}^{T} \ln \hat{g}_p(X_t).
$$

To ensure that

$$
\hat{g}_p(x) = \hat{C}_p^{-1} \phi(x) \sum_{j=0}^{p} \hat{\beta}_j H_j(x)
$$

is asymptotically unbiased, we must let $p = p(T) \to \infty$ as $T \to \infty$. However, $p$ must grow more slowly than the sample size $T$ grows to infinity so that the sampling variation of $\hat{\beta}$ will not be too large.

For the use of Hermite polynomial series expansions, see, e.g., Gallant and Tauchen (1996, Econometric Theory), Aït-Sahalia (2002, Econometrica), and Cui, Hong and Li (2020).

Question: What are the advantages of nonparametric smoothing methods?

They require few assumptions or restrictions on the data generating process. In particular, they do not assume a specific functional form for the function of interest (of course, certain smoothness conditions such as differentiability are required). They can deliver a consistent estimator for the unknown function, no matter whether it is linear or nonlinear. Thus, nonparametric methods can effectively reduce potential systematic biases due to model misspecification, which is more likely to be encountered in parametric modeling.

Question: What are the disadvantages of nonparametric methods?

Nonparametric methods require a large data set for reasonable estimation. Furthermore, there exists a notorious problem of "curse of dimensionality" when the function of interest contains multiple explanatory variables. This will be explained below.

There exists another notorious "boundary effect" problem for nonparametric estimation near the boundary regions of the support. This occurs due to asymmetric coverage of data in the boundary regions.
Coefficients are usually difficult to interpret from an economic point of view.

There exists a danger of potential overfitting, in the sense that a nonparametric method, due to its flexibility, tends to capture non-essential features in the data that will not appear in out-of-sample scenarios.

The above two motivating examples are the so-called orthogonal series expansion methods. There are other nonparametric methods, such as spline smoothing, kernel smoothing, k-nearest neighbor, and local polynomial smoothing. As mentioned earlier, series expansion methods are examples of so-called global smoothing, because the coefficients are estimated using all observations, and they are then used to evaluate the values of the underlying function over all points in the support of $X_t$. A nonparametric series model is an increasing sequence of parametric models, as the sample size $T$ grows. In this sense, it is also called a sieve estimator. In contrast, kernel and local polynomial methods are examples of the so-called local smoothing methods, because estimation only requires the observations in a neighborhood of the point of interest. Below we will mainly focus on kernel and local polynomial smoothing methods, due to their simplicity and intuitive nature.

2 Kernel Density Method

2.1 Univariate Density Estimation

Suppose $\{X_t\}$ is a strictly stationary time series process with unknown marginal PDF $g(x)$.

Question: How to estimate the marginal PDF $g(x)$ of the time series process $\{X_t\}$?

We first consider a parametric approach. Assume that $g(x)$ is an $N(\mu, \sigma^2)$ PDF with unknown $\mu$ and $\sigma^2$. Then we know the functional form of $g(x)$ up to two unknown parameters $\theta = (\mu, \sigma^2)'$:

$$
g(x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right], \quad -\infty < x < \infty.
$$
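As a small illustration of the parametric approach, the sketch below evaluates the $N(\mu, \sigma^2)$ density with $\theta$ replaced by estimates. It rests on assumptions not made in the text above: the sample is a simulated IID Gaussian sample (hypothetical values $\mu = 1$, $\sigma^2 = 4$), and $\theta$ is estimated by the Gaussian maximum likelihood estimators, namely the sample mean and the sample variance with divisor $T$.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(mu, sigma2) density g(x, theta) with theta = (mu, sigma2)'."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Hypothetical IID sample, used only for illustration
rng = np.random.default_rng(1)
X = rng.normal(loc=1.0, scale=2.0, size=5000)

# Gaussian MLE of theta: sample mean and sample variance (divisor T)
mu_hat = X.mean()
sigma2_hat = np.mean((X - mu_hat) ** 2)

# Plug-in density estimate on a grid
grid = np.linspace(-10.0, 12.0, 2001)
g_hat = normal_pdf(grid, mu_hat, sigma2_hat)

print("mu_hat =", mu_hat, " sigma2_hat =", sigma2_hat)
print("approximate integral of g_hat:", np.sum(g_hat) * (grid[1] - grid[0]))
```

The plug-in density is only as good as the normality assumption: if the true $g(x)$ is skewed or heavy-tailed, no choice of $\theta$ will reproduce it.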