Time Discrete, Hidden Markov Models: Our results will in part be valid in an even more general time-discrete setting which also covers Hidden Markov Models: we start with
\[
P(X_0 \in A) = \int_A p^{X_0}_0(x)\,\mu_0(dx) \tag{2.7}
\]
and assume that the unobservable state evolves according to a Markov transition:
\[
P(X_t \in A \mid X_{t-1}=x_{t-1},\dots,X_0=x_0) = P(X_t \in A \mid X_{t-1}=x_{t-1}) = \int_A p^{X_t|X_{t-1}=x_{t-1}}_t(x)\,\mu_t(dx). \tag{2.8}
\]
Again we only have a transformation $Y_t$ of $X_t$ available, which in this case is distributed according to
\[
P(Y_t \in B \mid X_t=x_t) = \int_B q^{Y_t|X_t=x_t}_t(y)\,\nu_t(dy). \tag{2.9}
\]
In this setting, we assume known (and existing) [conditional] densities $p^{X_0}_0$, $p^{X_t|X_{t-1}=x_{t-1}}_t$, $q^{Y_t|X_t=x_t}_t$.

Somewhere in between the model formulation of this paragraph and the Euclidean SSM one may place the dynamic (generalized) linear models as discussed in West et al. (1985) and West and Harrison (1989). These are also covered by Theorem 3.3 as soon as a squared error makes sense in the state space.

Continuous setting: In applications of Mathematical Finance we also need to cover continuous-time settings, as given by an unobservable state evolving according to the SDE
\[
dX_t = f(t,X_t)\,dt + q(t,X_t)\,dW_t \tag{2.10}
\]
where, for consistency, we observe $Y_t$ according to
\[
dY_t = z(t,X_t)\,dt + v(t)\,dW'_t, \qquad Y_0 = 0. \tag{2.11}
\]
For $X_0$ we assume (2.7), while $W_t$, $W'_t$ are independent Wiener processes, and $f$, $q$, $z$, $v$ are suitably measurable, known functions.

This formulation with a time-continuous observation process as in (2.11) may be found in Tang (1998) and James (2005). More often, however, observations will be made discretely, so that a formulation like the one of Nielsen et al. (2000) and Singer (2002) is more adequate, i.e., for discrete times $t_1 < \dots < t_N$ we have observations
\[
Y_{t_k} = z_{t_k}(X_{t_k}) + \varepsilon_{t_k}. \tag{2.12}
\]
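To fix ideas, a trajectory of the continuous-discrete model (2.10)-(2.12) can be simulated by an Euler–Maruyama discretization of the state SDE, with observations drawn on the coarser grid $t_1 < \dots < t_N$. The following Python sketch is purely illustrative: the concrete choices of $f$, $q$, $z$ and the noise scale are hypothetical placeholders, not part of the model assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) choices for f, q, z; the model only requires
# them to be suitably measurable, known functions.
f = lambda t, x: -0.5 * x        # drift in (2.10)
q = lambda t, x: 1.0             # diffusion coefficient in (2.10)
z = lambda t, x: x               # observation function in (2.12)
sigma_eps = 0.1                  # std. dev. of the observation noise eps_{t_k}

dt = 1e-3                        # Euler-Maruyama step size
steps_per_obs = 1000             # one discrete observation per unit of time
n_obs = 10                       # number of observation times t_1 < ... < t_N

x, t = 0.0, 0.0                  # start in X_0 = 0 for simplicity
Y_obs = []
for k in range(n_obs * steps_per_obs):
    # Euler-Maruyama step for dX_t = f(t, X_t) dt + q(t, X_t) dW_t
    x += f(t, x) * dt + q(t, x) * np.sqrt(dt) * rng.standard_normal()
    t += dt
    if (k + 1) % steps_per_obs == 0:
        # discrete observation Y_{t_k} = z_{t_k}(X_{t_k}) + eps_{t_k}, cf. (2.12)
        Y_obs.append(z(t, x) + sigma_eps * rng.standard_normal())
```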
In this context, a straightforward approach linearizes the corresponding functions $f$ and $z$ to give the (continuous-discrete) Extended Kalman Filter (EKF) or, improved to second-order moment fitting, the second-order nonlinear filter (SNF) introduced in Jazwinski (1970); also confer Singer (2002, sec. 4.3.1). After this linearization we are again in the context of a (time-inhomogeneous) linear SSM, hence the methodology we develop in the sequel applies to this setting as well (for a minimal code sketch of the linearization idea, see below, after the discussion of controls).

More recently, approaches to improve on this simple linearization have been introduced, notably the unscented Kalman filter (UKF) (Julier et al., 2000) and Hermite expansions as in Aït-Sahalia (2002). We do not cover them here, though. For a survey of these methods, confer Singer (2002, sec. 4.3). For techniques to deal with non-linear time-discrete situations, see Tanizaki (1996).

Control: Going one more step ahead, to cover applications such as optimal portfolio selection, we may allow for controls $U_t$ to be set or determined by the statistician and to be fed back into the state equation. In the context of the continuous-time model from (2.10) and (2.12), this is also known as SDEX; confer Nielsen et al. (2000).

In this setting, the controls $U_t$ are assumed measurable w.r.t. $\sigma(Y_s,\ s < t)$, or usually even measurable w.r.t. $\sigma(Y_{t-})$. To integrate these controls into our setting, we just have to generalize the functions $f$, $z$, $q$ and the densities $p^{\,\cdot|\cdot}_t$, $q^{\,\cdot|\cdot}_t$ to $f = f(t, X_t, U_t)$ (and $z$, $q$ likewise), and to modify
\[
p^{\,\cdot|\cdot}_t = p^{X_t|X_{t-1}=x_{t-1},\,U_{t-1}=u_{t-1}}_t(x), \qquad
q^{\,\cdot|\cdot}_t = q^{Y_t|X_t=x_t,\,U_{t-1}=u_{t-1}}_t(y).
\]
For the application of stochastic control to portfolio optimization, confer Korn (1997).
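As announced above, here is a minimal Python sketch of the linearization idea behind the EKF, written as one cycle of a discrete-time EKF: it propagates with an (assumed already discretized) linear state transition and linearizes the observation function around the prediction. The continuous-discrete variant would instead integrate moment equations driven by $f$ and its Jacobian between observation times; all names and signatures here are illustrative, not taken from the references.

```python
import numpy as np

def ekf_step(x_prev, P_prev, y, F, Q, z_fun, z_jac, V):
    """One cycle of a discrete-time EKF: linear prediction, then a
    Kalman-type correction after linearizing the observation function
    z around the predicted state. All arguments are illustrative."""
    # prediction with an (already discretized) linear state transition
    x_pred = F @ x_prev
    P_pred = F @ P_prev @ F.T + Q
    # linearize: z(x) ~ z(x_pred) + Z (x - x_pred), Z the Jacobian of z
    Z = z_jac(x_pred)
    S = Z @ P_pred @ Z.T + V                 # innovation covariance
    K = P_pred @ Z.T @ np.linalg.inv(S)      # extended Kalman gain
    x_filt = x_pred + K @ (y - z_fun(x_pred))
    P_filt = (np.eye(len(x_filt)) - K @ Z) @ P_pred
    return x_filt, P_filt
```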
2.2 Deviations from the ideal model

As usual in Robust Statistics, we do not confine ourselves to ideal model assumptions but rather allow for (small) deviations from these assumptions, most prominently generated by outliers. In our notation, sub/superscript "id" denotes the ideal setting, "di" the distorting (contaminating) situation, and "re" the realistic, contaminated situation.

Contrary to the independent setting, outliers may occur in quite different ways: Following the terminology of Fox (1972), we distinguish innovation outliers (or IO's) and additive outliers (or AO's). Historically, AO's denote gross errors affecting the observation errors, i.e.,
\[
\text{AO:}\qquad \varepsilon^{\rm re}_t \sim (1-r_{\rm AO})\,\mathcal{L}(\varepsilon^{\rm id}_t) + r_{\rm AO}\,\mathcal{L}(\varepsilon^{\rm di}_t) \tag{2.13}
\]
where $\mathcal{L}(\varepsilon^{\rm di}_t)$ is arbitrary, unknown and uncontrollable, and $0 \le r_{\rm AO} \le 1$ is the AO-contamination radius, i.e., the probability for an AO. IO's, on the other hand, are usually defined as outliers which affect the innovations,
\[
\text{IO:}\qquad v^{\rm re}_t \sim (1-r_{\rm IO})\,\mathcal{L}(v^{\rm id}_t) + r_{\rm IO}\,\mathcal{L}(v^{\rm di}_t) \tag{2.14}
\]
where again $\mathcal{L}(v^{\rm di}_t)$ is arbitrary, unknown and uncontrollable, and $0 \le r_{\rm IO} \le 1$ is the corresponding IO-contamination radius.

We stick to this distinction for consistency with the literature, although we will rather use these terms in the following sense: IO's denote endogenous outliers affecting the state equation in general, hence distorting several subsequent states. This also covers level shifts or linear trends; if $|F_t| < 1$, these are not included in the classical definition, as IO's would then decay geometrically in $t$. We also extend the meaning of AO's to denote general exogenous outliers which enter the observation equation only and thus only cause distortions at single time points. This also covers substitutive outliers or SO's, defined as
\[
\text{SO:}\qquad Y^{\rm re}_t \sim (1-r_{\rm SO})\,\mathcal{L}(Y^{\rm id}_t) + r_{\rm SO}\,\mathcal{L}(Y^{\rm di}_t) \tag{2.15}
\]
where again $\mathcal{L}(Y^{\rm di}_t)$ is arbitrary, unknown and uncontrollable, and $0 \le r_{\rm SO} \le 1$ is the corresponding SO-contamination radius.

Apparently, the SO-ball of radius $r$, consisting of all $\mathcal{L}(Y^{\rm re}_t)$ according to (2.15), contains the corresponding AO-ball of the same radius when $Y^{\rm re}_t = Z_t X_t + \varepsilon^{\rm re}_t$. However, for technical reasons, we make the additional assumption that
\[
Y^{\rm id}_t,\ Y^{\rm di}_t \quad \text{are stochastically independent,} \tag{2.16}
\]
and then the "contains" relation no longer holds.

The more general definitions of AO's and IO's in the sequel will be labeled "wide-sense" to distinguish them from the "narrow-sense" definitions (2.13) and (2.14).

Remark 2.1. Whether (narrow-sense) AO's or SO's are better suited to capture model deviations will depend on the actual application; seen from mathematical operability, SO's are clearly easier to treat, compare Remark 3.4(b). They will also lead to different least favorable situations, compare Remark 3.4(d).

Different and competing goals are induced by endogenous and exogenous outliers: In the presence of (wide-sense) AO's we would like to attenuate their effect to avoid "false alarms", while when there are (wide-sense) IO's, the usual goal in online applications would be tracking, i.e., to detect structural changes as fast as possible and/or to react to the changed situation.

Obviously we are faced with an identification problem here: Immediately after a suspicious observation, we cannot tell (wide-sense) AO's from (wide-sense) IO's. Such a simultaneous treatment will only be possible with a certain delay; see Section 5.
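To make the contamination balls (2.13)-(2.15) concrete, here is a small Python sketch drawing "realistic" errors from such a mixture. The distorting law $\mathcal{N}(10, 0.1)$ anticipates the example of Section 2.3 (reading its second parameter as a variance is our assumption), and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminated_sample(ideal, distorting, r, size, rng):
    """Draw `size` samples from (1 - r) * L(ideal) + r * L(distorting),
    the mixture form shared by (2.13), (2.14) and (2.15)."""
    hit = rng.random(size) < r          # which draws are outliers
    out = ideal(size)                   # start from the ideal law
    out[hit] = distorting(hit.sum())    # replace the contaminated ones
    return out, hit

# AO's in the sense of (2.13): ideal N(0,1) observation errors,
# distorting law N(10, 0.1) (second parameter read as variance), r_AO = 0.1
eps_re, is_AO = contaminated_sample(
    lambda n: rng.standard_normal(n),
    lambda n: rng.normal(10.0, np.sqrt(0.1), n),
    r=0.1, size=100, rng=rng)
```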
In other, more off-line situations, such as spectral analysis for low-flow estimation or for inter-individual heart frequency spectra, one would like to recover the situation without structural changes, and hence a cleaning from both (wide-sense) IO's and AO's is required; after this cleaning, the powerful instruments of spectral analysis become available. For this and other issues in robust density estimation, confer Kleiner et al. (1979) and Spangl (2008). We will not pursue this goal in this paper, however.

2.3 Example: Steady State Model

Our running example will be a one-dimensional steady state model with hyper-parameters
\[
p = q = 1, \qquad F_t = Z_t = 1, \qquad \text{in the ideal model: } v_t, \varepsilon_t \stackrel{\rm i.i.d.}{\sim} \mathcal{N}(0,1). \tag{2.17}
\]
In Figure 1, we display a typical realization of an SSM in model (2.17), where outliers are generated according to $r_{\rm IO} = r_{\rm AO} = 0.1$, $v^{\rm di}_t, \varepsilon^{\rm di}_t \stackrel{\rm i.i.d.}{\sim} \mathcal{N}(10, 0.1)$.

[Figure 1 here: four panels, "1-dim steady state - ideal" (shown twice), "1-dim steady state - under AO", and "1-dim steady state - under IO", each plotting the state X and the observations Y over time.]

Figure 1: Model (2.17) in the ideal model and under (narrow-sense) AO's and IO's; while AO's only affect single observations, under IO's we never return to the original level. Instances of outliers are marked with red circles.
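A realization as displayed in Figure 1 can be reproduced along the following lines; this is a sketch under the contamination scheme just stated, again reading the second parameter of $\mathcal{N}(10, 0.1)$ as a variance.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_steady_state(T=100, r_IO=0.0, r_AO=0.0, rng=rng):
    """Simulate model (2.17): X_t = X_{t-1} + v_t, Y_t = X_t + eps_t,
    with ideal N(0,1) errors; contaminated errors are drawn from
    N(10, 0.1) with the given radii, as in (2.13) and (2.14)."""
    X, Y = np.empty(T), np.empty(T)
    x = 0.0
    for t in range(T):
        v = rng.normal(10.0, np.sqrt(0.1)) if rng.random() < r_IO \
            else rng.standard_normal()
        x += v                                   # state equation
        eps = rng.normal(10.0, np.sqrt(0.1)) if rng.random() < r_AO \
            else rng.standard_normal()
        X[t], Y[t] = x, x + eps                  # observation equation
    return X, Y

X_id, Y_id = simulate_steady_state()             # ideal model
X_ao, Y_ao = simulate_steady_state(r_AO=0.1)     # under (narrow-sense) AO's
X_io, Y_io = simulate_steady_state(r_IO=0.1)     # under (narrow-sense) IO's
```

Varying $r_{\rm IO}$ and $r_{\rm AO}$ reproduces the qualitative behavior of Figure 1: isolated spikes in $Y$ under AO's, and persistent level shifts under IO's.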
2.4 Classical Method: Kalman–Filter

Filter Problem The most important problem in the SSM formulation is to somehow reconstruct the unobservable states $X_t$ based on the observations $Y_t$. For abbreviation, let us denote
\[
Y_{1:t} = (Y_1, \dots, Y_t), \qquad Y_{1:0} := \emptyset. \tag{2.18}
\]
Then, using mean squared error (MSE) risk, the reconstruction problem becomes
\[
\mathbb{E}\,\big|X_t - f_t(Y_{1:s})\big|^2 = \min_{f_t} \tag{2.19}
\]
Depending on the horizon $s$ of the observations used to reconstruct $X_t$, we speak of a prediction problem for $s < t$, of a filtering problem if $s = t$, and of a smoothing problem if $s > t$. In the sequel we will confine ourselves to the filtering problem.

Kalman–Filter It is well known that the general solution to (2.19) is the corresponding conditional expectation $\mathbb{E}[X_t|Y_{1:s}]$. Except for the Gaussian case, however, this exact conditional expectation is rather expensive to compute. Hence, similarly to the Gauss-Markov setting, it is a natural restriction to confine oneself to linear filters. In this context, the seminal work of Kalman (1960) (discrete-time setting) and Kalman and Bucy (1961) (continuous-time setting) introduced a recursive scheme to compute this optimal linear filter:
\[
\text{Initialization:}\qquad X_{0|0} = a_0, \qquad \Sigma_{0|0} = Q_0 \tag{2.20}
\]
\[
\text{Prediction:}\qquad X_{t|t-1} = F_t X_{t-1|t-1}, \qquad \Sigma_{t|t-1} = F_t \Sigma_{t-1|t-1} F_t^\tau + Q_t \tag{2.21}
\]
\[
\text{Correction:}\qquad X_{t|t} = X_{t|t-1} + M^0_t \Delta Y_t, \quad \Delta Y_t = Y_t - Z_t X_{t|t-1}, \quad M^0_t = \Sigma_{t|t-1} Z_t^\tau \Delta_t^{-1},
\]
\[
\Sigma_{t|t} = (\mathbb{I}_p - M^0_t Z_t)\,\Sigma_{t|t-1}, \qquad \Delta_t = Z_t \Sigma_{t|t-1} Z_t^\tau + V_t \tag{2.22}
\]
where $\Sigma_{t|t} = \mathrm{Cov}(X_t - X_{t|t})$, $\Sigma_{t|t-1} = \mathrm{Cov}(X_t - X_{t|t-1})$, and $M^0_t$ is the so-called Kalman gain; a code sketch of these recursions is given at the end of this subsection. Using orthogonality of $\{\Delta Y_t\}_t$, we may set up similar recursions for the corresponding best linear smoother; see, e.g., Anderson and Moore (1979) or Durbin and Koopman (2001).

Optimality of the Kalman–Filter To see that the (classical) Kalman filter solves problem (2.19) (for $s = t$) among all linear filters, let us write
\[
\mathrm{lin}(X) := \text{closed linear space generated by } X \tag{2.23}
\]
\[
\mathrm{oP}(\,\cdot\,|X) := \text{orthogonal projection onto } \mathrm{lin}(X) \tag{2.24}
\]
and define (recursively)
\[
\Delta Y_t = Y_t - \mathrm{oP}(Y_t|Y_{1:t-1}). \tag{2.25}
\]
Hence the $\Delta Y_t$ are mutually orthogonal and
\[
X_{t|t-1} = \mathrm{oP}(X_t|Y_{1:t-1}) = F_t\,\mathrm{oP}(X_{t-1}|Y_{1:t-1}) = F_t X_{t-1|t-1} \tag{2.26}
\]
\[
X_{t|t} = \mathrm{oP}(X_t|Y_{1:t}) = \mathrm{oP}(X_t|Y_{1:t-1}) + \mathrm{oP}(X_t|\Delta Y_t) = X_{t|t-1} + \mathrm{oP}(X_t - X_{t|t-1}|\Delta Y_t) = X_{t|t-1} + M^0_t \Delta Y_t \tag{2.27}
\]
For later purposes, we also introduce a symbol for the prediction error,
\[
\Delta X_t = X_t - X_{t|t-1}. \tag{2.28}
\]
Similarly to the Gauss-Markov Theorem, under normality, i.e., assuming (2.3), (2.4), (2.5), this optimality extends as follows: $X_{t|t[-1]} = \mathbb{E}[X_t|Y_{1:t[-1]}]$, i.e., the Kalman filter is optimal among all $Y_{1:t[-1]}$-measurable filters. It is also the posterior mode of $\mathcal{L}(X_t|Y_{1:t})$, and $X_{t|t}$ can also be seen to be the ML estimator in a regression model with random parameter; for the last property, compare Duncan and Horn (1972).
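To conclude this subsection, here is a minimal Python sketch of the recursions (2.20)-(2.22) for time-invariant hyper-parameters; the notation follows the display above, while the function signature and the demo data are of course only one possible choice.

```python
import numpy as np

def kalman_filter(Y, F, Z, Q, V, a0, Q0):
    """Classical Kalman filter, recursions (2.20)-(2.22), for
    time-invariant hyper-parameters F, Z, Q, V (ndarrays).
    Y has shape (T, q); returns the filtered states X_{t|t}."""
    x, S = a0.copy(), Q0.copy()              # initialization (2.20)
    p = len(a0)
    X_filt = []
    for y in Y:
        x = F @ x                            # prediction (2.21)
        S = F @ S @ F.T + Q
        Delta = Z @ S @ Z.T + V              # innovation covariance Delta_t
        M = S @ Z.T @ np.linalg.inv(Delta)   # Kalman gain M_t^0
        x = x + M @ (y - Z @ x)              # correction (2.22)
        S = (np.eye(p) - M @ Z) @ S
        X_filt.append(x)
    return np.array(X_filt)

# demo on the steady state model (2.17): all hyper-parameters equal to 1
rng = np.random.default_rng(0)
one = np.eye(1)
X_true = np.cumsum(rng.standard_normal(100))             # state random walk
Y = (X_true + rng.standard_normal(100)).reshape(-1, 1)   # Y_t = X_t + eps_t
X_filt = kalman_filter(Y, one, one, one, one, np.zeros(1), one)
```

For model (2.17), the gain $M^0_t$ converges to a constant as $t$ grows, so that the filter eventually acts as a fixed exponential smoother of the observations.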