Model–data synthesis in terrestrial carbon observation: methods, data requirements and data uncertainty specifications

M. R. RAUPACH*, P. J. RAYNER†, D. J. BARRETT‡, R. S. DEFRIES§, M. HEIMANN¶, D. S. OJIMA∥, S. QUEGAN** and C. C. SCHMULLIUS††

*CSIRO Earth Observation Centre, GPO Box 3023, Canberra, ACT 2601, Australia, †LSCE-CEA de Saclay Orme des Merisiers, 91191 Gif/Yvette, France, ‡CSIRO Land and Water, Canberra, ACT 2601, Australia, §Department of Geography, University of Maryland, College Park, MD 20742, USA, ¶Department of “Biogeochemical Systems”, Max-Planck-Institut für Biogeochemie, D-07701 Jena, Germany, ∥Natural Resource Ecology Laboratory, Colorado State University, Fort Collins, CO 80523-1499, USA, **Centre for Terrestrial Carbon Dynamics, University of Sheffield, Sheffield S3 7RH, UK, ††Institute für Geographie, Friedrich-Schiller-Universität, D-07743 Jena, Germany.

Abstract

Systematic, operational, long-term observations of the terrestrial carbon cycle (including its interactions with water, energy and nutrient cycles and ecosystem dynamics) are important for the prediction and management of climate, water resources, food resources, biodiversity and desertification. To contribute to these goals, a terrestrial carbon observing system requires the synthesis of several kinds of observation into terrestrial biosphere models encompassing the coupled cycles of carbon, water, energy and nutrients. Relevant observations include atmospheric composition (concentrations of CO2 and other gases); remote sensing; flux and process measurements from intensive study sites; in situ vegetation and soil monitoring; weather, climate and hydrological data; and contemporary and historical data on land use, land use change and disturbance (grazing, harvest, clearing, fire).

A review of model–data synthesis tools for terrestrial carbon observation identifies ‘nonsequential’ and ‘sequential’ approaches as major categories, differing according to whether data are treated all at once or sequentially. The structure underlying both approaches is reviewed, highlighting several basic commonalities in formalism and data requirements. An essential commonality is that for all model–data synthesis problems, both nonsequential and sequential, data uncertainties are as important as data values themselves and have a comparable role in determining the outcome.

Given the importance of data uncertainties, there is an urgent need for soundly based uncertainty characterizations for the main kinds of data used in terrestrial carbon observation. The first requirement is a specification of the main properties of the error covariance matrix. As a step towards this goal, semi-quantitative estimates are made of the main properties of the error covariance matrix for four kinds of data essential for terrestrial carbon observation: remote sensing of land surface properties, atmospheric composition measurements, direct flux measurements, and measurements of carbon stores.

Received 8 June 2004; accepted 25 August 2004

Correspondence: Dr Michael Raupach, tel. +61 2 6246 5573, e-mail: Michael.Raupach@csiro.au

Global Change Biology (2005) 11, 378–397, doi: 10.1111/j.1365-2486.2005.00917.x. © 2005 Blackwell Publishing Ltd
Introduction

Systematic earth observation implies the collection and interpretation of multiple kinds of data about the evolving state of the earth system across wide spatial domains and over extended time periods. Three factors have caused a massive acceleration in earth observation activities over recent years. The first is need: global change is raising issues – such as greenhouse-induced climate change, water shortages and imbalances, land degradation, soil erosion, loss of biodiversity – which require informed human responses at both global and regional levels. Second, technological advances in sensors, satellite systems and data storage and processing capabilities are making possible observations and interpretations which were out of reach only a few years ago and unimaginable a few decades ago. Third, the synthesis of formerly discrete disciplines into a unified Earth System Science is driving new hypotheses about the dynamics of the earth system and the interconnectedness of its components, including humans. Systematic earth observation motivates and tests these hypotheses.

The focus of this paper is observation of the carbon cycle, and in particular its land–atmosphere components, as one part of an integrated earth observation system. It is a significant part because of the coupling between the carbon cycle and the terrestrial cycles of water, energy and nutrients, and the connections of all these biospheric processes with global climate and human activities (Field & Raupach, 2004; Raupach et al., 2004). The carbon cycle is integral to the growth and decay of vegetation, maintains the water cycle through transpiration and provides habitat for maintaining biodiversity. Thus, terrestrial carbon observation is important for climate observation and prediction, for the management of water resources, nutrients and biodiversity, and for monitoring and managing the enhanced greenhouse effect.

It is increasingly recognized that strategies for earth observation (including terrestrial carbon observation) require methods for combining data and process models in systematic ways. This is leading to research towards the application in terrestrial carbon observation (and in earth observation more generally) of ‘model–data synthesis’, the combination of the information contained in both observations and models through both parameter-estimation and data-assimilation techniques. Motivations for model–data synthesis approaches include (1) model testing and data quality control (through systematic checks for agreement within specified uncertainty bands for both data and model); (2) interpolation of spatially and temporally sparse observations; (3) inference from available observations of quantities which are not directly observable (such as carbon stores and fluxes over large areas) and (4) forecasting (prediction forward in time on the basis of past and current observations).

The present paper arose from a workshop held in Sheffield, UK, 3–6 June 2003, to further the development of a Terrestrial Carbon Observation System (TCOS) with a particular emphasis on model–data synthesis. Antecedents for this effort were (1) preliminary steps toward a TCOS (Cihlar et al., 2002a, b, c); (2) a wider concept for an Integrated Global Carbon Observing Strategy including atmosphere, oceans, land and human activities (Ciais et al., 2004) and (3) the research program of the Global Carbon Project (Global Carbon Project, 2003).

The paper is founded on three themes arising from the Sheffield workshop. First, model–data synthesis, based on terrestrial biosphere models constrained with multiple kinds of observation, is an essential component of a TCOS. Second, from the standpoint of model–data synthesis, data uncertainties are as important as data values themselves and have a comparable role in determining the outcome. Third, and consequently, there is an urgent need for soundly based uncertainty specifications for the main kinds of data used in terrestrial carbon observation. These themes are developed as follows: the next section summarizes major purposes and attributes of a TCOS. ‘Model–data synthesis: methods’ provides an overview of model–data synthesis in the context of terrestrial carbon observation, by briefly describing some of the main methods, indicating their common characteristics, and highlighting the key role of data uncertainty. ‘Model–data synthesis: examples’ provides some examples. ‘Data characteristics: uncertainty in measurement and representation’ undertakes a survey of the uncertainty characteristics of the main kinds of relevant data.

Purposes and attributes of a TCOS

A succinct statement of the overall purpose of a TCOS might be: to operationally monitor the cycles of carbon and related entities (water, energy, nutrients) in the terrestrial biosphere, in support of comprehensive, sustained earth observation and prediction, and hence sustainable environmental management and socio-economic development. These words are congruent with the Framework Document emerging from the Second Earth Observation Summit, Tokyo, April 2004 (http://earthobservations.org/docs/Framework%20Doc%20Final.pdf), which calls for a ‘Global Earth Observation System of Systems’ to serve nine areas of socio-economic benefit. A TCOS is a contributor to such a system with relevance to at least six of these areas:
• Understanding climate, and assessing and mitigating climate change impacts;
• Improving global water resource management and understanding of the water cycle;
• Improving weather information and prediction;
• Monitoring and managing inland ecosystems, including forests, and land use change;
• Supporting sustainable agriculture and combating desertification;
• Understanding, monitoring and preventing loss of biodiversity.

To make these contributions effectively, a TCOS must have a number of attributes (see also Running et al., 1999; Cihlar et al., 2002a; Ciais et al., 2004). First, scientific credibility is needed to maintain methodological and observational rigour, and to include procedures for estimating uncertainties or confidence limits. Second, consistency with global budgets is necessary to respect constraints from global-scale carbon and related budgets incorporating terrestrial, atmospheric and oceanic pools and anthropogenic sources such as fossil fuel burning. Third, sufficient spatial resolution is necessary to resolve spatial variations in patterns of land use (typically tens of metres, consistent with high-resolution remote sensing). Fourth, enough temporal resolution is needed to resolve the influence of weather, interannual climate fluctuations and long-term climate change on carbon and related cycles. Fifth, the system needs to encompass a broad range of entities, eventually including CO2, CH4, CO, volatile organic carbons (VOCs) and aerosol black carbon. Of these, the highest priority is CO2. Water is also a high priority because of its importance in modulating other terrestrial GHG fluxes. Sixth, a sufficient range of processes must be encompassed. A high priority is resolution of net land–air fluxes of greenhouse gases in which all terrestrial sources and sinks are lumped together. However, there is an equally high demand for identification of the terms contributing to the net fluxes, for example to partition a net flux between vegetation and soil storage changes. Finally, quantification of uncertainty is required.

The ‘demand side’ of the uncertainty issue is: what level of uncertainty is acceptable for a TCOS to offer useful information? The answer is not simple and depends on the application, for example, from the areas mentioned above. This paper does not attempt to answer the demand-side question, but rather concentrates on the ‘supply side’ of uncertainty: that is, how uncertainty can be determined in a TCOS based on model–data synthesis and multiple observation sources, each with its own specified uncertainty.

Model–data synthesis: methods

In this section, we survey a range of model–data synthesis methods potentially applicable in a TCOS. More detail and further references can be found in a growing number of excellent sources, for instance Tarantola (1987) and Evans & Stark (2002) for high-level treatments of the general statistical problem of inverse estimation, Grewal & Andrews (1993) and Drécourt (2003) for introductions to the Kalman Filter, Reichle et al. (2002) for hydrological applications with an emphasis on the Kalman Filter and Enting (2002) and Kasibhatla et al. (2000) for applications of a range of methods to biogeochemical cycles.

Overview

The central problem is: using appropriate observations and models, we must determine the spatial distributions and temporal evolutions of the terrestrial stores and fluxes of carbon and related entities (water, nutrients, energy) across the earth. Important fluxes include land–air exchanges (atmospheric sources and sinks), exchanges with rivers and groundwater, and exchanges between terrestrial pools such as biomass and soil. We also need to determine the main processes influencing the fluxes, including those under human management. No single model or set of observations can supply this amount of information – hence the need for a synthesis approach.

The task of combining observations and models can be carried out in many ways, encompassed by the umbrella terms ‘model–data synthesis’ or ‘model–data fusion’. The general principle is to find an ‘optimal match’ between observations and model by varying one or more ‘properties’ of the model. (Words in quotes have specific meanings defined below). The optimal match is a choice of model properties, which minimizes the ‘distance’ between the model representations of a system and what we know about the real biophysical system from observational and prior ‘data’. At this high level of generality, model–data synthesis encompasses both ‘parameter estimation’ and ‘data assimilation’. All applications rest on three foundations: a model of the system, data about the system, and a synthesis approach.

Model. For a TCOS, the model is a terrestrial biosphere model describing the evolving stores and fluxes of carbon, water, energy and related entities. This dynamic model has the form

\[
\frac{dx}{dt} = f(x, u, p) + \text{noise}
\quad\text{or}\quad
x^{n+1} = \varphi(x^{n}, u^{n}, p) + \text{noise} = x^{n} + \Delta t\, f(x^{n}, u^{n}, p) + \text{noise},
\tag{1}
\]
where $x$ is a vector of state variables (such as stores of carbon, water and related entities, or store attributes such as age class distributions); $f$ is a vector of rates of change (net fluxes where components of $x$ are stores); $\varphi$ is the discrete analogue for $f$; $u$ is a set of externally specified time-dependent forcing variables (such as meteorological variables and soil properties) and $p$ is a set of time-independent model parameters (such as rate constants and partition ratios). In the discrete formulation, time steps are denoted by superscripts. The noise terms account for both imperfections in model formulation and stochastic variability in forcings ($u$) or parameters ($p$). Once the model function $f(x, u, p)$ or $\varphi(x^{n}, u^{n}, p)$ is specified, then the system evolution $x(t)$ can be determined by integrating Eqn (1) in time (with zero noise), from initial conditions $x(0)$, with specified external forcing $u(t)$ and parameters $p$.
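As an illustration of how the discrete form of Eqn (1) is integrated, the following minimal sketch steps a toy model forward in time. The two-pool structure (biomass and soil carbon), the parameter values and the function names `f`, `phi` and `integrate` are assumptions made for this example only; they are not the terrestrial biosphere model of any particular TCOS.

```python
import numpy as np


def f(x, u, p):
    """Rates of change for a hypothetical two-pool carbon model.

    x = [biomass, soil carbon]   (state variables)
    u = net primary production   (external forcing)
    p = (biomass turnover rate, soil decay rate, litter fraction entering soil)
    """
    k_b, k_s, a = p
    litterfall = k_b * x[0]
    return np.array([u - litterfall, a * litterfall - k_s * x[1]])


def phi(x, u, p, dt):
    """Discrete analogue of f: one forward-Euler step, x^{n+1} = x^n + dt * f."""
    return x + dt * f(x, u, p)


def integrate(x0, forcing, p, dt, noise_std=0.0, rng=None):
    """Step Eqn (1) through a forcing series; returns an array of states x^0..x^N."""
    if rng is None:
        rng = np.random.default_rng(0)
    traj = [np.asarray(x0, dtype=float)]
    for u_n in forcing:
        # Additive model noise; zero noise gives the deterministic trajectory.
        noise = noise_std * rng.standard_normal(len(x0)) if noise_std > 0 else 0.0
        traj.append(phi(traj[-1], u_n, p, dt) + noise)
    return np.array(traj)


p = (0.1, 0.02, 0.5)            # illustrative parameter values (assumed)
forcing = np.full(2000, 1.0)    # constant forcing u^n = 1 for every step
traj = integrate([5.0, 10.0], forcing, p, dt=1.0)
print(traj[-1])                 # approaches the steady state [10, 25]
```

With zero noise the trajectory is fully determined by the initial state, the forcing series and the parameters, which is the deterministic integration described above. Passing `noise_std > 0` gives one stochastic realization of the same model.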
These are generally of two broad kinds: (1)between the data (including observations and priorobservations or measurements of a set of quantitieszestimates)and themodel.Thisprocess needstoprovideand (2)priorestimates formodel quantities (x,u and p).three kinds of output: optimal estimates for the modelBoth include uncertainty,through errors and noise. Inproperties to be adjusted, uncertainty statements aboutthis paper, the term'data'includes both observationsthese estimates, and an assessment of how well theand prior estimates, and incorporates the uncertaintymodel fits the data, given the data uncertainties. In anyinherent in each.synthesis process, there are three basic choices: (1)theThe measured quantities (z) are related to the systemmodel properties to be adjusted or 'target variables', (2)state and external forcing variables by an obseroationthe measure of distance between data and model ormodel of the form'cost function'and (3) the search strategyfor finding the(2)z= h(x,u) +noise,optimum values.Search strategies can be classifiedwhere the operator h specifies the deterministicbroadly into (3a)‘nonsequential'or“batch'strategies inrelationship between the measured quantities and thewhich the data are treated all at once, and (3b)system state.The noise term accounts for both'sequential' strategies in which the data arrive in a'measurement error(instrumental and processingtime sequence and are incorporated into the model-errors in the measurements z), and 'representationdata synthesis step by step. The rest of this sectionerror'(errors in the model representation of z,explores the choices (1),(2),(3a)and (3b).introduced by shortcomings in the observation modelh). In the rare case where we can observe all stateTarget variablesvariables directly,h reduces to the identity operator, soThe target variables are the properties of the model toz=x+ (measurement) noise. 
In time-discrete form, Eqn(2)becomes z"=h(x"u")+noise.Note the inter-be adjusted in the optimization process.They includepretation of the time-step superscripts:x"and u" areany model property considered to be sufficientlysimply the model state and forcings at time step n,uncertain as to benefit from constraint by the data.whereas z" is the set of new observations introduced atModel properties which can be target variables include:time step n, whatever the actual time of its measu-(1) model parameters (p); (2) forcing variables (u"), ifthere is substantial uncertainty about them; (3) initialrement.However, no observations may be used moreconditions on the state variables (x) and (4)time-than once.Examples of potential observations in a TCOSdependent components of the state vector x".Theinclude(1)atmospheric composition (concentrationsinclusion of the state vector x" as a possible targetof CO2 and other gases); (2) remote sensing of terrestrialvariable is for the following reason: in a purelyandatmosphericproperties;(3)fluxesof carbon anddeterministic model the trajectory x" is determined byrelated entities,with supportingprocessobservations,the dynamical model (f or ), the values of p and u",and the initial value x It might seem sufficient,at intensive study sites; (4) vegetation and soil stores ofcarbon from forest and ecological inventories; (5)therefore, to estimate these and allow integration of2005 Blackwell Publishing Ltd, Global Change Biology,11, 378-397
where x is a vector of state variables (such as stores of carbon, water and related entities, or store attributes such as age class distributions); f is a vector of rates of change (net fluxes where components of x are stores); $\varphi$ is the discrete analogue of f; u is a set of externally specified time-dependent forcing variables (such as meteorological variables and soil properties); and p is a set of time-independent model parameters (such as rate constants and partition ratios). In the discrete formulation, time steps are denoted by superscripts. The noise terms account for both imperfections in model formulation and stochastic variability in forcings (u) or parameters (p). Once the model function f(x, u, p) or $\varphi(x^n, u^n, p)$ is specified, the system evolution x(t) can be determined by integrating Eqn (1) in time (with zero noise) from initial conditions x(0), with specified external forcing u(t) and parameters p.

Data. These are generally of two broad kinds: (1) observations or measurements of a set of quantities z and (2) prior estimates for model quantities (x, u and p). Both include uncertainty, through errors and noise. In this paper, the term ‘data’ includes both observations and prior estimates, and incorporates the uncertainty inherent in each.

The measured quantities (z) are related to the system state and external forcing variables by an observation model of the form

$$z = h(x, u) + \mathrm{noise}, \tag{2}$$

where the operator h specifies the deterministic relationship between the measured quantities and the system state. The noise term accounts for both ‘measurement error’ (instrumental and processing errors in the measurements z) and ‘representation error’ (errors in the model representation of z, introduced by shortcomings in the observation model h). In the rare case where we can observe all state variables directly, h reduces to the identity operator, so z = x + (measurement) noise. In time-discrete form, Eqn (2) becomes $z^n = h(x^n, u^n) + \mathrm{noise}$.
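The discrete forms of Eqns (1) and (2) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the one-store model `phi` (a store that gains the forcing u and loses a fraction p per step), the identity observation operator `h`, the Gaussian noise and the function name `simulate`.

```python
import random


def phi(x, u, p):
    """Hypothetical one-store model: the store gains input u and loses p*x per step."""
    return x + u - p * x


def h(x, u):
    """Hypothetical observation operator: the state is observed directly (identity)."""
    return x


def simulate(x0, forcings, p, model_sd=0.0, obs_sd=0.0, seed=0):
    """Step x^{n+1} = phi(x^n, u^n, p) + noise and record z^n = h(x^n, u^n) + noise."""
    rng = random.Random(seed)
    x = x0
    states, observations = [], []
    for u in forcings:
        states.append(x)                                    # x^n
        observations.append(h(x, u) + rng.gauss(0.0, obs_sd))  # z^n
        x = phi(x, u, p) + rng.gauss(0.0, model_sd)          # x^{n+1}
    return states, observations


# Zero-noise integration from x(0) = 0 with constant forcing.
states, observations = simulate(0.0, [10.0, 10.0, 10.0], 0.5)

# Same model with assumed (illustrative) model and observation noise.
noisy_states, noisy_observations = simulate(
    0.0, [10.0] * 5, 0.5, model_sd=0.5, obs_sd=1.0, seed=1
)
```

With both noise terms set to zero the observations coincide with the model trajectory; with nonzero `obs_sd` they scatter about it, which is the situation the synthesis process has to handle.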
Note the interpretation of the time-step superscripts: $x^n$ and $u^n$ are simply the model state and forcings at time step n, whereas $z^n$ is the set of new observations introduced at time step n, whatever the actual time of its measurement. However, no observation may be used more than once.

Examples of potential observations in a TCOS include (1) atmospheric composition (concentrations of CO2 and other gases); (2) remote sensing of terrestrial and atmospheric properties; (3) fluxes of carbon and related entities, with supporting process observations, at intensive study sites; (4) vegetation and soil stores of carbon from forest and ecological inventories; (5) hydrological data on river flows, groundwater, and concentrations of C, N and other entities; (6) soil properties and topography; (7) disturbance records (both contemporary and historical) including land management, land use, land use change and fire; and (8) climate and weather data (precipitation, solar radiation, temperature and humidity). Of these, some (especially the first five) typically provide observational constraints (z), while others provide model drivers (u).

Examples of observation models (Eqn (2)) include radiative transfer models to map modelled surface states into the radiances observed by satellites; atmospheric transport models to transform modelled surface fluxes to measured atmospheric concentrations; and allometric relations to transform modelled biomass to observed tree diameters.

Synthesis. The final requirement is a synthesis process, or a systematic method for finding the optimal match between the data (including observations and prior estimates) and the model. This process needs to provide three kinds of output: optimal estimates for the model properties to be adjusted, uncertainty statements about these estimates, and an assessment of how well the model fits the data, given the data uncertainties.
In any synthesis process, there are three basic choices: (1) the model properties to be adjusted or ‘target variables’, (2) the measure of distance between data and model or ‘cost function’ and (3) the search strategy for finding the optimum values. Search strategies can be classified broadly into (3a) ‘nonsequential’ or ‘batch’ strategies in which the data are treated all at once, and (3b) ‘sequential’ strategies in which the data arrive in a time sequence and are incorporated into the model–data synthesis step by step. The rest of this section explores the choices (1), (2), (3a) and (3b).

Target variables

The target variables are the properties of the model to be adjusted in the optimization process. They include any model property considered to be sufficiently uncertain as to benefit from constraint by the data. Model properties which can be target variables include: (1) model parameters (p); (2) forcing variables ($u^n$), if there is substantial uncertainty about them; (3) initial conditions on the state variables ($x^0$) and (4) time-dependent components of the state vector $x^n$. The inclusion of the state vector $x^n$ as a possible target variable is for the following reason: in a purely deterministic model the trajectory $x^n$ is determined by the dynamical model (f or $\varphi$), the values of p and $u^n$, and the initial value $x^0$. It might seem sufficient, therefore, to estimate these and allow integration of
the model to take care of $x^n$. However, the model itself may not be perfect, as indicated by the noise term in Eqn (1), so there may be advantage in adjusting values of $x^n$ through the model integration.

To maintain generality, we denote the vector of target variables by y. This vector may or may not be a function of time, and will usually be a subset of all model variables ($x^n$, $u^n$, p). Broadly speaking, parameter estimation problems are those where the target variables are restricted to model parameters (p), while data assimilation problems may include any model property as a target variable, usually with an emphasis on state variables ($x^n$).

Cost function

The cost or objective function J (a function of the target variables y) defines the mismatch or distance between the model and the data. It can take a wide range of forms, but must have certain properties (for example, it must be monotonic in the absolute difference between data and model-predicted values). A common choice is the quadratic cost function:

$$J(y) = (z - h(y))^{T} [\mathrm{Cov}\, z]^{-1} (z - h(y)) + (y - \overset{\frown}{y})^{T} [\mathrm{Cov}\, \overset{\frown}{y}]^{-1} (y - \overset{\frown}{y}), \tag{3}$$

where $\overset{\frown}{y}$ is the vector of ‘priors’ (a priori estimates) for the target variables, and $[\mathrm{Cov}\, z]$ and $[\mathrm{Cov}\, \overset{\frown}{y}]$ are covariance matrices for z and $\overset{\frown}{y}$, respectively ($[\mathrm{Cov}\, z]_{mn} = \langle z'_m z'_n \rangle$, with $z'_m = z_m - \langle z_m \rangle$, angle brackets denoting the expectation operator). The first term in Eqn (3) is a sum of the squared distances between measured components of the observation vector (z) and their model predictions (h(y)), while the second is a corresponding sum of distances between target variables and their prior estimates. The matrices $[\mathrm{Cov}\, z]^{-1}$ and $[\mathrm{Cov}\, \overset{\frown}{y}]^{-1}$ represent the weights accorded to the observations and the priors, and thus scale the confidences accorded to each.
Their role can be clarified by considering the simple case in which components $z_m$ of the observation vector z are independent, with variances $\sigma_m^2$; then $[\mathrm{Cov}\, z]^{-1}$ is the diagonal matrix $\mathrm{diag}[1/\sigma_m^2]$ and the squared departures of the measurements ($z_m$) from the predictions ($h_m(y)$) are seen to be weighted by the confidence measure $1/\sigma_m^2$ for each component.

The model–data synthesis problem now becomes: vary y to minimize J(y), subject to the constraint that x(t) must satisfy the dynamic model, Eqn (1). The value of y at the minimum is the a posteriori estimate of y, including information from the observations as well as the priors. We denote it by $\breve{y}$ (so frowns and smiles respectively designate prior and posterior estimates).

Equation (3) defines the generalized least squares cost function minimized by the minimum-variance estimate ($\breve{y}$) for y. For any distribution of the errors in the data (observations z and priors $\overset{\frown}{y}$), this estimate is unbiased, and has the minimum error covariance among all linear (in z), unbiased estimates (Tarantola, 1987). Use of Eqn (3) has another, additional foundation: provided that the probability distributions for data errors are Gaussian, it yields a maximum-likelihood estimate for y, conditional on the data and the model dynamics (Press et al., 1992, p. 652; Todling, 2000). Outside the restriction of Gaussian distributions, $\breve{y}$ as defined by minimizing a quadratic J is not exactly the maximum-likelihood estimate, but it is often not far from it. A quadratic J is widely used even when the data errors are not Gaussian; see Press et al. (1992, p. 690) for discussion. There are alternative cost functions J in which model–measurement differences (z − h(y)) are raised to powers other than 2, the choice in Eqn (3) (Tarantola, 1987; Gershenfeld, 1999).
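The weighting by inverse variances, and the effect of replacing the power 2 by another exponent, can be sketched for the independent-error case. This is an illustrative sketch only: the function names `cost` and `best_constant`, the grid search, the toy data and the choice to scale each residual by its standard deviation before raising it to a general power are all assumptions, not the paper's method. For `power=2` the function reduces to Eqn (3) with diagonal covariance matrices.

```python
def cost(z, predicted, sigma_z, y, y_prior, sigma_y, power=2):
    """Cost J for independent errors.

    Each residual is scaled by its standard deviation and raised to `power`;
    power=2 gives Eqn (3) with diagonal covariances (weights 1/sigma**2).
    """
    obs_term = sum(abs((zm - hm) / sm) ** power
                   for zm, hm, sm in zip(z, predicted, sigma_z))
    prior_term = sum(abs((yk - pk) / sk) ** power
                     for yk, pk, sk in zip(y, y_prior, sigma_y))
    return obs_term + prior_term


def best_constant(z, sigma_z, y_prior, sigma_y, power, grid):
    """Grid search for a single target variable y, with h(y) = y for every observation."""
    def j(y):
        return cost(z, [y] * len(z), sigma_z, [y], [y_prior], [sigma_y], power)
    return min(grid, key=j)


# Toy observations with one outlier; the prior is made deliberately weak.
z = [1.0, 1.0, 1.0, 1.0, 6.0]
sigma_z = [1.0] * len(z)
grid = [i / 100 for i in range(0, 701)]

y_quadratic = best_constant(z, sigma_z, 1.0, 1.0e6, 2, grid)  # pulled towards the outlier
y_absolute = best_constant(z, sigma_z, 1.0, 1.0e6, 1, grid)   # stays with the bulk of the data
```

In this toy case the quadratic cost selects the mean of the observations (2.0) and the power-1 cost selects their median (1.0), which shows the lower sensitivity of the power-1 cost to outliers. Doubling an observation's standard deviation reduces its quadratic contribution by a factor of four.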
For example, in flood event modelling, the absolute maximum error is needed to capture peak flow rates, while for modelling base flow rates, the mean absolute deviation (|z − h(y)| to the power 1) has the desirable property of being less sensitive to outliers than a power 2. Different powers for |z − h(y)| yield maximum-likelihood estimates of $\breve{y}$ under different distributions of the data errors; for example, a power 1 J yields a maximum-likelihood estimate when the data errors follow a two-sided exponential distribution, and a high-power J preferentially weights outliers such as peak flows. Here, we use a power 2 J exclusively.

Search strategies for nonsequential problems

In nonsequential or batch problems, all data are treated simultaneously and the minimization problem is solved only once. A familiar case is least-squares parameter estimation.

Example. Some of the attributes of these problems are demonstrated by considering a simple linear example, which extends the parameter-estimation problem. Although mathematically straightforward, this case finds important application in the atmospheric inversion methods used to estimate trace gas sources from atmospheric composition observations (see ‘Model–data synthesis: Examples’). Here the target variables (y) are a set of surface-air fluxes, averaged over suitable areas; there is no dynamic model relating fluxes at different times and places to each other; and the observation operator (h) is a model of atmospheric transport. From the linearity of the conservation equation for an inert trace gas, it follows that h is linear and can hence be represented by a matrix H