Variance Reduction with Monte Carlo Estimates of Error Rates in Multivariate Classification

C. Weihs*
Fachbereich Statistik
Universität Dortmund

G. Calzolari
Dipartimento di Statistica
Università degli Studi di Firenze

M. C. Röhl
Königstein/Ts.

August 1999

Abstract

In this paper, control variates are proposed to speed up Monte Carlo simulations to estimate expected error rates in multivariate classification.

KEY WORDS: classification, control variates, error rate, Monte Carlo simulation, variance reduction

*D-44221 Dortmund, Germany, Tel. ++49-231-755-4363, e-mail: weihs@amadeus.statistik.uni-dortmund.de

1 Introduction

The aim of this paper is to speed up Monte Carlo simulations applied to multivariate classification. The most interesting performance measure in classification is the misclassification error.

In the case of given group densities, there are two possibilities to calculate the error rate: numerical integration or Monte Carlo simulation, the latter being the only feasible method in higher dimensions. In this paper we focus on the Monte Carlo error estimate. This approach suffers from the variability of the error rates, because the estimated error rate is itself a random variable. Therefore, every principle to reduce this variance is welcome. In the literature various variance reduction techniques have been proposed, among them antithetic variables and control variates (see, e.g., [1]). Here,
we will concentrate on control variates and demonstrate their variance reduction potential in our special problem.

The paper is organized as follows: In section 2 we give a brief introduction to multivariate classification. In section 3 we propose two different control variates, which are studied and compared in section 4 by means of some examples. The paper closes with a conclusion in section 5.

2 Multivariate Classification

Classification deals with the allocation of objects to g predetermined groups 1, 2, ..., g, say. The aim is to minimize the misclassification error (rate) over all possible future allocations, given the group densities p_i(x) (i = 1, 2, ..., g). The minimal error rate is the so-called Bayes error.

We measure d features (variables) of the objects that we consider important for discriminating between the objects. These can be continuous features (GNP, consumption, etc.) or discrete ones (number of firms, number of inhabitants, etc.).

Once the group densities are specified, in order to minimize the error rate we allocate an object with feature vector x to group i if

\[ p_i(x) > p_j(x) \quad \text{for all } j \neq i. \tag{1} \]

Classification methods often assume the group densities p_i(x) to be normal. Then there are at least two modelling possibilities (see, e.g., [2]): estimate the same covariance matrix for all groups (LDA, linear discriminant analysis) or estimate a different covariance matrix for each group (QDA, quadratic discriminant analysis). Of course, both methods also estimate different mean vectors for each group. In this paper we take QDA as the adequate, and thus standard, classification procedure (a code sketch of the corresponding allocation rule is given at the end of this section).

Often we additionally want to reduce the dimension from d to d' = 1 or 2 to enhance human perception (dimension reduction). The construction of a d'-space with minimal error rate, given the group densities p_i(x) in d-space, can be done by modern optimization techniques, for example Simulated Annealing [3]. In each optimization step, a projection space is proposed. Then we determine the group densities (either estimated by means of the projected data or directly derived from the projected densities of the original space) [4], and calculate the error rate in the projection space.
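To make rule (1) concrete, the following minimal sketch allocates a feature vector to the group with the largest normal density, with one mean vector and one covariance matrix per group as in QDA. The function name and the two-group example values are hypothetical illustrations, and SciPy's multivariate_normal is assumed to be available.

```python
import numpy as np
from scipy.stats import multivariate_normal

def allocate(x, means, covs):
    """Allocate feature vector x to the group with the largest
    normal density p_i(x), following rule (1)."""
    densities = [multivariate_normal.pdf(x, mean=m, cov=S)
                 for m, S in zip(means, covs)]
    return int(np.argmax(densities))  # group index i with p_i(x) maximal

# Hypothetical two-group example in d = 2 dimensions (QDA style:
# a separate covariance matrix for each group).
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
print(allocate(np.array([1.0, 0.5]), means, covs))
```

If all groups share one covariance matrix, the same rule reduces to LDA, which is exactly the simplification exploited by the first control variate in section 3.2.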
In this paper we suppose that the projection space is fixed, so that we already have the group densities available. Of course, the following approach can also be applied during optimization, at each optimization step.

3 Variance Reduction by Control Variates

3.1 General Ideas

What we have to calculate is the error rate given the group densities. In one dimension this can easily be done by numerical integration, because we only have to find the intersection points of the different group densities (determined by p_i(x) = p_j(x)) and then calculate integrals like

\[ \int_a^b p_i(x)\,dx, \quad a, b \in \mathbb{R}, \tag{2} \]

where p_i(x) denotes an arbitrary known group density; a short numerical sketch of such an integration is given at the end of this passage.

But in two or more dimensions the borderlines between the group densities no longer have such simple shapes, even when we assume equal group covariance matrices. Therefore, integration in two or more dimensions can only be done by means of a grid.

Another possibility to calculate the error rate is Monte Carlo simulation. We generate random realizations from the group densities and allocate them according to our classification rule (1). This approach suffers from the variability of the error rates, because the estimated error rate is itself a random variable.

In order to reduce the Monte Carlo variance of the error rate we introduce control variates (cv). The object of interest is the error rate error. We write this in a more complicated but helpful way as

\[ error = error_{cv} + (error - error_{cv}) \tag{3} \]

with a new random variable error_cv. We want to compute the expectation of these error rates:

\[ E(error) = E(error_{cv}) + E(error - error_{cv}). \tag{4} \]

The idea behind control variates is to choose a random variable error_cv such that E(error_cv) can be calculated exactly (no variance) and error and error_cv are positively correlated.
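As an illustration of the one-dimensional case mentioned above, the following sketch computes the error rate of two univariate normal groups with equal priors (as implicit in rule (1)) by integrating the pointwise minimum of the two densities, which is equivalent to splitting the integral (2) at the intersection points. The function name, the integration bounds, and the example parameters are hypothetical; SciPy's quad and norm are assumed.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def bayes_error_1d(mu1, sd1, mu2, sd2):
    """Bayes error for two univariate normal groups with equal priors:
    0.5 * integral of min(p1(x), p2(x)) dx.  Integrating the pointwise
    minimum equals splitting the integral (2) at the intersection
    points p1(x) = p2(x)."""
    f = lambda x: min(norm.pdf(x, mu1, sd1), norm.pdf(x, mu2, sd2))
    lo = min(mu1 - 10 * sd1, mu2 - 10 * sd2)  # effectively -infinity
    hi = max(mu1 + 10 * sd1, mu2 + 10 * sd2)  # effectively +infinity
    value, _ = quad(f, lo, hi, limit=200)
    return 0.5 * value

# Hypothetical example: two unit-variance groups two means apart.
print(bayes_error_1d(0.0, 1.0, 2.0, 1.0))  # approx. norm.cdf(-1) = 0.1587
```

Essentially this kind of one-dimensional integration is also what the second control variate of section 3.2 needs for the exact computation of E(error_cv).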
So a sensible way of estimating E(error) would be

\[ \hat{E}_{cv}(error) = E(error_{cv}) + \hat{E}(error - error_{cv}), \tag{5} \]

where the first term on the right-hand side has no variance and the second term is computed as the sample mean of Monte Carlo replicates. Then the variance of the right-hand side of (5) is

\[ \mathrm{Var}(error - error_{cv})/N, \tag{6} \]

where N is the sample size used to determine error and error_cv, and

\[ \mathrm{Var}(error - error_{cv}) = \mathrm{Var}(error) + \mathrm{Var}(error_{cv}) - 2\,\mathrm{Cov}(error, error_{cv}). \tag{7} \]

Now it becomes clear that a large positive correlation between error and error_cv can reduce the variance compared to the "naive" estimator Ê_MC(error), i.e. the sample mean of Monte Carlo replicates of error, with variance Var(Ê_MC(error)) = Var(error)/N. We can even do better. We can use the equation

\[ E(error) = \alpha\, E(error_{cv}) + E(error - \alpha\, error_{cv}) \tag{8} \]

to select the parameter α that minimizes the variance

\[ \mathrm{Var}(error - \alpha\, error_{cv}), \tag{9} \]

leading to

\[ \alpha = \frac{\mathrm{Cov}(error, error_{cv})}{\mathrm{Var}(error_{cv})}, \tag{10} \]

which is almost equal to the correlation coefficient ρ when Var(error) ≈ Var(error_cv) holds. The final result is then

\[ \min_{\alpha} \mathrm{Var}(error - \alpha\, error_{cv}) = (1 - \rho^2)\,\mathrm{Var}(error), \tag{11} \]

i.e. there can always be a gain when ρ ≠ 0.
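The following minimal sketch combines the estimator (5) with the variance-optimal coefficient (10), assuming W paired Monte Carlo replicates of error and error_cv are already available together with the exactly computed E(error_cv); the function name and the simulated replicates below are hypothetical placeholders.

```python
import numpy as np

def cv_estimate(err, err_cv, exact_cv):
    """Control variate estimate of E(error) following (8) and (10):
    alpha is estimated from the W paired replicates, and the exactly
    known expectation exact_cv = E(error_cv) replaces its noisy mean."""
    err, err_cv = np.asarray(err), np.asarray(err_cv)
    alpha = np.cov(err, err_cv, ddof=1)[0, 1] / np.var(err_cv, ddof=1)  # eq. (10)
    estimate = alpha * exact_cv + np.mean(err - alpha * err_cv)         # eq. (8)
    rho = np.corrcoef(err, err_cv)[0, 1]
    reduction = 1.0 - rho**2  # variance factor vs. naive MC, eq. (11)
    return estimate, reduction

# Hypothetical paired replicates of QDA and control-variate error rates.
rng = np.random.default_rng(0)
err_cv = 0.15 + 0.02 * rng.standard_normal(100)
err = err_cv + 0.01 * rng.standard_normal(100)  # strongly correlated
print(cv_estimate(err, err_cv, exact_cv=0.15))
```

Setting alpha = 1 recovers the plain difference estimator (5).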
Considering the above arguments, what we look for as a control variate procedure is any classification method which gives results as highly correlated with QDA as possible, and for which the exact expected error rate is easily computable. Moreover, one should avoid control variates for which the additional computational effort is so high that the overall computation time increases even in the case of variance reduction.

3.2 Two Specific Control Variates

What is, then, a suitable control variate in our context? We will discuss two possibilities. In both cases the control variate procedure assumes a somewhat simplified problem situation to be true in order to simplify the Monte Carlo procedure. In the first procedure the covariance matrices of the different groups are assumed to be identical. In the second procedure the possibly high-dimensional problem is optimally projected to one dimension. Note that we assumed problems with normal group densities and individual covariance matrices; thus, QDA was assumed to be the standard classification method.

1. The first idea is to utilize the error rate computed by LDA as error_cv, based on the assumption of equal covariance matrices for all groups. The error rate error is calculated by QDA from N random realizations drawn from the group densities. To get Ê_MC(error) we generate W such error rates and average; we therefore use N · W random vectors. Now we take the same random vectors and apply LDA with the same, so-called pooled, covariance matrix for all groups to calculate error rates error_cv. If Σ_i is the assumed covariance matrix for group i, then

\[ \Sigma_{pooled} = \frac{\sum_{i=1}^{g} (N_i - 1)\,\Sigma_i}{N - g} \]

is the pooled covariance matrix, where N_i is the number of realizations in group i. The differences of the W corresponding estimates error and error_cv are used to calculate Ê(error − error_cv). Finally, we calculate E(error_cv) exactly (so that it has no variance) by numerical integration based on the densities with pooled covariance matrices. We now have all the ingredients we need for an efficiency comparison with the naive Monte Carlo estimator: the variance of the naive estimator is calculated from the sample of size W of the estimated error rates error, and the variance of the control variate estimator from the sample of size W of (error − error_cv). This approach has the drawback that we have to calculate an exact integral in a projection space which might be two-dimensional or of even higher dimension, with rather ugly borderlines.

2. A second possibility is to use another control variate: the error rate of an "optimal" one-dimensional projection. This can be obtained from the largest eigenvalue and the corresponding eigenvector of QDA in the original space, or by direct minimization of the error rate. We do the same as in 1 to obtain Ê_MC(error). But then we project the random vectors onto the optimally discriminating direction, taking into account the different covariance structures, and build the differences of corresponding error estimates to compute Ê(error − error_cv). Now the exact calculation of E(error_cv) is simply a one-dimensional integration with clear-cut intersection points. This speeds up the procedure compared to 1. To construct the optimally discriminating one-dimensional projection we follow an idea in [5], where it was proposed to project on the first eigenvector v_1 of M M^T, where

\[ M = (\mu_g - \mu_1, \ldots, \mu_2 - \mu_1,\; \Sigma_g - \Sigma_1, \ldots, \Sigma_2 - \Sigma_1), \tag{12} \]

the μ_i are the group means, and the Σ_i are (again) the group covariance matrices, i = 1, ..., g. The projected means, variances and feature vectors then have the form μ_i^* = v_1^T μ_i, σ_i^{*2} = v_1^T Σ_i v_1 and x^* = v_1^T x (a code sketch of this construction follows below).

In order to represent adequate control variates, the additional computation time of procedures 1 and 2 has to be small relative to the computation time of naive Monte Carlo. That this is the case, leaving aside the computation of the exact expected error rates, should be clear from the following arguments: naive Monte Carlo estimates the means and the covariance matrices of the groups, and evaluates the corresponding estimated group densities for each observation.
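A minimal sketch of the projection direction (12) used by the second control variate, assuming the group means and covariance matrices are given; the function names and the example values are hypothetical, and NumPy's symmetric eigensolver is used for M M^T.

```python
import numpy as np

def projection_direction(means, covs):
    """First eigenvector v1 of M M^T, where M collects the mean
    differences and covariance differences of equation (12)."""
    cols = [(m - means[0]).reshape(-1, 1) for m in means[1:]]  # mu_i - mu_1
    cols += [S - covs[0] for S in covs[1:]]                    # Sigma_i - Sigma_1
    M = np.hstack(cols)
    eigvals, eigvecs = np.linalg.eigh(M @ M.T)  # eigenvalues in ascending order
    return eigvecs[:, -1]                       # eigenvector of the largest one

def project_params(v1, means, covs):
    """Projected group parameters: mu_i* = v1' mu_i, sigma_i*^2 = v1' Sigma_i v1."""
    return [v1 @ m for m in means], [v1 @ S @ v1 for S in covs]

# Hypothetical two-group example in d = 3 dimensions.
means = [np.zeros(3), np.array([1.0, 2.0, 0.5])]
covs = [np.eye(3), np.diag([2.0, 0.5, 1.0])]
v1 = projection_direction(means, covs)
mus, sig2s = project_params(v1, means, covs)
print(v1, mus, sig2s)  # a feature vector projects the same way: x* = v1 @ x
```

Only one eigendecomposition of a d × d matrix is needed, so the extra cost of constructing the direction is negligible compared to the naive Monte Carlo runs themselves.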