Artificial Intelligence 215 (2014) 55–78
www.elsevier.com/locate/artint

Multicategory large margin classification methods: Hinge losses vs. coherence functions

Zhihua Zhang a,*, Cheng Chen a, Guang Dai a, Wu-Jun Li b, Dit-Yan Yeung c

a Key Laboratory of Shanghai Education Commission for Intelligence Interaction & Cognition Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai, 200240, China
b National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China
c Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

* Corresponding author. E-mail addresses: zhzhang@gmail.com (Z. Zhang), jackchen1990@gmail.com (C. Chen), guang.gdai@gmail.com (G. Dai), liwujun@nju.edu.cn (W.-J. Li), dyyeung@cse.ust.hk (D.-Y. Yeung).

Article history: Received 3 August 2013; Received in revised form 9 May 2014; Accepted 16 June 2014; Available online 20 June 2014.

Keywords: Multiclass margin classification; Fisher consistency; Multicategory hinge losses; Coherence losses; Multicategory boosting algorithm.

http://dx.doi.org/10.1016/j.artint.2014.06.002
0004-3702/© 2014 Elsevier B.V. All rights reserved.

Abstract

Generalization of large margin classification methods from the binary classification setting to the more general multicategory setting is often found to be non-trivial. In this paper, we study large margin classification methods that can be seamlessly applied to both settings, with the binary setting simply as a special case. In particular, we explore the Fisher consistency properties of multicategory majorization losses and present a construction framework for majorization losses of the 0–1 loss. Under this framework, we conduct an in-depth analysis of three widely used multicategory hinge losses. Corresponding to the three hinge losses, we propose three multicategory majorization losses based on a coherence function. The limits of the three coherence losses as the temperature approaches zero are the corresponding hinge losses, and the limits of the minimizers of their expected errors are the minimizers of the expected errors of the corresponding hinge losses. Finally, we develop multicategory large margin classification methods by using a so-called multiclass C-loss.

1. Introduction

Large margin classification methods have become increasingly popular since the advent of the support vector machine (SVM) [4] and boosting [7,8]. Recent developments include the large-margin unified machine of Liu et al. [16] and the flexible assortment machine of Qiao and Zhang [19]. Typically, large margin classification methods approximately solve an otherwise intractable optimization problem defined with the 0–1 loss. These algorithms were originally designed for binary classification problems. Unfortunately, their generalization to the multicategory setting is often found to be non-trivial. The goal of this paper is to solve multicategory classification problems using the same margin principle as that for binary problems.

The conventional SVM based on the hinge loss function possesses a support vector interpretation (or data sparsity) but does not provide uncertainty information (that is, the SVM does not directly estimate the conditional class probability). The non-differentiable hinge loss function also makes it non-trivial to extend the conventional SVM from binary classification
problems to multiclass classification problems under the same margin principle [25,3,26,12,5,13]. Thus, one seemingly natural approach to constructing a classifier for both the binary and multiclass problems is to consider a smooth loss function. For example, regularized logistic regression models based on the negative multinomial log-likelihood function (also called the logit loss) [31,10] are competitive with SVMs. Moreover, it is natural to exploit the logit loss in the development of a multicategory boosting algorithm [9]. Recently, Zhang et al. [30] proposed a smooth loss function, called the coherence function, for developing binary large margin classification methods. The coherence function establishes a bridge between the hinge loss and the logit loss. In this paper, we study the application of the coherence function to the multiclass classification problem.

1.1. Multicategory margin classification methods

We are concerned with an m-class (m > 2) classification problem with a set of training data points $\{(x_i, c_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X} \subset \mathbb{R}^p$ is an input vector and $c_i \in \{1, 2, \ldots, m\}$ is its corresponding class label. We assume that each x belongs to one and only one class. Our goal is to find a classifier $\phi(x): \mathcal{X} \to \{1, \ldots, m\}$.

Let $P_c(x) = \Pr(C = c \mid X = x)$, $c = 1, \ldots, m$, be the class conditional probabilities given x. The expected error at x is then defined by

$\sum_{c=1}^{m} I_{\{\phi(x) \neq c\}} P_c(x)$,

where $I_{\{\#\}}$ is 1 if # is true and 0 otherwise. The empirical error on the training data is thus given by

$\varepsilon = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}}$.

Given that $\varepsilon$ attains its minimum value zero when all training data points are correctly classified, we wish to use $\varepsilon$ as a basis for devising classification methods.

Suppose the classifier is modeled using an m-vector $\mathbf{g}(x) = (g_1(x), \ldots, g_m(x))^T$, where the induced classifier is obtained via maximization in a manner akin to discriminant analysis: $\phi(x) = \mathrm{argmax}_j \{g_j(x)\}$. For simplicity of our analysis, we assume that for a fixed x, each $g_j$ itself lies in a compact set. We also assume that the maximizing argument of $\max_j g_j(x)$ is unique. Of course this excludes the trivial case that $g_j = 0$ for all $j \in \{1, \ldots, m\}$. However, this assumption does not determine $\mathbf{g}$ uniquely; indeed, adding a constant to each component $g_j(x)$ does not change the maximizing argument. To remove this redundancy, it is convenient to impose a sum-to-zero constraint. Thus we define

$\mathcal{G} = \{\mathbf{g}(x) = (g_1(x), \ldots, g_m(x))^T : \sum_{j=1}^{m} g_j(x) = 0\}$

and assume $\mathbf{g} \in \mathcal{G}$ in this paper unless otherwise specified. Zou et al. [33] referred to such a $\mathbf{g}$ as the margin vector. Liu and Shen [15] referred to $\max_j (g_j(x) - g_c(x))$ as the generalized margin of (x, c) with respect to (w.r.t.) $\mathbf{g}$.

Since a margin vector $\mathbf{g}$ induces a classifier, we explore the minimization of $\varepsilon$ w.r.t. $\mathbf{g}$. However, this minimization problem is intractable because $I_{\{\phi(x) \neq c\}}$ is the 0–1 function. A wide variety of margin-based classifiers can therefore be understood as minimizers of a tractable surrogate loss function $\psi_c(\mathbf{g}(x))$ that upper bounds the 0–1 loss $I_{\{\phi(x) \neq c\}}$. The corresponding empirical risk function is given by

$\hat{R}(\mathbf{g}) = \frac{1}{n} \sum_{i=1}^{n} \psi_{c_i}(\mathbf{g}(x_i))$.

If $\alpha$ is a positive constant that does not depend on (x, c), then $\mathrm{argmin}_{\mathbf{g}(x) \in \mathcal{G}} \frac{1}{\alpha} \hat{R}(\mathbf{g})$ is equivalent to $\mathrm{argmin}_{\mathbf{g}(x) \in \mathcal{G}} \hat{R}(\mathbf{g})$.
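As a concrete illustration of the setup above, the following Python sketch (our own illustration; all function names are hypothetical and not from the paper) evaluates the induced classifier $\phi(x) = \mathrm{argmax}_j g_j(x)$, the empirical 0–1 error $\varepsilon$, and the surrogate empirical risk $\hat{R}(\mathbf{g})$ for a generic per-example surrogate loss, here instantiated with a standard multiclass logistic (logit) loss of the kind mentioned earlier.

```python
import numpy as np

def classify(G):
    """Induced classifier: phi(x_i) = argmax_j g_j(x_i) for each row of G (n x m)."""
    return np.argmax(G, axis=1)

def empirical_error(G, c):
    """Empirical 0-1 error: fraction of points with phi(x_i) != c_i."""
    return np.mean(classify(G) != c)

def surrogate_risk(G, c, psi):
    """Empirical surrogate risk (1/n) * sum_i psi(g(x_i), c_i)."""
    return np.mean([psi(g, ci) for g, ci in zip(G, c)])

# Example surrogate: multiclass logistic (logit) loss, psi_c(g) = log sum_j exp(g_j - g_c).
def logit_loss(g, c):
    return np.log(np.sum(np.exp(g - g[c])))

rng = np.random.default_rng(0)
n, m = 200, 3
G = rng.normal(size=(n, m))
G -= G.mean(axis=1, keepdims=True)   # enforce the sum-to-zero constraint (margin vectors)
c = rng.integers(0, m, size=n)
print(empirical_error(G, c), surrogate_risk(G, c, logit_loss))
```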
We thus present the following definition.

Definition 1. A surrogate loss $\psi_c(\mathbf{g}(x))$ is said to be a majorization of $I_{\{\phi(x) \neq c\}}$ w.r.t. (x, c) if $\psi_c(\mathbf{g}(x)) \geq \alpha I_{\{\phi(x) \neq c\}}$, where $\alpha$ is a positive constant that does not depend on (x, c).

In practice, convex majorization functions play an important role in the development of classification algorithms. On the one hand, convexity makes the resulting optimization problems computationally tractable. On the other hand, the resulting classification methods usually have better statistical properties.

Given a majorization function $\psi_c(\mathbf{g}(x))$, the classifier resulting from the minimization of $\hat{R}(\mathbf{g})$ w.r.t. the margin vector $\mathbf{g}$ is called a large margin classifier or a margin-based classification method. In the binary classification setting, a wide variety of classifiers can be understood as minimizers of a majorization loss function of the 0–1 loss.
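To make Definition 1 concrete in the familiar binary case (encode $y \in \{-1, +1\}$ and write the margin as $z = y f(x)$), the short sketch below (ours, purely illustrative) checks numerically that the hinge loss $\max(0, 1-z)$, the exponential loss $e^{-z}$, and the logit loss $\log(1 + e^{-z})$ all dominate $\alpha \cdot I_{\{z \le 0\}}$ for suitable constants $\alpha$ ($\alpha = 1$, $1$, and $\ln 2$, respectively).

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 2001)           # binary margins z = y * f(x)
zero_one = (z <= 0).astype(float)          # 0-1 loss I[z <= 0]

hinge = np.maximum(0.0, 1.0 - z)           # candidate majorization, alpha = 1
expo  = np.exp(-z)                         # candidate majorization, alpha = 1
logit = np.log1p(np.exp(-z))               # candidate majorization, alpha = ln 2

print(np.all(hinge >= zero_one))                         # True
print(np.all(expo  >= zero_one))                         # True
print(np.all(logit >= np.log(2.0) * zero_one - 1e-12))   # True (small tolerance at z = 0)
```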
If such functions satisfy other technical conditions, the resulting classifiers can be shown to be Bayes consistent [1]. It seems reasonable to pursue a similar development in the case of multicategory classification, and indeed such a proposal has been made by Zou et al. [33] (also see [24,23]).

Definition 2. A surrogate function $\psi_c(\mathbf{g}(x))$ is said to be Fisher-consistent w.r.t. a margin vector $\mathbf{g}(x) = (g_1(x), \ldots, g_m(x))^T$ at (x, c) if (i) the risk minimization problem

$\hat{\mathbf{g}}(x) = \mathrm{argmin}_{\mathbf{g}(x) \in \mathcal{G}} \sum_{c=1}^{m} \psi_c(\mathbf{g}(x)) P_c(x)$   (1)

has a unique solution $\hat{\mathbf{g}}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$; and (ii) $\mathrm{argmax}_c \hat{g}_c(x) = \mathrm{argmax}_c P_c(x)$.

Zou et al. [33] considered $\psi_c(\mathbf{g}(x))$ in an independent and identical setting; that is, $\psi_c(\mathbf{g}(x)) = \eta(g_c(x))$, where $\eta$ is some loss function. As we can see, Definition 2 does not require that the function $\psi_c(\mathbf{g}(x))$ depend only on $g_c(x)$. Thus, this definition refines the definition of Zou et al. [33]. The definition is related to the notion of infinite-sample consistency (ISC) of Zhang [28]. ISC says that an exact solution of Problem (1) leads to a Bayes rule. However, it does not require that the solution of Problem (1) be unique. Additionally, Zhang [28] discussed two other settings in particular: the pairwise comparison $\psi_c(\mathbf{g}(x)) = \sum_{j \neq c} \eta(g_c(x) - g_j(x))$ and the constrained comparison $\psi_c(\mathbf{g}(x)) = \sum_{j \neq c} \eta(-g_j(x))$.

In this paper, we are concerned with multicategory classification methods in which binary and multicategory problems are solved following the same principle. One of the principled approaches is due to Lee et al. [13], who proposed a multicategory SVM (MSVM) that treats the m-class problem simultaneously. Moreover, Lee et al. [13] proved that their MSVM satisfies a Fisher consistency condition. Unfortunately, this desirable property does not hold for many other multiclass SVMs (see, e.g., [25,3,26,12]). The multiclass SVM of [5] possesses this property only if there is a dominating class (that is, $\max_j P_j(x) > 1/2$).

Recently, Liu and Shen [15] proposed a so-called multicategory ψ-learning algorithm by using a multicategory ψ-loss, and Wu and Liu [27] devised robust truncated-hinge-loss SVMs. These two algorithms are parallel to the multiclass SVM of Crammer and Singer [5] and enjoy a generalized pairwise comparison setting.

Additionally, Zhu et al. [32] and Saberian and Vasconcelos [21] devised several multiclass boosting algorithms that solve binary and multicategory problems under the same principle. Mukherjee and Schapire [17] created a general framework for studying multiclass boosting, which formalizes the interaction between the boosting algorithm and the weak learner. We note that Gao and Koller [11] applied the multiclass hinge loss of Crammer and Singer [5] to devise a multiclass boosting algorithm. However, this algorithm is cast under an output coding framework.
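Before stating our contributions, it may help to write the loss settings discussed above in computational form. The short Python sketch below is our own illustration (all function names are hypothetical): it encodes the independent-and-identical setting of Zou et al. [33] and the pairwise-comparison and constrained-comparison settings of Zhang [28] for a generic margin loss $\eta$, here instantiated with the hinge loss purely as an example.

```python
import numpy as np

def hinge(z):
    """A generic margin loss eta(z); the hinge loss is used here only as an example."""
    return np.maximum(0.0, 1.0 - z)

def psi_independent(g, c, eta=hinge):
    """Independent-and-identical setting of Zou et al. [33]: psi_c(g) = eta(g_c)."""
    return eta(g[c])

def psi_pairwise(g, c, eta=hinge):
    """Pairwise comparison setting of Zhang [28]: psi_c(g) = sum_{j != c} eta(g_c - g_j)."""
    return sum(eta(g[c] - g[j]) for j in range(len(g)) if j != c)

def psi_constrained(g, c, eta=hinge):
    """Constrained comparison setting of Zhang [28]: psi_c(g) = sum_{j != c} eta(-g_j)."""
    return sum(eta(-g[j]) for j in range(len(g)) if j != c)

g = np.array([0.9, -0.4, -0.5])   # a margin vector (sums to zero)
print(psi_independent(g, 0), psi_pairwise(g, 0), psi_constrained(g, 0))
```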
1.2. Contributions and outline

In this paper, we study the Fisher consistency properties of multicategory surrogate losses. First, assuming that losses are twice differentiable, we present a Fisher consistency property under a more general setting, including the independent and identical, constrained comparison, and generalized pairwise comparison settings. We next propose a framework for constructing majorization functions of the 0–1 loss. This framework provides us with a natural and intuitive perspective for the construction of three extant multicategory hinge losses. Under this framework, we conduct an in-depth analysis of the Fisher consistency properties of these three extant multicategory hinge losses. In particular, we give a sufficient condition under which the multiclass hinge loss used by Vapnik [25], Bredensteiner and Bennett [3], Weston and Watkins [26], and Guermeur [12] satisfies Fisher consistency. Moreover, we constructively derive the minimizers of the expected errors of the multiclass hinge losses of Crammer and Singer [5].

The framework also inspires us to propose a class of multicategory majorization functions based on the coherence function [30]. The coherence function is a smooth and convex majorization of the hinge function. In particular, its limit as the temperature approaches zero gives the hinge loss. Moreover, its relationship with the logit loss is also shown. Zhang et al. [30] originally exploited the coherence function in binary classification problems. We investigate its application in the development of multicategory margin classification methods. Based on the coherence function, we present three multicategory coherence losses which correspond to the three extant multicategory hinge losses. These multicategory coherence losses are infinitely smooth and convex, and they satisfy the Fisher consistency condition. The coherence losses have the advantage over the hinge losses that they provide an estimate of the conditional class probability, and over the multicategory logit loss that their limiting versions at zero temperature are exactly the corresponding multicategory hinge loss functions. Thus they are very appropriate for use in the development of multicategory large margin classification methods, especially boosting algorithms. We propose in this paper a multiclass C learning algorithm and a multiclass GentleBoost algorithm, both based on our multicategory coherence loss functions.
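The exact form of the coherence function is given later in the paper (and in [30]). Purely to illustrate the temperature behavior described above, the sketch below uses a generic temperature-smoothed hinge $V_T(z) = T \log(1 + \exp((1-z)/T))$ as our own stand-in; we do not assert that this is exactly the coherence function of [30], only that it exhibits the hinge-logit bridge and the zero-temperature limit discussed in the text.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def smoothed_hinge(z, T):
    """Temperature-smoothed hinge: T * log(1 + exp((1 - z) / T)), smooth and convex for T > 0."""
    return T * np.logaddexp(0.0, (1.0 - z) / T)

z = np.linspace(-3.0, 3.0, 1201)
for T in (1.0, 0.1, 0.01):
    diff = smoothed_hinge(z, T) - hinge(z)
    print(f"T={T}: min(V_T - hinge) = {diff.min():.2e}, max(V_T - hinge) = {diff.max():.4f}")
# The difference is nonnegative (up to floating point) and shrinks as T -> 0,
# so V_T upper-bounds the hinge loss and converges to it at zero temperature.
# At T = 1, V_1(z) = log(1 + exp(1 - z)) is the logit loss evaluated at z - 1,
# which illustrates the hinge-logit bridge described in the text.
```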
The remainder of this paper is organized as follows. Section 2 gives a general result on Fisher consistency. In Section 3, we discuss the methodology for the construction of multicategory majorization losses and present two majorization losses based on the coherence function. Section 4 develops another multicategory coherence loss that we call the multiclass C-loss. Based on the multiclass C-loss, a multiclass C learning algorithm and a multiclass GentleBoost algorithm are given in Section 5. We conduct an empirical analysis of the multicategory large margin algorithms in Section 6, and conclude our work in Section 7. All proofs are deferred to the appendix.

2. A general result on Fisher consistency

Using the notion and notation given in Section 1.1, we now consider a more general setting than the pairwise comparison. Let $\mathbf{g}^c(x) = (g_1(x) - g_c(x), \ldots, g_{c-1}(x) - g_c(x), g_{c+1}(x) - g_c(x), \ldots, g_m(x) - g_c(x))^T$. We define $\psi_c(\mathbf{g}(x))$ as a function of $\mathbf{g}^c(x)$ (hereafter denoted $f(\mathbf{g}^c(x))$). It is clear that the pairwise comparison $\psi_c(\mathbf{g}(x)) = \sum_{j \neq c} \eta(g_c(x) - g_j(x))$, the multiclass hinge loss of Crammer and Singer [5], the multicategory ψ-loss of Liu and Shen [15], and the truncated hinge loss of Wu and Liu [27] follow this generalized definition. Moreover, for these cases, we note that $f(\mathbf{g}^c)$ is symmetric (a symmetric function of p variables is one whose value at any p-tuple of arguments is the same as its value at any permutation of that p-tuple).

Furthermore, we present a unifying definition of $\boldsymbol{\psi}(\mathbf{g}) = (\psi_1(\mathbf{g}), \ldots, \psi_m(\mathbf{g}))^T : \mathbb{R}^m \to \mathbb{R}^m$, where we ignore the dependency of $\mathbf{g}$ on x. Let $\Psi$ be a set of mappings $\boldsymbol{\psi}(\mathbf{g})$ satisfying the following conditions: (i) for fixed $g_c$, $\psi_c(\mathbf{g})$ is symmetric w.r.t. the remaining arguments, and (ii) $\psi_c(\mathbf{g}) = \psi_j(\mathbf{g}^{jc})$, where $\mathbf{g}^{jc}$ is obtained from $\mathbf{g}$ by exchanging only $g_c$ and $g_j$. Obviously, a mapping $\boldsymbol{\psi}(\mathbf{g})$ defined via the independent and identical setting, the constrained comparison, or the generalized pairwise comparison with symmetric $f$ belongs to $\Psi$. With this notion, we give an important theorem of this paper as follows.

Theorem 3. Let $\boldsymbol{\psi}(\mathbf{g}) \in \Psi$ be a twice differentiable function from $\mathbb{R}^m$ to $\mathbb{R}^m$. Assume that the Hessian matrix of $\psi_c(\mathbf{g})$ w.r.t. $\mathbf{g}$ is conditionally positive definite for $c = 1, \ldots, m$. Then the minimizer $\hat{\mathbf{g}} = (\hat{g}_1, \ldots, \hat{g}_m)^T$ of $\sum_c \psi_c(\mathbf{g}(x)) P_c(x)$ in $\mathcal{G}$ exists and is unique. Furthermore, if $\frac{\partial \psi_c(\mathbf{g})}{\partial g_c} - \frac{\partial \psi_c(\mathbf{g})}{\partial g_j}$, where $j \neq c$, is negative for any $\mathbf{g} \in \mathcal{G}$, then $P_l > P_k$ implies $\hat{g}_l > \hat{g}_k$.

The proof of the theorem is given in Appendix A.1. Note that an m×m real matrix $A$ is said to be conditionally positive definite if $\mathbf{y}^T A \mathbf{y} > 0$ for any nonzero real vector $\mathbf{y} = (y_1, \ldots, y_m)^T$ with $\sum_{j=1}^{m} y_j = 0$. The condition that $\frac{\partial \psi_c(\mathbf{g})}{\partial g_c} - \frac{\partial \psi_c(\mathbf{g})}{\partial g_j} < 0$ on $\mathcal{G}$ for $j \neq c$ is not necessary for Fisher consistency. For example, in the setting $\psi_c(\mathbf{g}(x)) = \eta(g_c(x))$, Zou et al. [33] proved that if $\eta(z)$ is a twice differentiable function with $\eta'(0) < 0$ and $\eta''(z) > 0$ for all z, then $\psi_c(\mathbf{g}(x))$ is Fisher-consistent. In the setting $\psi_c(\mathbf{g}(x)) = \sum_{j \neq c} \eta(-g_j)$, we note that $\sum_c \sum_{j \neq c} \eta(-g_j) P_c = \sum_{c=1}^{m} \eta(-g_c)(1 - P_c)$. Based on the proof of Zou et al. [33], we have that if $\eta(z)$ is a twice differentiable function with $\eta'(0) < 0$ and $\eta''(z) > 0$ for all z, then $\psi_c(\mathbf{g}(x)) = \sum_{j \neq c} \eta(-g_j(x))$ is Fisher-consistent. That is, in these two cases, we can relax the condition that $\frac{\partial \psi_c(\mathbf{g})}{\partial g_c} - \frac{\partial \psi_c(\mathbf{g})}{\partial g_j} < 0$ for any $\mathbf{g} \in \mathcal{G}$ to $\frac{\partial \psi_c(\mathbf{0})}{\partial g_c} - \frac{\partial \psi_c(\mathbf{0})}{\partial g_j} < 0$ for $j \neq c$.
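As a quick numerical illustration of the Fisher consistency discussed above (our own sketch, not from the paper), the code below minimizes the conditional risk $\sum_c \psi_c(\mathbf{g}) P_c(x)$ over the sum-to-zero set $\mathcal{G}$ for the constrained-comparison loss with $\eta(z) = e^{-z}$ (which satisfies $\eta'(0) < 0$ and $\eta''(z) > 0$), and checks that the argmax of the minimizer $\hat{\mathbf{g}}$ coincides with the argmax of $P_c(x)$.

```python
import numpy as np
from scipy.optimize import minimize

P = np.array([0.5, 0.3, 0.2])             # class-conditional probabilities P_c(x) at a fixed x

def psi(g, c):
    """Constrained comparison with eta(z) = exp(-z): psi_c(g) = sum_{j != c} exp(g_j)."""
    return sum(np.exp(g[j]) for j in range(len(g)) if j != c)

def conditional_risk(u):
    g = np.append(u, -u.sum())             # enforce the sum-to-zero constraint
    return sum(psi(g, c) * P[c] for c in range(len(P)))

res = minimize(conditional_risk, x0=np.zeros(2), method="BFGS")
g_hat = np.append(res.x, -res.x.sum())
print(np.round(g_hat, 3))                  # approximately [ 0.269, -0.068, -0.201]
print(np.argmax(g_hat) == np.argmax(P))    # True: the minimizer recovers the Bayes class
```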
We have the following corollary, whose proof is given in Appendix A.2. We will see two concrete cases of this corollary (that is, Theorems 10 and 13).

Corollary 4. Assume $\psi_c(\mathbf{g}) = f(\mathbf{g}^c)$, where $f(\mathbf{z})$ is a symmetric and twice differentiable function from $\mathbb{R}^{m-1}$ to $\mathbb{R}$. If the Hessian matrix of $f(\mathbf{z})$ w.r.t. $\mathbf{z}$ is positive definite, then the minimizer $\hat{\mathbf{g}} = (\hat{g}_1, \ldots, \hat{g}_m)^T$ of $\sum_c \psi_c(\mathbf{g}(x)) P_c(x)$ in $\mathcal{G}$ exists and is unique. Furthermore, if $\frac{\partial \psi_c(\mathbf{g})}{\partial g_c} - \frac{\partial \psi_c(\mathbf{g})}{\partial g_j}$, where $j \neq c$, is negative for any $\mathbf{g}$, then $P_l > P_k$ implies $\hat{g}_l > \hat{g}_k$.

Theorem 3 or Corollary 4 shows that $\psi_c(\mathbf{g})$ admits the ISC of Zhang [28]. Thus, under the conditions in Theorem 3 or Corollary 4, we also have a relationship between the approximate minimization of the risk based on $\psi_c$ and the approximate minimization of the classification error. In particular, if

$\mathbb{E}_X\big[\sum_{c=1}^{m} \psi_c(\hat{\mathbf{g}}(X)) P_c(X)\big] \leq \inf_{\mathbf{g} \in \mathcal{G}} \mathbb{E}_X\big[\sum_{c=1}^{m} \psi_c(\mathbf{g}(X)) P_c(X)\big] + \epsilon_1$

for some $\epsilon_1 > 0$, then there exists an $\epsilon_2 > 0$ such that

$\mathbb{E}_X\big[\sum_{c=1, c \neq \hat{\phi}(X)}^{m} P_c(X)\big] \leq \mathbb{E}_X\big[\sum_{c=1, c \neq \phi^*(X)}^{m} P_c(X)\big] + \epsilon_2$,

where $\hat{\phi}(X) = \mathrm{argmax}_j \{\hat{g}_j(X)\}$, $\phi^*(X) = \mathrm{argmax}_j \{P_j(X)\}$, and $\mathbb{E}_X[\sum_{c=1, c \neq \phi^*(X)}^{m} P_c(X)]$ is the optimal error. This result directly follows from Theorem 3 in Zhang [28].

3. Multicategory majorization losses

Given x and its label c, we let $\mathbf{g}(x)$ be a margin vector at x and the induced classifier be $\phi(x) = \mathrm{argmax}_j g_j(x)$. In the binary case, it is clear that $\phi(x) = c$ if and only if $g_c(x) > 0$, and that $g_c(x) \leq 0$ is a necessary and sufficient condition of
$\phi(x) \neq c$. Thus, we always have $I_{\{g_c(x) \leq 0\}} = I_{\{\phi(x) \neq c\}}$. Furthermore, let $g_1(x) = -g_2(x) = \frac{1}{2} f(x)$ and encode $y = 1$ if $c = 1$ and $y = -1$ if $c = 2$. Then the empirical error is

$\varepsilon = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}} = \frac{1}{n} \sum_{i=1}^{n} I_{\{g_{c_i}(x_i) \leq 0\}} = \frac{1}{n} \sum_{i=1}^{n} I_{\{y_i f(x_i) \leq 0\}}$.

In the multicategory case, $\phi(x) = c$ implies $g_c(x) > 0$, but $\phi(x) \neq c$ does not imply $g_c(x) \leq 0$. We shall see that $g_c(x) \leq 0$ is a sufficient but not necessary condition of $\phi(x) \neq c$. In general, we only have $I_{\{g_c(x) \leq 0\}} \leq I_{\{\phi(x) \neq c\}}$. Although $g_j(x) - g_c(x) > 0$ for some $j \neq c$ is a necessary and sufficient condition of $\phi(x) \neq c$, it in most cases yields an optimization problem which is not easily solved. This is an important reason why it is not trivial to develop multicategory AdaBoost and SVMs with the same principle as binary AdaBoost and SVMs.

3.1. Methodology

Recall that $\sum_{j=1}^{m} g_j(x) = 0$ and there is at least one $j \in \{1, \ldots, m\}$ such that $g_j(x) \neq 0$. If $g_c(x) \leq 0$, then there exists one $l \in \{1, \ldots, m\}$ such that $l \neq c$ and $g_l(x) > 0$. As a result, we have $\phi(x) \neq c$. Therefore, $g_c(x) \leq 0$ implies $\phi(x) \neq c$. Unfortunately, if $\phi(x) \neq c$, $g_c(x) \leq 0$ does not necessarily hold. For example, consider the case that $m = 3$ and $c = 1$, and assume that $\mathbf{g}(x) = (2, 3, -5)^T$. Then we have $\phi(x) = 2 \neq 1$ and $g_1(x) = 2 > 0$. In addition, it is clear that $\phi(x) = c$ implies $g_c(x) > 0$. However, $g_c(x) > 0$ does not imply $\phi(x) = c$.

On the other hand, it is obvious that $\phi(x) = c$ is equivalent to $\phi(x) \neq j$ for all $j \neq c$. In terms of the above discussion, a sufficient condition for $\phi(x) = c$ is that $g_j(x) \leq 0$ for all $j \neq c$. To summarize, we immediately have the following proposition.

Proposition 5. For (x, c), let $\mathbf{g}(x)$ be a margin vector at x and the induced classifier be $\phi(x) = \mathrm{argmax}_j g_j(x)$. Then

(a) $I_{\{g_c(x) \leq 0\}} \leq I_{\{\phi(x) \neq c\}} = I_{\{\cup_{j \neq c} [g_j(x) - g_c(x) > 0]\}} \leq I_{\{\cup_{j \neq c} [g_j(x) > 0]\}}$;
(b) $I_{\{\cap_{j \neq c} [g_j(x) \leq 0]\}} \leq I_{\{\cap_{j \neq c} [g_j(x) - g_c(x) \leq 0]\}} = I_{\{\phi(x) = c\}} \leq I_{\{g_c(x) > 0\}}$;
(c) $I_{\{\cup_{j \neq c} [g_j(x) > 0]\}} \leq \sum_{j \neq c} I_{\{g_j(x) > 0\}}$;
(d) $I_{\{\cup_{j \neq c} [g_j(x) - g_c(x) > 0]\}} \leq \sum_{j \neq c} I_{\{g_j(x) - g_c(x) > 0\}}$.

Proposition 5 shows that $g_c(x) \leq 0$ is a sufficient condition of $\phi(x) \neq c$, while $g_j(x) > 0$ for some $j \neq c$ is a necessary condition. The following proposition shows that these conditions become sufficient and necessary when $\mathbf{g}$ has one and only one positive element.

Proposition 6. Under the conditions in Proposition 5, the relationship

$I_{\{g_c(x) \leq 0\}} = I_{\{\phi(x) \neq c\}} = I_{\{\cup_{j \neq c} [g_j(x) - g_c(x) > 0]\}} = I_{\{\cup_{j \neq c} [g_j(x) > 0]\}} = \sum_{j \neq c} I_{\{g_j(x) > 0\}}$

holds if and only if the margin vector $\mathbf{g}(x)$ has only one positive element.

In the binary case, this relationship always holds because $g_1(x) = -g_2(x)$. Recently, Zou et al. [33] derived multicategory boosting algorithms using $\exp(-g_c(x))$. In their discrete boosting algorithm, the margin vector $\mathbf{g}(x)$ is modeled as an m-vector function with one and only one positive element. In this case, $I_{\{g_c(x) \leq 0\}}$ is equal to $I_{\{\phi(x) \neq c\}}$. Consequently, $\exp(-g_c(x))$ is a majorization of $I_{\{\phi(x) \neq c\}}$ because $\exp(-g_c(x))$ is an upper bound of $I_{\{g_c(x) \leq 0\}}$. Therefore, this discrete AdaBoost algorithm still approximates the original empirical 0–1 loss function. In the general case, however, Proposition 6 implies that $\exp(-g_c(x))$ is not a majorization of $I_{\{\phi(x) \neq c\}}$.
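The inequalities in Propositions 5 and 6 are easy to check numerically. The sketch below (ours, purely illustrative; names are hypothetical) draws random sum-to-zero margin vectors and verifies chain (a) of Proposition 5, and then confirms the equality chain of Proposition 6 on a margin vector with exactly one positive element.

```python
import numpy as np

rng = np.random.default_rng(1)
m, c = 4, 0

def indicators(g, c):
    """The quantities appearing in Proposition 5(a) and Proposition 6."""
    phi = np.argmax(g)
    others = np.delete(g, c)
    return (
        float(g[c] <= 0),                   # I[g_c <= 0]
        float(phi != c),                    # I[phi(x) != c]
        float(np.any(others - g[c] > 0)),   # I[union_{j != c} {g_j - g_c > 0}]
        float(np.any(others > 0)),          # I[union_{j != c} {g_j > 0}]
        float(np.sum(others > 0)),          # sum_{j != c} I[g_j > 0]
    )

# Proposition 5(a): I[g_c<=0] <= I[phi!=c] = I[union g_j-g_c>0] <= I[union g_j>0] <= sum I[g_j>0]
for _ in range(10000):
    g = rng.normal(size=m)
    g -= g.mean()                           # sum-to-zero margin vector
    a, b, u1, u2, s = indicators(g, c)
    assert a <= b == u1 <= u2 <= s

# Proposition 6: all five quantities coincide when g has exactly one positive element.
g = np.array([-1.0, 3.0, -1.5, -0.5])       # one positive element, sums to zero
print(indicators(g, c))                     # (1.0, 1.0, 1.0, 1.0, 1.0)
```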
3.2. Approaches

Proposition 5 provides us with approaches for constructing majorization functions of the 0–1 loss function $I_{\{\phi(x) \neq c\}}$. Clearly, $\sum_{j \neq c} I_{\{g_j(x) > 0\}}$ and $\sum_{j \neq c} I_{\{g_j(x) - g_c(x) > 0\}}$ are separable, so they are more tractable than $I_{\{\cup_{j \neq c} [g_j(x) > 0]\}}$ and $I_{\{\cup_{j \neq c} [g_j(x) - g_c(x) > 0]\}}$, respectively. Thus, $\sum_{j \neq c} I_{\{g_j(x) > 0\}}$ and $\sum_{j \neq c} I_{\{g_j(x) - g_c(x) > 0\}}$ are widely employed in practical applications.

In particular, suppose $\eta(g_j(x))$ upper bounds $I_{\{g_j(x) \leq 0\}}$; that is, $\eta(g_j(x)) \geq I_{\{g_j(x) \leq 0\}}$. Note that $\eta(g_j(x)) \geq I_{\{g_j(x) \leq 0\}}$ if and only if $\eta(-g_j(x)) \geq I_{\{g_j(x) \geq 0\}}$. Thus $\eta(-g_j(x))$ upper bounds $I_{\{g_j(x) \geq 0\}}$, and hence $\eta(g_j(x) - g_l(x))$ upper bounds $I_{\{g_l(x) - g_j(x) > 0\}}$. It then follows from Proposition 5 that $\sum_{j \neq c} \eta(-g_j(x))$ and $\sum_{j \neq c} \eta(g_c(x) - g_j(x))$ are majorizations of $I_{\{\phi(x) \neq c\}}$. Consequently, we can define two classes of majorizations for $I_{\{\phi(x) \neq c\}}$. The first one is