\[
\psi_c\big(g(x)\big) = \sum_{l\neq c} \eta\big(g_c(x) - g_l(x)\big), \tag{2}
\]
while the second one is
\[
\psi_c\big(g(x)\big) = \sum_{l\neq c} \eta\big({-g_l(x)}\big). \tag{3}
\]
This leads us to two approaches for constructing majorizations $\psi_c(g(x))$ of $I_{\{\phi(x)\neq c\}}$; Zhang [28] referred to them as pairwise comparison and constrained comparison. A theoretical analysis of these two classes of majorization functions has also been presented by Zhang [28]; his analysis mainly focused on the consistency of empirical risk minimization and the ISC property of surrogate losses. Our results in Section 3.1 show a direct and intuitive connection between these two approaches and the original 0–1 loss.

3.3. Multicategory hinge losses

Using distinct $\eta(g_j(x))$ $(\geq I_{\{g_j(x)\leq 0\}})$ in the two approaches, we can construct different multicategory losses for large-margin classifiers. For example, let $\eta(g_j(x)) = (1 - g_j(x))_+$, which upper bounds $I_{\{g_j(x)\leq 0\}}$. Then $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$ and $\sum_{j\neq c}(1 + g_j(x))_+$ are candidate majorizations of $I_{\{\phi(x)\neq c\}}$, which yield two multiclass SVM methods.

In the multicategory SVM (MSVM), Lee et al. [13] employed $\sum_{j\neq c}(1 + g_j(x))_+$ as a multicategory hinge loss. Moreover, Lee et al. [13] proved that this multicategory hinge loss is Fisher-consistent. In particular, the minimizer of $\sum_{c=1}^m \sum_{j\neq c}(1 + g_j(x))_+ P_c(x)$ w.r.t. $g \in \mathcal{G}$ is $\hat{g}_l(x) = m-1$ if $l = \operatorname{argmax}_j P_j(x)$ and $\hat{g}_l(x) = -1$ otherwise.

The pairwise comparison $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$ was used by Vapnik [25], Weston and Watkins [26], Bredensteiner and Bennett [3], and Guermeur [12]. Unfortunately, Lee et al. [13], Zhang [28], and Liu [14] showed that solutions of the corresponding optimization problem do not always implement the Bayes decision rule. However, we find that it is still Fisher-consistent under certain conditions. In particular, we have the following theorem (the proof is given in Appendix B.1).

Theorem 7. Let $P_j(x) > 0$ for $j = 1,\ldots,m$, $P_l(x) = \max_j P_j(x)$ and $P_k(x) = \max_{j\neq l} P_j(x)$, and let
\[
\hat{g}(x) = \operatorname*{argmin}_{g(x)\in\mathcal{G}} \sum_{c=1}^m P_c(x) \sum_{j\neq c} \big(1 - g_c(x) + g_j(x)\big)_+.
\]
If $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then $\hat{g}_l(x) = 1 + \hat{g}_k(x) \geq 1 + \hat{g}_j(x)$ for $j \neq l,k$.

This theorem implies that $\hat{g}_l(x) > \hat{g}_j(x)$ for $j \neq l$, so the majorization function $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$ is Fisher-consistent when $P_l(x) > 1/2$ or $P_k(x) < 1/m$. In the case $m = 3$, Liu [14] showed that this majorization function yields Fisher consistency when $P_k < \frac{1}{3}$, while consistency is not always satisfied when $1/2 > P_l > P_k \geq 1/3$. Theorem 7 shows that for any $m \geq 3$ consistency is also satisfied whenever $P_k < \frac{1}{m}$.

As we have seen, $I_{\{\vee_{j\neq c}\, g_j(x) - g_c(x) > 0\}}$ can also be used as a starting point to construct a majorization of $I_{\{\phi(x)\neq c\}}$. Since $I_{\{\vee_{j\neq c}\, g_j(x) - g_c(x) > 0\}} = I_{\{\max_{j\neq c} g_j(x) - g_c(x) > 0\}}$, we call this construction approach the maximum pairwise comparison. In fact, this approach was employed by Crammer and Singer [5], Liu and Shen [15], and Wu and Liu [27]. In particular, Crammer and Singer [5] used the surrogate
\[
\xi_c\big(g(x)\big) = \max_j \big\{g_j(x) + 1 - I_{\{j=c\}}\big\} - g_c(x). \tag{4}
\]
It is easily seen that
\[
I_{\{\max_{j\neq c} g_j(x) - g_c(x) > 0\}} \;\leq\; \max_j\big\{g_j(x) + 1 - I_{\{j=c\}}\big\} - g_c(x) \;\leq\; \sum_{j\neq c}\big(1 + g_j(x) - g_c(x)\big)_+,
\]
which implies that $\xi_c(g(x))$ is a tighter upper bound of $I_{\{\phi(x)\neq c\}}$ than $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$.
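To make the relations among these hinge-type surrogates concrete, here is a minimal Python sketch (ours; the function names and the example margin vector are illustrative choices, not from the paper) that evaluates the pairwise comparison loss, the constrained comparison loss of Lee et al. [13], and the Crammer–Singer surrogate (4) at a point of $\mathcal{G}$:

```python
import numpy as np

def pairwise_hinge(g, c):
    """Pairwise comparison loss: sum_{j != c} (1 - g_c + g_j)_+."""
    return np.maximum(1.0 - g[c] + np.delete(g, c), 0.0).sum()

def constrained_hinge(g, c):
    """Constrained comparison loss of Lee et al. [13]: sum_{j != c} (1 + g_j)_+."""
    return np.maximum(1.0 + np.delete(g, c), 0.0).sum()

def cs_hinge(g, c):
    """Crammer-Singer surrogate (4): max_j {g_j + 1 - I(j=c)} - g_c."""
    scores = g + 1.0
    scores[c] -= 1.0
    return scores.max() - g[c]

g = np.array([0.5, -0.1, -0.4])   # sums to zero, so g lies in G
c = 0
zero_one = float(np.argmax(g) != c)
# The chain displayed above: 0-1 loss <= Crammer-Singer <= pairwise comparison.
print(zero_one, cs_hinge(g, c), pairwise_hinge(g, c))   # 0.0, 0.4, 0.5
# Lee et al.'s loss also majorizes the 0-1 loss on G, but sits outside this chain.
print(constrained_hinge(g, c))                          # 1.5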
Note that Crammer and Singer [5] did not assume $g \in \mathcal{G}$, but Liu and Shen [15] argued that this assumption is also necessary. Zhang [28] showed that $\xi_c(g(x))$ is Fisher-consistent only when $P_l(x) > 1/2$; however, no explicit expression for the minimizer of the expected error in question was given there. Here we present the constructive solution of the corresponding minimization problem in the following theorem (the proof is given in Appendix B.2).

Theorem 8. Consider the following optimization problem:
\[
\hat{g}(x) = \operatorname*{argmin}_{g(x)\in\mathcal{G}} \sum_{c=1}^m \Big[\max_j\big\{g_j(x) + 1 - I_{\{j=c\}}\big\} - g_c(x)\Big] P_c(x). \tag{5}
\]
Assume that $P_l(x) = \max_j P_j(x)$. Then:
(1) If $P_l(x) > 1/2$, then $\hat{g}_j(x) = I_{\{j=l\}} - \frac{1}{m}$ for $j = 1,\ldots,m$;
(2) If $P_l(x) = 1/2$, then $0 \leq \hat{g}_l(x) - \hat{g}_j(x) \leq 1$ and $\hat{g}_j(x) = \hat{g}_c(x)$ for $c, j \neq l$;
(3) If $P_l(x) < 1/2$, then $\hat{g}_c(x) = 0$ for $c = 1,\ldots,m$.
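As a sanity check on case (1), note that the expected loss in (5) is piecewise linear in $g$, so its minimization over $\mathcal{G}$ can be posed as a small linear program. The sketch below is our illustration (the epigraph formulation and the use of scipy.optimize.linprog are our choices, not from the paper); for $P = (0.6, 0.3, 0.1)$ it should recover $\hat{g} = (1 - \frac{1}{3}, -\frac{1}{3}, -\frac{1}{3})$:

```python
import numpy as np
from scipy.optimize import linprog

P = np.array([0.6, 0.3, 0.1])   # class probabilities with P_1 > 1/2
m = len(P)

# Variables z = (g_1..g_m, t_1..t_m), where t_c is an epigraph variable
# for max_j {g_j + 1 - I(j=c)}.  Minimize sum_c P_c * (t_c - g_c).
cost = np.concatenate([-P, P])

# Epigraph inequalities: g_j - t_c <= -(1 - I(j=c)) for all pairs (c, j).
A_ub, b_ub = [], []
for c in range(m):
    for j in range(m):
        row = np.zeros(2 * m)
        row[j] = 1.0        # coefficient of g_j
        row[m + c] = -1.0   # coefficient of t_c
        A_ub.append(row)
        b_ub.append(-(1.0 - float(j == c)))

# The constraint set G: sum_j g_j = 0.
A_eq = np.zeros((1, 2 * m))
A_eq[0, :m] = 1.0

res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=[0.0], bounds=[(None, None)] * (2 * m))
print(np.round(res.x[:m], 4))   # expect roughly (0.6667, -0.3333, -0.3333)
```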
This theorem shows that the majorization function $\max_j\{g_j(x) + 1 - I_{\{j=c\}}\} - g_c(x)$ is Fisher-consistent when $P_l(x) > 1/2$; otherwise, the solution of (5) degenerates to the trivial point. As we have seen from Theorems 7 and 8, $P_l(x) > 1/2$ is a sufficient condition for both $\max_j\{g_j(x) + 1 - I_{\{j=c\}}\} - g_c(x)$ and $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$ to be Fisher-consistent. Moreover, both minimizers satisfy $\hat{g}_l(x) = 1 + \hat{g}_k(x)$, where $k = \operatorname{argmax}_{j\neq l} P_j(x)$. However, as shown in Theorem 7, $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$ still enjoys Fisher consistency when $P_k < \frac{1}{m}$. Thus, the consistency condition for the pairwise comparison hinge loss is weaker than that for the maximum pairwise comparison hinge loss.

3.4. Multicategory coherence losses

To construct a smooth majorization function of $I_{\{\phi(x)\neq c\}}$, we take $\eta(g_c(x))$ to be the coherence function proposed by Zhang et al. [30]. The coherence function is
\[
\eta_T(z) \triangleq T \log\Big[1 + \exp\Big(\frac{1-z}{T}\Big)\Big], \quad T > 0, \tag{6}
\]
where $T$ is called the temperature parameter. Clearly, $\eta_T(z) \geq (1-z)_+ \geq I_{\{z\leq 0\}}$. Moreover, $\lim_{T\to 0} \eta_T(z) = (1-z)_+$. Thus, we directly obtain two majorizations of $I_{\{\phi(x)\neq c\}}$, one based on the constrained comparison method and one based on the pairwise comparison method.

Using the constrained comparison, we obtain a smooth approximation to $\sum_{j\neq c}(1 + g_j(x))_+$ for the MSVM of Lee et al. [13]; that is,
\[
L_T\big(g(x), c\big) \triangleq T \sum_{j\neq c} \log\Big[1 + \exp\Big(\frac{1 + g_j(x)}{T}\Big)\Big].
\]
It is immediate that $L_T(g(x), c) \geq \sum_{j\neq c}(1 + g_j(x))_+$ and $\lim_{T\to 0} L_T(g(x), c) = \sum_{j\neq c}(1 + g_j(x))_+$. Furthermore, we have the following theorem (the proof is given in Appendix B.3).

Theorem 9. Assume that $P_c(x) > 0$ for $c = 1,\ldots,m$. Consider the optimization problem
\[
\min_{g(x)\in\mathcal{G}} \sum_{c=1}^m L_T\big(g(x), c\big) P_c(x) \tag{7}
\]
for a fixed $T > 0$, and let $\hat{g}(x) = (\hat{g}_1(x),\ldots,\hat{g}_m(x))^{\mathrm{T}}$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_l(x) < P_j(x)$, we have $\hat{g}_l(x) < \hat{g}_j(x)$. Furthermore, we have
\[
\lim_{T\to 0} \hat{g}_c(x) = \begin{cases} m-1 & \text{if } c = \operatorname{argmax}_j P_j(x), \\ -1 & \text{otherwise.} \end{cases}
\]
Additionally, having obtained $\hat{g}(x)$, $P_c(x)$ is given by
\[
P_c(x) = 1 - \frac{(m-1)\big(1 + \exp\big({-\frac{1 + \hat{g}_c(x)}{T}}\big)\big)}{m + \sum_{j=1}^m \exp\big({-\frac{1 + \hat{g}_j(x)}{T}}\big)}. \tag{8}
\]
Although there is no explicit expression for $\hat{g}(x)$ in Problem (7), Theorem 9 shows that its limit as $T \to 0$ is equal to the minimizer of $\sum_{c=1}^m \sum_{j\neq c}(1 + g_j(x))_+ P_c(x)$, which was studied by Lee et al. [13].

Based on the pairwise comparison, we have a smooth alternative to the multiclass hinge loss $\sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$, namely
\[
G_T\big(g(x), c\big) \triangleq T \sum_{j\neq c} \log\Big[1 + \exp\Big(\frac{1 + g_j(x) - g_c(x)}{T}\Big)\Big]. \tag{9}
\]
It is also immediate that $G_T(g(x), c) \geq \sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$ and $\lim_{T\to 0} G_T(g(x), c) = \sum_{j\neq c}(1 - g_c(x) + g_j(x))_+$.
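The majorization $\eta_T(z) \geq (1-z)_+$ and the $T \to 0$ limits are easy to verify numerically. The following sketch is ours (the numerically stable log1p evaluation is an implementation choice, not from the paper); it checks the bound for $\eta_T$ and the convergence of $G_T$ to the pairwise hinge loss:

```python
import numpy as np

def eta_T(z, T):
    """Coherence function (6), evaluated stably: T*log(1 + exp((1 - z)/T))."""
    a = (1.0 - z) / T
    return T * (np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a))))

def G_T(g, c, T):
    """Smooth pairwise loss (9): sum over j != c of eta_T(g_c - g_j)."""
    return eta_T(g[c] - np.delete(g, c), T).sum()

z = np.linspace(-2.0, 2.0, 9)
hinge = np.maximum(1.0 - z, 0.0)
for T in (1.0, 0.1, 0.01):
    gap = eta_T(z, T) - hinge
    assert (gap >= 0.0).all()   # eta_T majorizes (1 - z)_+
    print(T, gap.max())         # the gap is at most T*log(2), shrinking with T

g, c = np.array([0.3, 0.1, -0.4]), 0
pair_hinge = np.maximum(1.0 - g[c] + np.delete(g, c), 0.0).sum()
print(G_T(g, c, 0.01), pair_hinge)   # G_T approaches the pairwise hinge as T -> 0
```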
Theorem 10. Assume that $P_c(x) > 0$ for $c = 1,\ldots,m$. Let $P_l(x) = \max_j P_j(x)$ and $P_k(x) = \max_{j\neq l} P_j(x)$. Consider the optimization problem
\[
\min_{g(x)\in\mathcal{G}} \sum_{c=1}^m G_T\big(g(x), c\big) P_c(x)
\]
for a fixed $T > 0$, and let $\hat{g}(x) = (\hat{g}_1(x),\ldots,\hat{g}_m(x))^{\mathrm{T}}$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, we have $\hat{g}_i(x) < \hat{g}_j(x)$. Additionally, if $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then
\[
\lim_{T\to 0} \hat{g}_l(x) = 1 + \lim_{T\to 0} \hat{g}_k(x) \geq 1 + \lim_{T\to 0} \hat{g}_j(x) \quad \text{for } j \neq l,k,
\]
whenever the limits exist.

The proof of Theorem 10 is given in Appendix B.4. We see that the limit of $\hat{g}_l(x)$ at $T = 0$ agrees with that shown in Theorem 7. Unfortunately, based on $G_T(g(x), c)$, it is hard to obtain an explicit expression for the class conditional probabilities $P_c(x)$ in terms of the $\hat{g}_c(x)$.
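Theorem 10's limiting behavior can also be probed numerically by minimizing the expected $G_T$ loss over $\mathcal{G}$ at a small temperature. This sketch is our illustration (the use of scipy's SLSQP solver for the sum-to-zero constraint is our choice); with $P = (0.6, 0.3, 0.1)$ and small $T$ we expect $\hat{g}_l - \hat{g}_k \approx 1$, matching the theorem:

```python
import numpy as np
from scipy.optimize import minimize

def risk(g, P, T):
    """Expected smooth pairwise loss: sum_c P_c * G_T(g, c), evaluated stably."""
    total = 0.0
    for c in range(len(P)):
        a = (1.0 + np.delete(g, c) - g[c]) / T
        total += P[c] * T * (np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))).sum()
    return total

P, T = np.array([0.6, 0.3, 0.1]), 0.05
cons = {"type": "eq", "fun": lambda g: g.sum()}   # the constraint defining G
res = minimize(risk, x0=np.zeros(len(P)), args=(P, T), constraints=[cons])
g_hat = res.x
print(np.round(g_hat, 3))
print(np.round(g_hat[0] - g_hat[1], 3))   # expect a value close to 1
```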
4. Multiclass C-losses

In this section, we present a smooth and Fisher-consistent majorization of the multiclass hinge loss $\xi_c(g(x))$ in (4) using the idea behind the coherence function. We call this new majorization the multiclass C-loss. We will see that this multiclass C-loss bridges the multiclass hinge loss $\xi_c(g(x))$ and the negative multinomial log-likelihood (logit) of the form
\[
\gamma_c\big(g(x)\big) = \log \sum_{j=1}^m \exp\big(g_j(x) - g_c(x)\big) = \log\Big[1 + \sum_{j\neq c} \exp\big(g_j(x) - g_c(x)\big)\Big]. \tag{10}
\]
In the 0–1 loss the misclassification costs are specified as 1. It is natural to set the misclassification costs to a positive constant $u > 0$; this setting will reveal an important connection between the hinge loss and the logit loss. The empirical error on the training data is then
\[
\epsilon = \frac{u}{n} \sum_{i=1}^n I_{\{\phi(x_i)\neq c_i\}}.
\]
In this setting, we can extend the multiclass hinge loss $\xi_c(g(x))$ to
\[
H_u\big(g(x), c\big) = \max_j \big\{g_j(x) + u - uI_{\{j=c\}}\big\} - g_c(x). \tag{11}
\]
It is clear that $H_u(g(x), c) \geq uI_{\{\phi(x)\neq c\}}$. To establish the connection among the multiclass C-loss, the multiclass hinge loss and the logit loss, we employ this setting in the definition of the multiclass C-loss.

We now express $\max_j\{g_j(x) + u - uI_{\{j=c\}}\}$ as $\sum_{j=1}^m \omega_j^c(x)[g_j(x) + u - uI_{\{j=c\}}]$, where
\[
\omega_j^c(x) = \begin{cases} 1 & j = \operatorname{argmax}_l \{g_l(x) + u - uI_{\{l=c\}}\}, \\ 0 & \text{otherwise.} \end{cases}
\]
Motivated by the idea behind deterministic annealing [20], we relax this hard assignment $\omega_j^c(x)$, retaining only $\omega_j^c(x) \geq 0$ and $\sum_{j=1}^m \omega_j^c(x) = 1$. With such soft $\omega_j^c(x)$, we maximize $\sum_{j=1}^m \omega_j^c(x)[g_j(x) + u - uI_{\{j=c\}}] - g_c(x)$ under an entropy penalty; namely,
\[
\max_{\{\omega_j^c(x)\}} \; F \triangleq \sum_{j=1}^m \omega_j^c(x)\big[g_j(x) + u - uI_{\{j=c\}}\big] - g_c(x) - T \sum_{j=1}^m \omega_j^c(x) \log \omega_j^c(x), \tag{12}
\]
where $T > 0$ is also referred to as the temperature. The maximization of $F$ w.r.t. $\omega_j^c(x)$ is straightforward, and by the Karush–Kuhn–Tucker conditions it gives rise to the distribution
\[
\omega_j^c(x) = \frac{\exp\big[\frac{g_j(x) + u - uI_{\{j=c\}}}{T}\big]}{\sum_l \exp\big[\frac{g_l(x) + u - uI_{\{l=c\}}}{T}\big]}. \tag{13}
\]
The corresponding maximum of $F$ is obtained by plugging (13) back into (12):
\[
C_{T,u}\big(g(x), c\big) \triangleq T \log\Big[1 + \sum_{j\neq c} \exp\Big(\frac{u + g_j(x) - g_c(x)}{T}\Big)\Big], \quad T > 0,\; u > 0. \tag{14}
\]
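As a check on this derivation, the sketch below (our code; the example values are illustrative) compares the closed form (14) with the objective (12) evaluated at the weights (13), and illustrates the bridging behavior: $C_{T,u} \to H_u$ as $T \to 0$, while the $u = 0$ boundary case at $T = 1$ recovers the logit loss (10):

```python
import numpy as np

def C_loss(g, c, T, u):
    """Multiclass C-loss (14): T*log(1 + sum_{j != c} exp((u + g_j - g_c)/T))."""
    return T * np.log1p(np.exp((u + np.delete(g, c) - g[c]) / T).sum())

def F_at_weights(g, c, T, u):
    """Objective (12) evaluated at the soft weights (13); should equal (14)."""
    s = g + u
    s[c] -= u                                 # scores g_j + u - u*I(j=c)
    w = np.exp((s - s.max()) / T)
    w /= w.sum()                              # softmax weights, Eq. (13)
    return w @ s - g[c] - T * (w * np.log(w)).sum()

g, c, u = np.array([0.2, 0.7, -0.9]), 0, 1.0
H_u = (g + u - u * (np.arange(3) == c)).max() - g[c]    # hinge loss (11)

print(np.isclose(C_loss(g, c, 1.0, u), F_at_weights(g, c, 1.0, u)))  # True
print(C_loss(g, c, 0.05, u), H_u)        # C_{T,u} approaches H_u as T -> 0
logit = np.log(np.exp(g - g[c]).sum())   # logit loss (10)
print(np.isclose(C_loss(g, c, 1.0, 0.0), logit))   # u = 0 boundary gives gamma_c
```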