strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross H_1 or H_2), and training was repeated, the same separating hyperplane would be found.

3.2. The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0, \qquad \nu = 1, \ldots, d    (17)

\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0    (18)

y_i (x_i \cdot w + b) - 1 \ge 0, \qquad i = 1, \ldots, l    (19)

\alpha_i \ge 0 \quad \forall i    (20)

\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0 \quad \forall i    (21)

The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear. Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for w, b, α to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).

As an immediate application, note that, while w is explicitly determined by the training procedure, the threshold b is not, although it is implicitly determined. However, b is easily found by using the KKT "complementarity" condition, Eq. (21), by choosing any i for which α_i ≠ 0 and computing b (note that it is numerically safer to take the mean value of b resulting from all such equations).

Notice that all we've done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real-world problems will usually require numerical methods. We will have more to say on this later. However, let's first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.
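Before turning to that example, the recovery of b from the complementarity condition is worth making concrete. The following is a minimal NumPy sketch (the array names alpha, y and X, and the tolerance used to detect support vectors, are assumptions of the sketch, not part of the derivation above): it forms w from Eq. (17) and averages b over all points with α_i ≠ 0, as recommended in the text.

```python
import numpy as np

def recover_w_and_b(alpha, y, X, tol=1e-8):
    """Given solved multipliers alpha (shape (l,)), labels y in {+1, -1}
    (shape (l,)) and training points X (shape (l, d)), return (w, b).

    w comes from Eq. (17); b from the KKT complementarity condition, Eq. (21):
    for any alpha_i != 0, y_i (w . x_i + b) = 1, i.e. b = y_i - w . x_i,
    averaged over all such i for numerical stability.
    """
    w = (alpha * y) @ X            # w = sum_i alpha_i y_i x_i
    sv = alpha > tol               # support vectors: alpha_i != 0
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b
```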
3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.

Consider n + 1 symmetrically placed points lying on a sphere S^{n-1} of radius R: more precisely, the points form the vertices of an n-dimensional symmetric simplex. It is convenient to embed the points in R^{n+1} in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the (n + 1)-vector (1, 1, ..., 1) (in this formulation, the points lie on S^{n-1}, they span R^n, and are embedded in R^{n+1}). Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

x_{i\mu} = -(1 - \delta_{i,\mu}) \sqrt{\frac{R^2}{n(n+1)}} + \delta_{i,\mu} \sqrt{\frac{R^2 n}{n+1}}    (22)

where the Kronecker delta, δ_{i,μ}, is defined by δ_{i,μ} = 1 if μ = i, 0 otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

x_1 = \left( \sqrt{\tfrac{2}{3}},\ -\tfrac{1}{\sqrt{6}},\ -\tfrac{1}{\sqrt{6}} \right), \quad
x_2 = \left( -\tfrac{1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}},\ -\tfrac{1}{\sqrt{6}} \right), \quad
x_3 = \left( -\tfrac{1}{\sqrt{6}},\ -\tfrac{1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}} \right)    (23)

One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to arccos(−1/n)):

\|x_i\|^2 = R^2    (24)

x_i \cdot x_j = -R^2 / n, \qquad i \neq j    (25)

or, more succinctly,

\frac{x_i \cdot x_j}{R^2} = \delta_{i,j} - (1 - \delta_{i,j}) \frac{1}{n}.    (26)
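The relations (24)-(26) are easy to confirm numerically. The following small NumPy sketch (the helper name simplex_vertices and the particular choices of n and R are illustrative) builds the n + 1 vertices of Eq. (22) and checks their norms, their pairwise inner products, and the fact that they lie in the hyperplane perpendicular to (1, 1, ..., 1).

```python
import numpy as np

def simplex_vertices(n, R=1.0):
    """The n + 1 points of Eq. (22), returned as rows of an (n+1, n+1) array."""
    off = -R / np.sqrt(n * (n + 1))       # coordinate x_{i mu} for mu != i
    diag = R * np.sqrt(n / (n + 1))       # coordinate x_{i mu} for mu == i
    X = np.full((n + 1, n + 1), off)
    np.fill_diagonal(X, diag)
    return X

n, R = 4, 2.0
X = simplex_vertices(n, R)
G = X @ X.T                                                   # Gram matrix of x_i . x_j
assert np.allclose(np.diag(G), R**2)                          # Eq. (24)
assert np.allclose(G[~np.eye(n + 1, dtype=bool)], -R**2 / n)  # Eq. (25)
assert np.allclose(X.sum(axis=1), 0.0)                        # orthogonal to (1, ..., 1)
print("common pairwise angle (degrees):", np.degrees(np.arccos(-1.0 / n)))
```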
Assigning a class label C ∈ {+1, −1} arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin. Thus we must maximize L_D in Eq. (16), subject to α_i ≥ 0 and also subject to the equality constraint, Eq. (15). Our strategy is to simply solve the problem as though there were no inequality constraints. If the resulting solution does in fact satisfy α_i ≥ 0 ∀i, then we will have found the general solution, since the actual maximum of L_D will then lie in the feasible region, provided the equality constraint, Eq. (15), is also met. In order to impose the equality constraint we introduce an additional Lagrange multiplier λ. Thus we seek to maximize

L_D \equiv \sum_{i=1}^{n+1} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n+1} \alpha_i H_{ij} \alpha_j - \lambda \sum_{i=1}^{n+1} \alpha_i y_i,    (27)

where we have introduced the Hessian

H_{ij} \equiv y_i y_j \, x_i \cdot x_j.    (28)

Setting \partial L_D / \partial \alpha_i = 0 gives

(H\alpha)_i + \lambda y_i = 1 \quad \forall i.    (29)

Now H has a very simple structure: the off-diagonal elements are -y_i y_j R^2/n, and the diagonal elements are R^2. The fact that all the off-diagonal elements differ only by factors of y_i suggests looking for a solution which has the form:

\alpha_i = \left( \frac{1 + y_i}{2} \right) a + \left( \frac{1 - y_i}{2} \right) b    (30)

where a and b are unknowns. Plugging this form into Eq. (29) gives:

\left( \frac{n+1}{n} \right) \left( \frac{a + b}{2} \right) - \frac{y_i p}{n} \left( \frac{a + b}{2} \right) = \frac{1 - \lambda y_i}{R^2}    (31)

where p is defined by

p \equiv \sum_{i=1}^{n+1} y_i.    (32)

Thus

a + b = \frac{2n}{R^2 (n+1)}    (33)

and substituting this into the equality constraint, Eq. (15), to find a, b gives

a = \frac{n}{R^2 (n+1)} \left( 1 - \frac{p}{n+1} \right), \qquad b = \frac{n}{R^2 (n+1)} \left( 1 + \frac{p}{n+1} \right)    (34)

which gives for the solution

\alpha_i = \frac{n}{R^2 (n+1)} \left( 1 - \frac{y_i p}{n+1} \right).    (35)

Also,

(H\alpha)_i = 1 - \frac{y_i p}{n+1}.    (36)

Hence

\|w\|^2 = \sum_{i,j=1}^{n+1} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j = \alpha^T H \alpha = \sum_{i=1}^{n+1} \alpha_i \left( 1 - \frac{y_i p}{n+1} \right) = \sum_{i=1}^{n+1} \alpha_i = \left( \frac{n}{R^2} \right) \left( 1 - \left( \frac{p}{n+1} \right)^2 \right)    (37)

Note that this is one of those cases where the Lagrange multiplier λ can remain undetermined (although determining it is trivial). We have now solved the problem, since all the α_i are clearly positive or zero (in fact the α_i will only be zero if all training points have the same class).
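Although the derivation above is complete, it is reassuring to verify Eqs. (29), (35) and (36) numerically for an arbitrary labeling of the simplex points of Eq. (22). The NumPy sketch below does so; the choices of n, R and the random seed are details of the sketch, and the explicit value λ = p/(n + 1), read off from the y_i-dependent part of Eq. (31), is used here even though the text leaves λ undetermined.

```python
import numpy as np

n, R = 5, 1.5
rng = np.random.default_rng(0)

# simplex points of Eq. (22), one per row
off, diag = -R / np.sqrt(n * (n + 1)), R * np.sqrt(n / (n + 1))
X = np.full((n + 1, n + 1), off)
np.fill_diagonal(X, diag)

y = rng.choice([-1.0, 1.0], size=n + 1)      # an arbitrary labeling
p = y.sum()                                  # Eq. (32)
H = np.outer(y, y) * (X @ X.T)               # Eq. (28)

alpha = n / (R**2 * (n + 1)) * (1 - y * p / (n + 1))   # Eq. (35)
lam = p / (n + 1)                            # lambda, from the y_i terms of Eq. (31)

assert np.allclose(H @ alpha + lam * y, 1.0)           # stationarity, Eq. (29)
assert np.allclose(H @ alpha, 1 - y * p / (n + 1))     # Eq. (36)
assert np.isclose(alpha @ y, 0.0)                      # equality constraint, Eq. (15)
assert np.all(alpha >= -1e-12)                         # all multipliers non-negative
```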
Note that ‖w‖ depends only on the number of positive (negative) polarity points, and not on how the class labels are assigned to the points in Eq. (22). This is clearly not true of w itself, which is given by

w = \frac{n}{R^2 (n+1)} \sum_{i=1}^{n+1} \left( y_i - \frac{p}{n+1} \right) x_i.    (38)

The margin, M = 2/‖w‖, is thus given by

M = \frac{2R}{\sqrt{n \left( 1 - \left( p/(n+1) \right)^2 \right)}}.    (39)

Thus when the number of points n + 1 is even, the minimum margin occurs when p = 0 (equal numbers of positive and negative examples), in which case the margin is M_min = 2R/√n. If n + 1 is odd, the minimum margin occurs when p = ±1, in which case M_min = 2R(n+1)/(n√(n+2)). In both cases, the maximum margin is given by M_max = R(n+1)/n. Thus, for example, for the two-dimensional simplex consisting of three points lying on S^1 (and spanning R^2), and with labeling such that not all three points have the same polarity, the maximum and minimum margin are both 3R/2 (see Figure 12).

Note that the results of this Section amount to an alternative, constructive proof that the VC dimension of oriented separating hyperplanes in R^n is at least n + 1.

3.4. Test Phase

Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between H_1 and H_2 and parallel to them) a given test pattern x lies and assign the corresponding class label, i.e. we take the class of x to be sgn(w · x + b).
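Continuing the numerical sketch above (again in NumPy, with an illustrative choice of n, R and a fixed mixed labeling), one can also check the margin formula, Eq. (39), and carry out the test phase of Section 3.4 by evaluating sgn(w · x + b):

```python
import numpy as np

n, R = 5, 1.5

# simplex points of Eq. (22) and a fixed, mixed labeling (so p = 2 here)
off, diag = -R / np.sqrt(n * (n + 1)), R * np.sqrt(n / (n + 1))
X = np.full((n + 1, n + 1), off)
np.fill_diagonal(X, diag)
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0])
p = y.sum()

alpha = n / (R**2 * (n + 1)) * (1 - y * p / (n + 1))   # Eq. (35)
w = (alpha * y) @ X                                    # equivalently Eq. (38)
b = np.mean(y - X @ w)                                 # Eq. (21): every point is a support vector

margin = 2.0 / np.linalg.norm(w)
assert np.isclose(margin, 2 * R / np.sqrt(n * (1 - (p / (n + 1))**2)))   # Eq. (39)

# test phase (Section 3.4): here the "test" patterns are just the training points
assert np.all(np.sign(X @ w + b) == y)
```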
3.5. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (10) and (11), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables ξ_i, i = 1, ..., l in the constraints (Cortes and Vapnik, 1995), which then become:

x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1    (40)

x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1    (41)

\xi_i \ge 0 \quad \forall i.    (42)

Thus, for an error to occur, the corresponding ξ_i must exceed unity, so Σ_i ξ_i is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from ‖w‖²/2 to ‖w‖²/2 + C (Σ_i ξ_i)^k, where C is a parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to errors. As it stands, this is a convex programming problem for any positive integer k; for k = 2 and k = 1 it is also a quadratic programming problem, and the choice k = 1 has the further advantage that neither the ξ_i, nor their Lagrange multipliers, appear in the Wolfe dual problem, which becomes:

Maximize:

L_D \equiv \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (43)

subject to:

0 \le \alpha_i \le C,    (44)

\sum_i \alpha_i y_i = 0.    (45)

The solution is again given by

w = \sum_{i=1}^{N_S} \alpha_i y_i x_i,    (46)

where N_S is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the α_i now have an upper bound of C. The situation is summarized schematically in Figure 6.

We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

L_P = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left\{ y_i (x_i \cdot w + b) - 1 + \xi_i \right\} - \sum_i \mu_i \xi_i    (47)
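To make the box-constrained dual of Eqs. (43)-(45) concrete, here is a minimal sketch that hands it to a general-purpose solver (SciPy's SLSQP) rather than to one of the specialized methods referred to in Section 5. The toy data, the value C = 1.0, the tolerances, and the recovery of b from points with 0 < α_i < C (which, anticipating the KKT conditions that follow from Eq. (47), sit exactly on the margin) are illustrative assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# small, non-separable toy problem (illustrative data only)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 0.5], [-1.0, -1.0],
              [-2.0, -1.5], [-1.5, -0.5], [1.8, 1.2], [-0.2, 0.1]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0])   # last two labels force slack
l, C = len(y), 1.0

H = np.outer(y, y) * (X @ X.T)          # quadratic term of Eq. (43)

# maximizing L_D is the same as minimizing (1/2) a^T H a - sum_i a_i
res = minimize(
    fun=lambda a: 0.5 * a @ H @ a - a.sum(),
    x0=np.zeros(l),
    jac=lambda a: H @ a - 1.0,
    bounds=[(0.0, C)] * l,                                   # Eq. (44)
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],    # Eq. (45)
    method="SLSQP",
)
alpha = res.x

w = (alpha * y) @ X                              # Eq. (46)
free = (alpha > 1e-6) & (alpha < C - 1e-6)       # support vectors not at the bound C
b = np.mean(y[free] - X[free] @ w) if free.any() else 0.0
print("alpha =", np.round(alpha, 3))
print("w =", np.round(w, 3), " b =", round(float(b), 3))
```

A production implementation would of course use the dedicated quadratic programming methods discussed later; the point here is only to show how Eqs. (43)-(45) translate into a solvable optimization problem.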