strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross H_1 or H_2), and training was repeated, the same separating hyperplane would be found.

3.2. The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0, \qquad \nu = 1, \ldots, d    (17)

\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0    (18)

y_i (x_i \cdot w + b) - 1 \ge 0, \qquad i = 1, \ldots, l    (19)

\alpha_i \ge 0 \quad \forall i    (20)

\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0 \quad \forall i    (21)

The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear. Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for w, b, α to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).

As an immediate application, note that, while w is explicitly determined by the training procedure, the threshold b is not, although it is implicitly determined. However, b is easily found by using the KKT "complementarity" condition, Eq. (21), by choosing any i for which α_i ≠ 0 and computing b (note that it is numerically safer to take the mean value of b resulting from all such equations).

Notice that all we've done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real-world problems will usually require numerical methods. We will have more to say on this later. However, let's first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.
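Before turning to that example, the recovery of b from the complementarity condition is worth making concrete. The following is a minimal NumPy sketch (the array names alpha, y and X, and the tolerance used to detect support vectors, are assumptions of the sketch, not part of the derivation above): it forms w from Eq. (17) and averages b over all points with α_i ≠ 0, as recommended in the text.

```python
import numpy as np

def recover_w_and_b(alpha, y, X, tol=1e-8):
    """Given solved multipliers alpha (shape (l,)), labels y in {+1, -1}
    (shape (l,)) and training points X (shape (l, d)), return (w, b).

    w comes from Eq. (17); b from the KKT complementarity condition, Eq. (21):
    for any alpha_i != 0, y_i (w . x_i + b) = 1, i.e. b = y_i - w . x_i,
    averaged over all such i for numerical stability.
    """
    w = (alpha * y) @ X            # w = sum_i alpha_i y_i x_i
    sv = alpha > tol               # support vectors: alpha_i != 0
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b
```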
3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.

Consider n + 1 symmetrically placed points lying on a sphere S^{n-1} of radius R: more precisely, the points form the vertices of an n-dimensional symmetric simplex. It is convenient to embed the points in R^{n+1} in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the (n + 1)-vector (1, 1, ..., 1) (in this formulation, the points lie on S^{n-1}, they span R^n, and are embedded in R^{n+1}). Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

x_{i\mu} = -(1 - \delta_{i,\mu}) \sqrt{\frac{R^2}{n(n+1)}} + \delta_{i,\mu} \sqrt{\frac{R^2 n}{n+1}}    (22)

where the Kronecker delta, δ_{i,μ}, is defined by δ_{i,μ} = 1 if μ = i, 0 otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

x_1 = \left( \sqrt{\tfrac{2}{3}},\ -\tfrac{1}{\sqrt{6}},\ -\tfrac{1}{\sqrt{6}} \right), \quad
x_2 = \left( -\tfrac{1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}},\ -\tfrac{1}{\sqrt{6}} \right), \quad
x_3 = \left( -\tfrac{1}{\sqrt{6}},\ -\tfrac{1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}} \right)    (23)

One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to arccos(−1/n)):

\|x_i\|^2 = R^2    (24)

x_i \cdot x_j = -R^2 / n, \qquad i \neq j    (25)

or, more succinctly,

\frac{x_i \cdot x_j}{R^2} = \delta_{i,j} - (1 - \delta_{i,j}) \frac{1}{n}.    (26)
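The relations (24)-(26) are easy to confirm numerically. The following small NumPy sketch (the helper name simplex_vertices and the particular choices of n and R are illustrative) builds the n + 1 vertices of Eq. (22) and checks their norms, their pairwise inner products, and the fact that they lie in the hyperplane perpendicular to (1, 1, ..., 1).

```python
import numpy as np

def simplex_vertices(n, R=1.0):
    """The n + 1 points of Eq. (22), returned as rows of an (n+1, n+1) array."""
    off = -R / np.sqrt(n * (n + 1))       # coordinate x_{i mu} for mu != i
    diag = R * np.sqrt(n / (n + 1))       # coordinate x_{i mu} for mu == i
    X = np.full((n + 1, n + 1), off)
    np.fill_diagonal(X, diag)
    return X

n, R = 4, 2.0
X = simplex_vertices(n, R)
G = X @ X.T                                                   # Gram matrix of x_i . x_j
assert np.allclose(np.diag(G), R**2)                          # Eq. (24)
assert np.allclose(G[~np.eye(n + 1, dtype=bool)], -R**2 / n)  # Eq. (25)
assert np.allclose(X.sum(axis=1), 0.0)                        # orthogonal to (1, ..., 1)
print("common pairwise angle (degrees):", np.degrees(np.arccos(-1.0 / n)))
```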
Assigning a class label C ∈ {+1, −1} arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin. Thus we must maximize L_D in Eq. (16), subject to α_i ≥ 0 and also subject to the equality constraint, Eq. (15). Our strategy is to simply solve the problem as though there were no inequality constraints. If the resulting solution does in fact satisfy α_i ≥ 0 ∀i, then we will have found the general solution, since the actual maximum of L_D will then lie in the feasible region, provided the equality constraint, Eq. (15), is also met. In order to impose the equality constraint we introduce an additional Lagrange multiplier λ. Thus we seek to maximize

L_D \equiv \sum_{i=1}^{n+1} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n+1} \alpha_i H_{ij} \alpha_j - \lambda \sum_{i=1}^{n+1} \alpha_i y_i,    (27)

where we have introduced the Hessian

H_{ij} \equiv y_i y_j \, x_i \cdot x_j.    (28)

Setting \partial L_D / \partial \alpha_i = 0 gives

(H\alpha)_i + \lambda y_i = 1 \quad \forall i.    (29)

Now H has a very simple structure: the off-diagonal elements are -y_i y_j R^2/n, and the diagonal elements are R^2. The fact that all the off-diagonal elements differ only by factors of y_i suggests looking for a solution which has the form:

\alpha_i = \left( \frac{1 + y_i}{2} \right) a + \left( \frac{1 - y_i}{2} \right) b    (30)

where a and b are unknowns. Plugging this form into Eq. (29) gives:

\left( \frac{n+1}{n} \right) \left( \frac{a + b}{2} \right) - \frac{y_i p}{n} \left( \frac{a + b}{2} \right) = \frac{1 - \lambda y_i}{R^2}    (31)

where p is defined by

p \equiv \sum_{i=1}^{n+1} y_i.    (32)

Thus

a + b = \frac{2n}{R^2 (n+1)}    (33)

and substituting this into the equality constraint, Eq. (15), to find a, b gives

a = \frac{n}{R^2 (n+1)} \left( 1 - \frac{p}{n+1} \right), \qquad b = \frac{n}{R^2 (n+1)} \left( 1 + \frac{p}{n+1} \right)    (34)

which gives for the solution

\alpha_i = \frac{n}{R^2 (n+1)} \left( 1 - \frac{y_i p}{n+1} \right).    (35)

Also,

(H\alpha)_i = 1 - \frac{y_i p}{n+1}.    (36)

Hence

\|w\|^2 = \sum_{i,j=1}^{n+1} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j = \alpha^T H \alpha = \sum_{i=1}^{n+1} \alpha_i \left( 1 - \frac{y_i p}{n+1} \right) = \sum_{i=1}^{n+1} \alpha_i = \left( \frac{n}{R^2} \right) \left( 1 - \left( \frac{p}{n+1} \right)^2 \right)    (37)

Note that this is one of those cases where the Lagrange multiplier λ can remain undetermined (although determining it is trivial). We have now solved the problem, since all the α_i are clearly positive or zero (in fact the α_i will only be zero if all training points have the same class).
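Although the derivation above is complete, it is reassuring to verify Eqs. (29), (35) and (36) numerically for an arbitrary labeling of the simplex points of Eq. (22). The NumPy sketch below does so; the choices of n, R and the random seed are details of the sketch, and the explicit value λ = p/(n + 1), read off from the y_i-dependent part of Eq. (31), is used here even though the text leaves λ undetermined.

```python
import numpy as np

n, R = 5, 1.5
rng = np.random.default_rng(0)

# simplex points of Eq. (22), one per row
off, diag = -R / np.sqrt(n * (n + 1)), R * np.sqrt(n / (n + 1))
X = np.full((n + 1, n + 1), off)
np.fill_diagonal(X, diag)

y = rng.choice([-1.0, 1.0], size=n + 1)      # an arbitrary labeling
p = y.sum()                                  # Eq. (32)
H = np.outer(y, y) * (X @ X.T)               # Eq. (28)

alpha = n / (R**2 * (n + 1)) * (1 - y * p / (n + 1))   # Eq. (35)
lam = p / (n + 1)                            # lambda, from the y_i terms of Eq. (31)

assert np.allclose(H @ alpha + lam * y, 1.0)           # stationarity, Eq. (29)
assert np.allclose(H @ alpha, 1 - y * p / (n + 1))     # Eq. (36)
assert np.isclose(alpha @ y, 0.0)                      # equality constraint, Eq. (15)
assert np.all(alpha >= -1e-12)                         # all multipliers non-negative
```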
Note that ‖w‖ depends only on the number of positive (negative) polarity points, and not on how the class labels are assigned to the points in Eq. (22). This is clearly not true of w itself, which is given by

w = \frac{n}{R^2 (n+1)} \sum_{i=1}^{n+1} \left( y_i - \frac{p}{n+1} \right) x_i.    (38)

The margin, M = 2/‖w‖, is thus given by

M = \frac{2R}{\sqrt{n \left( 1 - \left( p/(n+1) \right)^2 \right)}}.    (39)

Thus when the number of points n + 1 is even, the minimum margin occurs when p = 0 (equal numbers of positive and negative examples), in which case the margin is M_min = 2R/√n. If n + 1 is odd, the minimum margin occurs when p = ±1, in which case M_min = 2R(n+1)/(n√(n+2)). In both cases, the maximum margin is given by M_max = R(n+1)/n. Thus, for example, for the two-dimensional simplex consisting of three points lying on S^1 (and spanning R^2), and with labeling such that not all three points have the same polarity, the maximum and minimum margin are both 3R/2 (see Figure 12).

Note that the results of this Section amount to an alternative, constructive proof that the VC dimension of oriented separating hyperplanes in R^n is at least n + 1.

3.4. Test Phase

Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between H_1 and H_2 and parallel to them) a given test pattern x lies and assign the corresponding class label, i.e. we take the class of x to be sgn(w · x + b).
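Continuing the numerical sketch above (again in NumPy, with an illustrative choice of n, R and a fixed mixed labeling), one can also check the margin formula, Eq. (39), and carry out the test phase of Section 3.4 by evaluating sgn(w · x + b):

```python
import numpy as np

n, R = 5, 1.5

# simplex points of Eq. (22) and a fixed, mixed labeling (so p = 2 here)
off, diag = -R / np.sqrt(n * (n + 1)), R * np.sqrt(n / (n + 1))
X = np.full((n + 1, n + 1), off)
np.fill_diagonal(X, diag)
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0])
p = y.sum()

alpha = n / (R**2 * (n + 1)) * (1 - y * p / (n + 1))   # Eq. (35)
w = (alpha * y) @ X                                    # equivalently Eq. (38)
b = np.mean(y - X @ w)                                 # Eq. (21): every point is a support vector

margin = 2.0 / np.linalg.norm(w)
assert np.isclose(margin, 2 * R / np.sqrt(n * (1 - (p / (n + 1))**2)))   # Eq. (39)

# test phase (Section 3.4): here the "test" patterns are just the training points
assert np.all(np.sign(X @ w + b) == y)
```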
3.5. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (10) and (11), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables ξ_i, i = 1, ..., l in the constraints (Cortes and Vapnik, 1995), which then become:

x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1    (40)

x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1    (41)

\xi_i \ge 0 \quad \forall i.    (42)

Thus, for an error to occur, the corresponding ξ_i must exceed unity, so Σ_i ξ_i is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from ‖w‖²/2 to ‖w‖²/2 + C (Σ_i ξ_i)^k, where C is a parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to errors. As it stands, this is a convex programming problem for any positive integer k; for k = 2 and k = 1 it is also a quadratic programming problem, and the choice k = 1 has the further advantage that neither the ξ_i, nor their Lagrange multipliers, appear in the Wolfe dual problem, which becomes:

Maximize:

L_D \equiv \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (43)

subject to:

0 \le \alpha_i \le C,    (44)

\sum_i \alpha_i y_i = 0.    (45)

The solution is again given by

w = \sum_{i=1}^{N_S} \alpha_i y_i x_i,    (46)

where N_S is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the α_i now have an upper bound of C. The situation is summarized schematically in Figure 6.

We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

L_P = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left\{ y_i (x_i \cdot w + b) - 1 + \xi_i \right\} - \sum_i \mu_i \xi_i    (47)
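To make the box-constrained dual of Eqs. (43)-(45) concrete, here is a minimal sketch that hands it to a general-purpose solver (SciPy's SLSQP) rather than to one of the specialized methods referred to in Section 5. The toy data, the value C = 1.0, the tolerances, and the recovery of b from points with 0 < α_i < C (which, anticipating the KKT conditions that follow from Eq. (47), sit exactly on the margin) are illustrative assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# small, non-separable toy problem (illustrative data only)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 0.5], [-1.0, -1.0],
              [-2.0, -1.5], [-1.5, -0.5], [1.8, 1.2], [-0.2, 0.1]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0])   # last two labels force slack
l, C = len(y), 1.0

H = np.outer(y, y) * (X @ X.T)          # quadratic term of Eq. (43)

# maximizing L_D is the same as minimizing (1/2) a^T H a - sum_i a_i
res = minimize(
    fun=lambda a: 0.5 * a @ H @ a - a.sum(),
    x0=np.zeros(l),
    jac=lambda a: H @ a - 1.0,
    bounds=[(0.0, C)] * l,                                   # Eq. (44)
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],    # Eq. (45)
    method="SLSQP",
)
alpha = res.x

w = (alpha * y) @ X                              # Eq. (46)
free = (alpha > 1e-6) & (alpha < C - 1e-6)       # support vectors not at the bound C
b = np.mean(y[free] - X[free] @ w) if free.any() else 0.0
print("alpha =", np.round(alpha, 3))
print("w =", np.round(w, 3), " b =", round(float(b), 3))
```

A production implementation would of course use the dedicated quadratic programming methods discussed later; the point here is only to show how Eqs. (43)-(45) translate into a solvable optimization problem.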