当前位置：和泉文库 > 计算机 > 浏览文档

北京大学：《模式识别》课程教学资源（参考资料）Introduction to Support Vector Learning

文件格式：PDF，文件大小：394.22KB，售价：11.66元

文档详细内容（约41页）

9 Feature Spaces and Kernels captures our intuition of the problem:as thehyperplane(cf.figure1.1)is complet ely determined by the patterns closest to it,the solution should not depend on the oth er examples. By substituting (1.13)and (1.14)into L,one eliminates the primal variables and arrives at the Wolfe dual of the optimiation problem (eg.Bertsckas,1995):find Dual multipliers ai whid Optimiz ation Problem maximt e W(a)= 1 aiailili(Xi·Xi) (1.16) subject to a:≥01i=11..11amd>a:=0. (1.17) 讹， The hyperplane decision function can thus be writt en as 4 f(x)=sgn :a·(x·x)+b (1.18 where b is computed using (1.15). The structure of the optimi ation problem cosely resembles those that typically arise in Logranges formulation of mecanics (eg.Goldstein,1986).Also there often only a subset of the constraints become active For instance f we keep a ball in a b ox,then it will typically roll into one of the corners.The constraints corresponding to the walls whic are not touded by the ball are irrelevant,the walls could just as well be removed. Seen in this light,it is not too surprising that it is possible to give a mecanical interpret ation of optimal margin hyperplanes(Burges and Sdolkopf,1997):If we assume that ead support vector xi exerts a perp endicular force of sie ai and sign yi on a solid plane sheet lying along the hyperplane,then the solution satisfies the requirements of medanical stability.The constraint(1.13)states that the forces on the sheet sum to zero;and(1.14)implies that the torques also sum to zero,via ∑：x:×yaw/lw‖=w×w/lw‖=0. There are several theoretical arguments supporting the good generaliz ation per- formance of the optimal hyp erplane (Vapnik and Chervonenkis (1974);Vapnik (1979),cf.chapters 3 and 4).In addition,it is computationally attractive,since it can be constructed by solving a quadratic programming problem.But how can this begeneralized to the case of decision functions whic,unlike (1.7),are nonlinear in the data? 1.3 Feature Spaces and Kernels To construct SVmadines,the optimal hyperplane algorithm had to be augment ed by a method for computing dot products in feature spaces nonlinearly related to input space (Aizerman et al.,1964;Boser et al.,1992).The b asic idea is to map the Feature space data into some other dot product space (called the feature space)F via a nonlinear .(0,1),9.-6

Feature Spaces and Kernels captures our intuition of the problem as the hyperplane cf gure is completely determined by the patterns closest to it the solution should not depend on the other examples By substituting and into L one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem eg Bertsekas nd Dual multipliers i which Optimization Problem maximize W X i i X ij ij yiyj xi xj sub ject to i i and X i iyi The hyperplane decision function can thus be written as f x sgn X i yii x xi b where b is computed using The structure of the optimization problem closely resembles those that typically arise in Lagranges formulation of mechanics eg Goldstein Also there often only a subset of the constraints become active For instance if we keep a ball in a box then it will typically roll into one of the corners The constraints corresponding to the walls which are not touched by the ball are irrelevant the walls could just as well be removed Seen in this light it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes Burges and Scholkopf If we assume that each support vector xi exerts a perpendicular force of size i and sign yi on a solid plane sheet lying along the hyperplane then the solution satis es the requirements of mechanical stability The constraint states that the forces on the sheet sum to zero and implies that the torques also sum to zero via P i xi yii wkwk w wkwk There are several theoretical arguments supporting the good generalization per formance of the optimal hyperplane Vapnik and Chervonenkis Vapnik cf chapters and In addition it is computationally attractive since it can be constructed by solving a quadratic programming problem But how can this be generalized to the case of decision functions which unlike are nonlinear in the data Feature Spaces and Kernels To construct SV machines the optimal hyperplane algorithm had to be augmented by a method for computing dot products in feature spaces nonlinearly related to input space Aizerman et al Boser et al The basic idea is to map the Feature Space data into some other dot product space called the feature space F via a nonlinear

Introduction to Support Vector Learning map 重：RN,F (1.19× and per orm the above linear algorithm in Fi For instance2 suppose we are given patterns x e RN where most in ormation is contained in the dith or der products.monomials=ofentries zj ofxziixj.,.rj2 where j.e...id e {1e.,.N)I In that case2 we might prefer to extract these monomial eatures fir st2 and work in the eature space F o fall products ofd entriesi This approach2 however2 fails fr realistically sized problems:fr N/dimensional input patterns2there exist.N+d-14/.d!.N-14=different monomialsI Already 16<16 pixel input images egi in character recognition=and a monomial degree d=5 yield a dimensionality of101 This problem can be overcome by noticing that both the construction ofthe optimal hyperplane in F.cf.1116-and the evaluation o fthe corresponding decision finction.1118-only require the evaluation ofdot products..x=.y=2and never the mapped patterns .x=in explicit ormi This is crucial2 since in some cases2the Mer cer Kernel dot products can be evaluated by a simple kernel ‖.xy==.Φ.x=TΦ.y= (1.20× For inst ance2the polynomial kernel ll.xty-=.xry_d (1.21× can be shown to correspond to a map into the space spanned by all products of exactly d dimensions ofRN.Poggio.1a75=Boser et al1.1002=Burges.1008fr a proof see chapter 20-1 For d=2 and xey e R22egn we have.Vapnik21a05= xy2=.x2xV2x.x2=2yV2.2T=.重.x=r重.y (1.22× defining .x==.x?xv2r.2- By using ll.xey==.x Ty +c-d with c>02 we can take into account all product o for der up to d.iel including those o forder smaller than d- More gener ally2 the pllowing theorem of finctional analysis shows that kernels llofpositive integral operators give rise to maps such that.120-holds.Mercer2 1a0a:Aizerman et al2 1064:Boser et al121002= Theorem 1.1 (Mercer) If is a continuous symmetric kernel ofa positive integral operator T2 ie .Tf-y-= ‖.xeyf.x= (1.23× with 厂xty-f.x-f.y-w>0 (1.2-× or all f e L2.C=.C being a compact subset of RN=it can be expanded in a uniprmly convergent series.on C<C-in termsofT's eigen finctions j and positive .(0,1),9.-6

Introduction to Support Vector Learning map RN F and perform the above linear algorithm in F For instance suppose we are given patterns x RN where most information is contained in the dth order products monomials of entries xj of x ie xj xjd where jjd fNg In that case we might prefer to extract these monomial features rst and work in the feature space F of all products of d entries This approach however fails for realistically sized problems for Ndimensional input patterns there exist N d d N di erent monomials Already pixel input images eg in character recognition and a monomial degree d yield a dimensionality of This problem can be overcome by noticing that both the construction of the optimal hyperplane in F cf and the evaluation of the corresponding decision function only require the evaluation of dot products x y and never the mapped patterns x in explicit form This is crucial since in some cases the Mercer Kernel dot products can be evaluated by a simple kernel kx y x y For instance the polynomial kernel kx yx yd can be shown to correspond to a map into the space spanned by all products of exactly d dimensions of RN Poggio Boser et al Burges for a proof see chapter For d and x y R eg we have Vapnik x y x x p xxy y p yy x y de ning xx x p xx By using kx yx y cd with c we can take into account all product of order up to d ie including those of order smaller than d More generally the following theorem of functional analysis shows that kernels k of positive integral operators give rise to maps such that holds Mercer Aizerman et al Boser et al Theorem Mercer If k is a continuous symmetric kernel of a positive integral operator T ie T f y Z C kx yf x dx with Z CC kx yf xf y dx dy for all f LC C being a compact subset of RN it can be expanded in a uniformly convergent series on CC in terms of T s eigenfunctions j and positive

Feature Spaces and Kernels eigenvalues j kx y X NF j jj xj y where NF is the number of positive eigenvalues Note that originally proven for the case where C a b ab R this theorem also holds true for general compact spaces Dunford and Schwartz An equivalent way to characterize Mercer kernels is that they give rise to positive matrices Kij kxi xj for all fx xg Saitoh One of the implications that need to be proven to show this equivalence follows from the fact that Kij is a Gram matrix for R we have K kP i ixik From it is straightforward to construct a map into a potentially in nite dimensional l space which satis es For instance we may use xp x p x Rather than thinking of the feature space as an l space we can alternatively represent it as the Hilbert space Hk containing all linear combinations of the functions f kxi xi C To ensure that the map C Hk which in this case is de ned as x kx satis es we need to endow Hk with a suitable dot product h i In view of the de nition of this dot product needs to satisfy hkx ky i kx y Reproducing which amounts to saying that k is a reproducing kernel for Hk For a Mercer kernel Kernel such a dot product does exist Since k is symmetric the i i NF can be chosen to be orthogonal with respect to the dot product in LC ie j nL C jn using the Kronecker jn From this we can construct h i such that hp jj p nni jn Substituting into then proves the desired equality for further details see chapter and Aronsza jn Wahba Girosi Scholkopf Besides SV practicioners use sigmoid kernels kx y tanhx y ! for suitable values of gain and threshold ! cf chapter and radial basis function kernels as for instance Aizerman et al Boser et al Scholkopf et al b kx y exp kx yk

点击进入文档下载页（PDF格式）

共41页，可试读14页，点击继续阅读 ↓↓

您可能感兴趣的文档

点击购买下载（PDF）

下载及服务说明

购买前请先查看本文档预览页，确认内容后再进行支付；
如遇文件无法下载、无法访问或其它任何问题，可发送电子邮件反馈，核实后将进行文件补发或退款等其它相关操作；
邮箱：

文档浏览记录