captures our intuition of the problem: as the hyperplane (cf. figure 1.1) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

By substituting (1.13) and (1.14) into L, one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem (e.g. Bertsekas, 1995): find multipliers \alpha_i which

Dual Optimization Problem

maximize

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)   (1.16)

subject to

\alpha_i \geq 0, \quad i = 1, \ldots, \ell, \quad \text{and} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0.   (1.17)

The hyperplane decision function can thus be written as

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i \, (x \cdot x_i) + b \right),   (1.18)

where b is computed using (1.15).

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics (e.g. Goldstein, 1986). There, too, often only a subset of the constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant; those walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes (Burges and Schölkopf, 1997): if we assume that each support vector x_i exerts a perpendicular force of size \alpha_i and sign y_i on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements of mechanical stability. The constraint (1.13) states that the forces on the sheet sum to zero, and (1.14) implies that the torques also sum to zero, via

\sum_i x_i \times y_i \alpha_i \, w / \|w\| = w \times w / \|w\| = 0.

There are several theoretical arguments supporting the good generalization performance of the optimal hyperplane (Vapnik and Chervonenkis (1974); Vapnik (1979), cf. chapters 3 and 4). In addition, it is computationally attractive, since it can be constructed by solving a quadratic programming problem. But how can this be generalized to the case of decision functions which, unlike (1.7), are nonlinear in the data?
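Before turning to that question, note that the dual (1.16)-(1.17) is a standard quadratic program, so the optimal hyperplane can be found with any off-the-shelf optimizer. The sketch below is illustrative only (the toy data, the variable names, and the use of scipy's SLSQP routine are assumptions made here, not part of the text): it maximizes W(\alpha) numerically, recovers w via (1.14), obtains b from a support vector using the KKT conditions (1.15), and evaluates the decision function (1.18).

```python
# Minimal sketch of the linear optimal-margin machine (assumed toy data and solver).
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: x_i in R^2, labels y_i in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
l = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j (x_i . x_j)

def neg_W(alpha):                                # maximize W(alpha) <=> minimize -W(alpha)
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_W, np.zeros(l), method="SLSQP",
               bounds=[(0.0, None)] * l,                              # alpha_i >= 0, cf. (1.17)
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                              # (1.14): w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                            # index of a support vector (alpha_i > 0)
b = y[sv] - w @ X[sv]                            # from the KKT conditions (1.15)

def f(x):                                        # decision function (1.18)
    return np.sign((alpha * y) @ (X @ x) + b)

print("alpha =", np.round(alpha, 3))
print("f([2.5, 2.5]) =", f(np.array([2.5, 2.5])))
```

Only the multipliers of the patterns closest to the decision boundary come out significantly different from zero, in line with the support vector picture described above.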
1.3 Feature Spaces and Kernels

Feature Space

To construct SV machines, the optimal hyperplane algorithm had to be augmented by a method for computing dot products in feature spaces nonlinearly related to input space (Aizerman et al., 1964; Boser et al., 1992). The basic idea is to map the data into some other dot product space F (called the feature space) via a nonlinear map

\Phi : \mathbb{R}^N \to F,

and perform the above linear algorithm in F. For instance, suppose we are given patterns x \in \mathbb{R}^N where most information is contained in the d-th order products (monomials) of entries x_j of x, i.e. x_{j_1} \cdot x_{j_2} \cdots x_{j_d}, where j_1, \ldots, j_d \in \{1, \ldots, N\}. In that case, we might prefer to extract these monomial features first, and work in the feature space F of all products of d entries. This approach, however, fails for realistically sized problems: for N-dimensional input patterns, there exist (N + d - 1)! / (d! \, (N - 1)!) different monomials. Already 16×16 pixel input images (e.g. in character recognition) and a monomial degree d = 5 yield a dimensionality of 10^{10}.

This problem can be overcome by noticing that both the construction of the optimal hyperplane in F (cf. (1.16)) and the evaluation of the corresponding decision function (1.18) only require the evaluation of dot products (\Phi(x) \cdot \Phi(y)), and never the mapped patterns \Phi(x) in explicit form. This is crucial, since in some cases, the dot products can be evaluated by a simple kernel

Mercer Kernel

k(x, y) = (\Phi(x) \cdot \Phi(y)).   (1.20)

For instance, the polynomial kernel

k(x, y) = (x \cdot y)^d   (1.21)

can be shown to correspond to a map \Phi into the space spanned by all products of exactly d dimensions of \mathbb{R}^N (Poggio, 1975; Boser et al., 1992; Burges, 1998; for a proof, see chapter 20). For d = 2 and x, y \in \mathbb{R}^2, for example, we have (Vapnik, 1995)

(x \cdot y)^2 = \left( (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \cdot (y_1^2, y_2^2, \sqrt{2}\, y_1 y_2) \right) = (\Phi(x) \cdot \Phi(y)),   (1.22)

defining \Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).

By using k(x, y) = (x \cdot y + c)^d with c > 0, we can take into account all products of order up to d (i.e. including those of order smaller than d).
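As a quick numerical sanity check of (1.21) and (1.22) (an illustrative sketch; the sample points and function names are arbitrary choices, not from the text), one can compare the kernel value (x · y)^2 with the dot product of the explicit degree-2 monomial features, and evaluate the combinatorial blow-up (N + d - 1)!/(d! (N - 1)!) quoted above:

```python
# Illustrative check that the polynomial kernel equals a dot product of monomial features.
import numpy as np
from math import comb

def phi(x):                                   # explicit map into R^3, cf. (1.22)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y, d=2):                        # kernel evaluated in input space, cf. (1.21)
    return (x @ y) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(k_poly(x, y), phi(x) @ phi(y))          # both print 1.0 = (x . y)^2

# Number of monomials of degree d in N variables: (N + d - 1)! / (d! (N - 1)!).
N, d = 16 * 16, 5                             # 16x16 pixel images, monomial degree 5
print(comb(N + d - 1, d))                     # roughly 10^10 features
```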
More generally, the following theorem of functional analysis shows that kernels k of positive integral operators give rise to maps \Phi such that (1.20) holds (Mercer, 1909; Aizerman et al., 1964; Boser et al., 1992):

Theorem 1.1 (Mercer)  If k is a continuous symmetric kernel of a positive integral operator T, i.e.

(T f)(y) = \int_C k(x, y)\, f(x)\, dx   (1.23)

with

\int_{C \times C} k(x, y)\, f(x)\, f(y)\, dx\, dy \geq 0   (1.24)

for all f \in L_2(C) (C being a compact subset of \mathbb{R}^N), it can be expanded in a uniformly convergent series (on C \times C) in terms of T's eigenfunctions \psi_j and positive eigenvalues \lambda_j,

k(x, y) = \sum_{j=1}^{N_F} \lambda_j\, \psi_j(x)\, \psi_j(y),   (1.25)

where N_F \leq \infty is the number of positive eigenvalues.

Note that, originally proven for the case where C = [a, b] (a < b \in \mathbb{R}), this theorem also holds true for general compact spaces (Dunford and Schwartz, 1963).

An equivalent way to characterize Mercer kernels is that they give rise to positive matrices K_{ij} := k(x_i, x_j) for all \{x_1, \ldots, x_\ell\} (Saitoh, 1988). One of the implications that need to be proven to show this equivalence follows from the fact that K_{ij} is a Gram matrix: for \alpha \in \mathbb{R}^\ell, we have (\alpha \cdot K\alpha) = \|\sum_{i=1}^{\ell} \alpha_i \Phi(x_i)\|^2 \geq 0.

From (1.25), it is straightforward to construct a map \Phi into a potentially infinite-dimensional l_2 space which satisfies (1.20). For instance, we may use

\Phi(x) = (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots).   (1.26)

Rather than thinking of the feature space as an l_2 space, we can alternatively represent it as the Hilbert space H_k containing all linear combinations of the functions f(\cdot) = k(x_i, \cdot) (x_i \in C). To ensure that the map \Phi : C \to H_k, which in this case is defined as

\Phi(x) = k(x, \cdot),   (1.27)

satisfies (1.20), we need to endow H_k with a suitable dot product \langle \cdot, \cdot \rangle. In view of the definition of \Phi, this dot product needs to satisfy

\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y),   (1.28)

Reproducing Kernel

which amounts to saying that k is a reproducing kernel for H_k. For a Mercer kernel (1.25), such a dot product does exist. Since k is symmetric, the \psi_i (i = 1, \ldots, N_F) can be chosen to be orthogonal with respect to the dot product in L_2(C), i.e. (\psi_j, \psi_n)_{L_2(C)} = \delta_{jn}, using the Kronecker \delta_{jn}. From this, we can construct \langle \cdot, \cdot \rangle such that

\langle \sqrt{\lambda_i}\, \psi_i, \sqrt{\lambda_n}\, \psi_n \rangle = \delta_{in}.   (1.29)

Substituting (1.25) into (1.28) then proves the desired equality (for further details, see chapter 6 and Aronszajn (1950); Wahba (1973); Girosi (1998); Schölkopf (1997)).

Besides (1.21), SV practitioners use sigmoid kernels

k(x, y) = \tanh(\kappa\, (x \cdot y) + \Theta)   (1.30)

for suitable values of gain \kappa and threshold \Theta (cf. chapter 7), and radial basis function kernels, as for instance (Aizerman et al., 1964; Boser et al., 1992; Schölkopf et al., 1997b)

k(x, y) = \exp\left( -\|x - y\|^2 / (2\, \sigma^2) \right).   (1.31)
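The positive-matrix property stated above can also be observed numerically. The sketch below (the random sample points and the choice \sigma = 1 are assumptions for illustration, not from the text) builds the Gram matrix of the Gaussian kernel (1.31) and confirms that its eigenvalues are nonnegative, i.e. that (\alpha \cdot K\alpha) = \|\sum_i \alpha_i \Phi(x_i)\|^2 \geq 0 for any \alpha:

```python
# Illustrative check: a Mercer kernel yields positive (semi-definite) Gram matrices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 arbitrary points in R^3
sigma = 1.0

# K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. (1.31)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)               # K is symmetric, so eigenvalues are real
print("smallest eigenvalue:", eigvals.min())  # >= 0 up to numerical precision

alpha = rng.normal(size=20)
print("alpha . K alpha =", alpha @ K @ alpha) # nonnegative for any alpha
```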
[Figure 1.2] The idea of SV machines: map the training data nonlinearly into a higher-dimensional feature space via \Phi, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (1.20), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

Note that when using Gaussian kernels, for instance, the feature space H_k contains all superpositions of Gaussians on C (plus limit points), whereas by definition of \Phi (1.27), only single bumps k(x, \cdot) have pre-images under \Phi.

1.4 Support Vector Machines

To construct SV machines, one computes an optimal hyperplane in feature space. To this end, we substitute \Phi(x_i) for each training example x_i. The weight vector (cf. (1.14)) then becomes an expansion in feature space, and will thus typically no longer correspond to the image of a single vector from input space (cf. Schölkopf et al. (1998c) for a formula to compute the pre-image if it exists). Since all patterns only occur in dot products, one can substitute Mercer kernels k for the dot products (Boser et al., 1992; Guyon et al., 1993), leading to decision functions of the more general form (cf. (1.18))

Decision Function

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i \, (\Phi(x) \cdot \Phi(x_i)) + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i \, k(x, x_i) + b \right)   (1.32)

and the following quadratic program (cf. (1.16)):

maximize

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)   (1.33)

subject to

\alpha_i \geq 0, \quad i = 1, \ldots, \ell, \quad \text{and} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0.   (1.34)
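Compared with the linear machine, only the Gram matrix changes: k(x_i, x_j) replaces (x_i · x_j) in (1.33), and the decision function is evaluated via (1.32). The toy example below (the XOR data, the inhomogeneous degree-2 polynomial kernel, and the solver choice are assumptions for illustration, not from the text) shows a problem that is not linearly separable in input space but is separated by the kernelized machine:

```python
# Illustrative sketch of the kernelized dual (1.33)-(1.34) on the XOR problem.
import numpy as np
from scipy.optimize import minimize

def k(x, y):                                   # inhomogeneous polynomial kernel of degree 2
    return (x @ y + 1.0) ** 2

X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])           # XOR labels: not linearly separable
l = len(y)

K = np.array([[k(xi, xj) for xj in X] for xi in X])
Q = (y[:, None] * y[None, :]) * K

res = minimize(lambda a: -(a.sum() - 0.5 * a @ Q @ a), np.zeros(l),
               method="SLSQP", bounds=[(0.0, None)] * l,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

s = np.argmax(alpha)                           # a support vector
b = y[s] - sum(alpha[i] * y[i] * k(X[i], X[s]) for i in range(l))

def f(x):                                      # decision function (1.32)
    return np.sign(sum(alpha[i] * y[i] * k(X[i], x) for i in range(l)) + b)

print([f(x) for x in X])                       # reproduces the labels [-1, -1, 1, 1]
```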
[Figure 1.3] Example of a Support Vector classifier found by using a radial basis function kernel k(x, y) = \exp(-\|x - y\|^2). Both coordinate axes range from -1 to +1. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (1.10). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Grey values code the modulus of the argument \sum_{i=1}^{\ell} y_i \alpha_i \, k(x, x_i) + b of the decision function (1.32). (From Schölkopf et al. (1996a); see also Burges (1998).)

Soft Margin Hyperplane

In practice, a separating hyperplane may not exist, e.g. if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (1.10), one introduces slack variables (Cortes and Vapnik, 1995; Vapnik, 1995)

\xi_i \geq 0, \quad i = 1, \ldots, \ell,   (1.35)

along with relaxed constraints

y_i \cdot ((w \cdot x_i) + b) \geq 1 - \xi_i, \quad i = 1, \ldots, \ell.   (1.36)

A classifier which generalizes well is then found by controlling both the classifier capacity (via \|w\|) and the number of training errors, minimizing the objective function

\tau(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{\ell} \xi_i   (1.37)

subject to the constraints (1.35) and (1.36), for some value of the constant C > 0 determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for corresponding vectors, e.g. \xi = (\xi_1, \ldots, \xi_\ell).
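For illustration, the soft margin problem can also be solved directly in its primal form (1.35)-(1.37) with a general-purpose solver; the overlapping toy data, the value C = 1, and the use of scipy's SLSQP routine are assumptions of this sketch, not part of the text:

```python
# Illustrative primal soft-margin training: minimize (1.37) subject to (1.35)-(1.36).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=+1.0, size=(15, 2)),     # two overlapping classes
               rng.normal(loc=-1.0, size=(15, 2))])
y = np.hstack([np.ones(15), -np.ones(15)])
l, C = len(y), 1.0

# Optimization variables z = (w_1, w_2, b, xi_1, ..., xi_l).
def objective(z):                                      # (1.37): ||w||^2 / 2 + C sum_i xi_i
    w, xi = z[:2], z[3:]
    return 0.5 * w @ w + C * xi.sum()

constraints = [
    {"type": "ineq", "fun": lambda z: y * (X @ z[:2] + z[2]) - 1.0 + z[3:]},  # (1.36)
    {"type": "ineq", "fun": lambda z: z[3:]},                                 # (1.35)
]
res = minimize(objective, np.zeros(3 + l), method="SLSQP", constraints=constraints)
w, b, xi = res.x[:2], res.x[2], res.x[3:]
print("w =", np.round(w, 2), " b =", round(b, 2),
      " training errors:", int((np.sign(X @ w + b) != y).sum()))
```

Increasing C penalizes slack more heavily and drives the solution towards fewer margin violations at the price of a smaller margin, which is the trade-off described above.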
Incorporating kernels, and