rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (1.33), subject to the constraints

\[ 0 \le \alpha_i \le C, \quad i = 1, \dots, \ell, \qquad \text{and} \qquad \sum_i \alpha_i y_i = 0. \tag{1.38} \]

The only difference from the separable case is the upper bound $C$ on the Lagrange multipliers $\alpha_i$. This way, the influence of the individual patterns (which could always be outliers) gets limited. As above, the solution takes the form (1.32). The threshold $b$ can be computed by exploiting the fact that for all SVs $x_i$ with $\alpha_i < C$, the slack variable $\xi_i$ is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence

\[ \sum_j y_j \alpha_j \, k(x_i, x_j) + b = y_i. \tag{1.39} \]

If one uses an optimizer that works with the double dual (e.g. Vanderbei, 1997), one can also recover the value of the primal variable $b$ directly from the corresponding double dual variable.
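As an illustrative aside (not part of the original formulation), the computation of $b$ via (1.39) can be sketched in a few lines of Python/NumPy. The array names alpha, y and X, the tolerance, and the choice of a Gaussian RBF kernel are assumptions made only for the sake of the example; averaging over all SVs with $0 < \alpha_i < C$ is a common numerical safeguard rather than a requirement of (1.39).

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian RBF kernel, used here only as a stand-in for any admissible kernel k.
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def compute_threshold(alpha, y, X, C, kernel=rbf_kernel, tol=1e-8):
    """Recover b via (1.39): for every SV x_i with 0 < alpha_i < C the slack xi_i
    vanishes, hence sum_j y_j alpha_j k(x_i, x_j) + b = y_i."""
    n = len(y)
    # SVs strictly inside the box constraint 0 < alpha_i < C ("margin" SVs)
    margin_svs = [i for i in range(n) if tol < alpha[i] < C - tol]
    b_values = []
    for i in margin_svs:
        s = sum(y[j] * alpha[j] * kernel(X[i], X[j]) for j in range(n) if alpha[j] > tol)
        b_values.append(y[i] - s)
    # averaging over all margin SVs is a numerical safeguard, not required by (1.39)
    return float(np.mean(b_values))
```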
1.5 Support Vector Regression

The concept of the margin is specific to pattern recognition. To generalize the SV algorithm to regression estimation (Vapnik, 1995), an analogue of the margin is constructed in the space of the target values $y$ (note that in regression, we have $y \in \mathbb{R}$) by using Vapnik's $\varepsilon$-insensitive loss function (Figure 1.4)

\[ |y - f(x)|_\varepsilon := \max\{0, |y - f(x)| - \varepsilon\}. \tag{1.40} \]

To estimate a linear regression

\[ f(x) = (w \cdot x) + b \tag{1.41} \]

with precision $\varepsilon$, one minimizes

\[ \frac{1}{2}\|w\|^2 + C \sum_i |y_i - f(x_i)|_\varepsilon. \tag{1.42} \]

Written as a constrained optimization problem, this reads (Vapnik, 1995):

\[ \text{minimize} \quad \tau(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C \sum_i (\xi_i + \xi_i^*) \tag{1.43} \]

subject to

\[ ((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i \tag{1.44} \]
\[ y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^* \tag{1.45} \]
\[ \xi_i, \xi_i^* \ge 0 \tag{1.46} \]

for all $i = 1, \dots, \ell$. Note that according to (1.44) and (1.45), any error smaller than $\varepsilon$ does not require a nonzero $\xi_i$ or $\xi_i^*$, and hence does not enter the objective function (1.43).

Figure 1.4: In SV regression, a desired accuracy $\varepsilon$ is specified a priori. One then attempts to fit a tube with radius $\varepsilon$ to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables $\xi$) is determined by minimizing (1.43).

Generalization to nonlinear regression estimation is carried out using kernel functions, in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for $C > 0$ and $\varepsilon \ge 0$ chosen a priori,

\[ \text{maximize} \quad W(\alpha, \alpha^*) = -\varepsilon \sum_i (\alpha_i^* + \alpha_i) + \sum_i (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \, k(x_i, x_j) \tag{1.47} \]

subject to

\[ 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \dots, \ell, \qquad \text{and} \qquad \sum_i (\alpha_i^* - \alpha_i) = 0. \tag{1.48} \]

The regression estimate takes the form

\[ f(x) = \sum_i (\alpha_i^* - \alpha_i) \, k(x_i, x) + b, \tag{1.49} \]

where $b$ is computed using the fact that (1.44) becomes an equality with $\xi_i = 0$ if $0 < \alpha_i < C$, and (1.45) becomes an equality with $\xi_i^* = 0$ if $0 < \alpha_i^* < C$.
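As a brief illustration, the $\varepsilon$-insensitive loss (1.40) and the regression estimate (1.49) translate directly into code. In the Python/NumPy sketch below, the names alpha, alpha_star, X_train and b are placeholders for quantities obtained from a solver for (1.47)-(1.48), and kernel stands for any admissible kernel function.

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps):
    # Vapnik's eps-insensitive loss (1.40): errors smaller than eps do not count.
    return np.maximum(0.0, np.abs(np.asarray(y) - np.asarray(f_x)) - eps)

def svr_predict(x, alpha, alpha_star, X_train, b, kernel):
    # Regression estimate (1.49): f(x) = sum_i (alpha*_i - alpha_i) k(x_i, x) + b.
    return sum((alpha_star[i] - alpha[i]) * kernel(X_train[i], x)
               for i in range(len(X_train))) + b
```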
Several extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on the vector $w$ (cf. (1.43)). There are multiple degrees of freedom for constructing it, including some freedom in how to penalize, or regularize, different parts of the vector, and some freedom in how to use the kernel trick. For instance, more general loss functions can be used for $\xi$, leading to problems that can still be solved efficiently (Smola and Schölkopf, 1998b; Smola et al., 1998a). Moreover, norms other than the 2-norm $\|\cdot\|$ can be used to regularize the solution (cf. chapters 18 and 19). Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of $w$ (Schölkopf et al., 1998d).

Finally, the algorithm can be modified such that $\varepsilon$ need not be specified a priori. Instead, one specifies an upper bound $0 \le \nu \le 1$ on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs), and the corresponding $\varepsilon$ is computed automatically. This is achieved by using as primal objective function

\[ \frac{1}{2}\|w\|^2 + C \Big( \nu \ell \varepsilon + \sum_i |y_i - f(x_i)|_\varepsilon \Big) \tag{1.50} \]

instead of (1.42), and treating $\varepsilon \ge 0$ as a parameter that we minimize over (Schölkopf et al., 1998a).

Figure 1.5: Architecture of SV machines. The input $x$ and the Support Vectors $x_i$ are nonlinearly mapped (by $\Phi$) into a feature space $F$, where dot products are computed. By the use of the kernel $k$, these two layers are in practice computed in one single step. The results are linearly combined by weights $\upsilon_i$, found by solving a quadratic program (in pattern recognition, $\upsilon_i = y_i \alpha_i$; in regression estimation, $\upsilon_i = \alpha_i^* - \alpha_i$). The linear combination is fed into the function $\sigma$ (in pattern recognition, $\sigma(x) = \mathrm{sgn}(x + b)$; in regression estimation, $\sigma(x) = x + b$).
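The two-layer structure of Figure 1.5 can be spelled out as a short evaluation routine. The Python/NumPy sketch below is only illustrative (the function and argument names are not from the original text): the first layer computes the kernel values $k(x, x_i)$, the second forms the weighted sum with the weights $\upsilon_i$, and $\sigma$ is then applied as described in the caption.

```python
import numpy as np

def sv_machine_output(x, support_vectors, weights, b, kernel, task="classification"):
    """Evaluate the two-layer SV machine of Figure 1.5 for a test vector x."""
    # First layer: kernel values k(x, x_i), i.e. dot products in the feature space F.
    hidden = np.array([kernel(sv, x) for sv in support_vectors])
    # Second layer: linear combination with weights v_i (v_i = y_i alpha_i in pattern
    # recognition, v_i = alpha*_i - alpha_i in regression estimation).
    t = float(np.dot(weights, hidden))
    if task == "classification":
        return np.sign(t + b)   # sigma(t) = sgn(t + b)
    return t + b                # sigma(t) = t + b
```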
1.6 Empirical Results, Implementations, and Further Developments

Having described the basics of SV machines, we now summarize empirical findings and theoretical developments which were to follow. We cannot report all contributions that have advanced the state of the art in SV learning since the time the algorithm was first proposed. Not even the present book can do this job, let alone a single section. Presently, we merely give a concise overview.

By the use of kernels, the optimal margin classifier was turned into a classifier which became a serious competitor of high-performance classifiers. Surprisingly, it was noticed that when different kernel functions are used in SV machines (specifically, polynomial, Gaussian RBF, and sigmoid kernels), they lead to very similar classification accuracies and SV sets (Schölkopf et al., 1995). In this sense, the SV set seems to characterize (or compress) the given task in a manner which up to a certain degree is independent of the type of kernel (i.e. the type of classifier) used.

Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SV machines on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs, and the Reduced Set method for speeding up classification. This way, SV machines became competitive with the best available classifiers on both OCR and object recognition tasks (Schölkopf et al., 1996a; Burges, 1996; Burges and Schölkopf, 1997; Schölkopf, 1997). Two years later, the above are still topics of ongoing research, as shown by chapter 16 and (Schölkopf et al., 1998b), proposing alternative Reduced Set methods, as well as by chapter 7 and (Schölkopf et al., 1998d), constructing kernel functions which incorporate prior knowledge about a given problem.

Another initial weakness of SV machines, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem scaled with the number of Support Vectors. This was due to the fact that in (1.33), the quadratic part contained at least all SVs; the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that some of the patterns that were initially not identified as SVs turn out to become SVs at a later stage (note that without chunking, the size of the matrix would be $\ell \times \ell$, where $\ell$ is the number of all training examples). What happens if we have a high-noise problem? In this case, many of the slack variables $\xi_i$ will become nonzero, and all the corresponding examples will become SVs. For this case, a decomposition algorithm was proposed (Osuna et al., 1997a), which is based on the observation that not only can we leave out the non-SV examples (i.e. the $x_i$ with $\alpha_i = 0$) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (i.e. $\alpha_i = C$). In fact, one can use chunks which do not even contain all SVs, and maximize over the corresponding subproblems. Chapter 12 explores an extreme case, where the subproblems are chosen so small that one can solve them analytically.
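To make the working-set idea concrete, the following Python sketch (illustrative only; the names, the tolerance, and the availability of a separate QP solver for the subproblems are assumptions) checks which training patterns currently violate the Karush-Kuhn-Tucker conditions of the soft-margin dual. In a chunking or decomposition scheme, violators found among the patterns left out of the current subproblem are the ones brought into the next working set.

```python
def kkt_violations(alpha, y, X, b, C, kernel, tol=1e-3):
    """Indices of patterns violating the KKT conditions of the soft-margin dual:
    alpha_i = 0      requires  y_i f(x_i) >= 1,
    0 < alpha_i < C  requires  y_i f(x_i) == 1,
    alpha_i = C      requires  y_i f(x_i) <= 1."""
    n = len(y)
    violators = []
    for i in range(n):
        # f(x_i) = sum_j y_j alpha_j k(x_j, x_i) + b, cf. the expansion (1.32)
        f_i = sum(y[j] * alpha[j] * kernel(X[j], X[i])
                  for j in range(n) if alpha[j] > tol) + b
        margin = y[i] * f_i
        if alpha[i] <= tol and margin < 1.0 - tol:
            violators.append(i)
        elif alpha[i] >= C - tol and margin > 1.0 + tol:
            violators.append(i)
        elif tol < alpha[i] < C - tol and abs(margin - 1.0) > tol:
            violators.append(i)
    return violators
```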