rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (1.33), subject to the constraints

$0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell, \quad \text{and} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0.$   (1.38)

The only difference from the separable case is the upper bound $C$ on the Lagrange multipliers $\alpha_i$. This way, the influence of the individual patterns (which could always be outliers) gets limited. As above, the solution takes the form (1.32). The threshold $b$ can be computed by exploiting the fact that for all SVs $x_i$ with $\alpha_i < C$, the slack variable $\xi_i$ is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence

$\sum_{j=1}^{\ell} y_j \alpha_j \, k(x_i, x_j) + b = y_i.$   (1.39)

If one uses an optimizer that works with the double dual (e.g. Vanderbei, 1997), one can also recover the value of the primal variable $b$ directly from the corresponding double dual variable.

1.5  Support Vector Regression

The concept of the margin is specific to pattern recognition. To generalize the SV algorithm to regression estimation (Vapnik, 1995), an analogue of the margin is constructed in the space of the target values $y$ (note that in regression, we have $y \in \mathbb{R}$) by using Vapnik's $\varepsilon$-insensitive loss function (figure 1.4),

$|y - f(x)|_{\varepsilon} := \max\{0, \; |y - f(x)| - \varepsilon\}.$   (1.40)

To estimate a linear regression

$f(x) = (w \cdot x) + b$   (1.41)

with precision $\varepsilon$, one minimizes

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} |y_i - f(x_i)|_{\varepsilon}.$   (1.42)

Written as a constrained optimization problem, this reads (Vapnik, 1995):

minimize   $\tau(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)$   (1.43)

subject to   $((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i$   (1.44)

$y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^*$   (1.45)

$\xi_i, \, \xi_i^* \ge 0$   (1.46)

for all $i = 1, \ldots, \ell$.
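To make the $\varepsilon$-insensitive loss (1.40) and the regularized objective (1.42) concrete, here is a minimal NumPy sketch for a linear model. The function and variable names are illustrative choices, not part of the original formulation.

```python
import numpy as np

def epsilon_insensitive(y, f_x, eps):
    """Vapnik's epsilon-insensitive loss (1.40): deviations below eps cost nothing."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

def regularized_objective(w, b, X, y, C, eps):
    """Objective (1.42) for the linear regression f(x) = (w . x) + b of (1.41)."""
    f_x = X @ w + b
    return 0.5 * np.dot(w, w) + C * np.sum(epsilon_insensitive(y, f_x, eps))

# Tiny illustration: only points outside the eps-tube contribute to the loss term.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.05, 1.0, 2.2])
w, b = np.array([1.0]), 0.0
print(regularized_objective(w, b, X, y, C=1.0, eps=0.1))  # 0.5*|w|^2 + one penalized point = 0.6
```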
Figure 1.4   In SV regression, a desired accuracy $\varepsilon$ is specified a priori. One then attempts to fit a tube with radius $\varepsilon$ to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables $\xi$) is determined by minimizing (1.43).

Note that according to (1.44) and (1.45), any error smaller than $\varepsilon$ does not require a nonzero $\xi_i$ or $\xi_i^*$, and hence does not enter the objective function (1.43). Generalization to nonlinear regression estimation is carried out using kernel functions, in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for $C > 0$ and $\varepsilon \ge 0$ chosen a priori,

maximize   $W(\alpha, \alpha^*) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \, k(x_i, x_j)$   (1.47)

subject to   $0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, \ell, \quad \text{and} \quad \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0.$   (1.48)

The regression estimate takes the form

$f(x) = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) \, k(x_i, x) + b,$   (1.49)

where $b$ is computed using the fact that (1.44) becomes an equality with $\xi_i = 0$ if $0 < \alpha_i < C$, and (1.45) becomes an equality with $\xi_i^* = 0$ if $0 < \alpha_i^* < C$.

Several extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on the vector $(w, \xi)$ (cf. (1.43)). There are multiple degrees of freedom for constructing it, including some freedom how to penalize, or regularize, different parts of the vector, and some freedom how to use the kernel trick. For instance, more general loss functions can be used for $\xi_i$, leading to problems that can still be solved efficiently (Smola and Schölkopf, 1998b; Smola et al., 1998a).
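As an illustration of how the regression estimate (1.49) and the computation of $b$ could be written in code, here is a hedged NumPy sketch. It assumes the dual coefficients alpha and alpha_star have already been obtained from some QP solver for (1.47)-(1.48), and it uses a Gaussian RBF kernel purely as an example; all names are illustrative.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Example kernel k(x, z); any admissible kernel could be substituted."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svr_decision(x, X_train, alpha, alpha_star, kernel=rbf_kernel):
    """The kernel expansion of (1.49) without the threshold b."""
    coeffs = alpha_star - alpha
    return sum(c * kernel(xi, x) for c, xi in zip(coeffs, X_train))

def svr_predict(x, X_train, alpha, alpha_star, b, kernel=rbf_kernel):
    """Regression estimate (1.49): f(x) = sum_i (alpha*_i - alpha_i) k(x_i, x) + b."""
    return svr_decision(x, X_train, alpha, alpha_star, kernel) + b

def compute_b(X_train, y, alpha, alpha_star, C, eps, kernel=rbf_kernel, tol=1e-8):
    """Recover b from a training point whose constraint is tight with zero slack."""
    for i in range(len(y)):
        f_i = svr_decision(X_train[i], X_train, alpha, alpha_star, kernel)
        if tol < alpha[i] < C - tol:        # (1.44) is an equality, xi_i = 0
            return y[i] + eps - f_i
        if tol < alpha_star[i] < C - tol:   # (1.45) is an equality, xi*_i = 0
            return y[i] - eps - f_i
    raise ValueError("no Lagrange multiplier strictly between 0 and C was found")
```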
Moreover, norms other than the 2-norm $\|\cdot\|$ can be used to regularize the solution (cf. chapters 18 and 19). Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of $w$ (Schölkopf et al., 1998d).

Finally, the algorithm can be modified such that $\varepsilon$ need not be specified a priori. Instead, one specifies an upper bound $0 \le \nu \le 1$ on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs), and the corresponding $\varepsilon$ is computed automatically. This is achieved by using as primal objective function

$\frac{1}{2}\|w\|^2 + C \left( \nu \ell \varepsilon + \sum_{i=1}^{\ell} |y_i - f(x_i)|_{\varepsilon} \right)$   (1.50)

instead of (1.42), and treating $\varepsilon \ge 0$ as a parameter that we minimize over (Schölkopf et al., 1998a).

Figure 1.5   Architecture of SV machines. The input $x$ and the Support Vectors $x_i$ are nonlinearly mapped (by $\Phi$) into a feature space $F$, where dot products $(\Phi(x) \cdot \Phi(x_i)) = k(x, x_i)$ are computed. By the use of the kernel $k$, these two layers are in practice computed in one single step. The results are linearly combined by weights $\upsilon_i$, found by solving a quadratic program (in pattern recognition, $\upsilon_i = y_i \alpha_i$; in regression estimation, $\upsilon_i = \alpha_i^* - \alpha_i$). The linear combination is fed into the function $\sigma$ (in pattern recognition, $\sigma(x) = \mathrm{sgn}(x + b)$; in regression estimation, $\sigma(x) = x + b$).
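The $\nu$-parameterization just described is available in standard software. As a hedged usage sketch (assuming scikit-learn is installed; its NuSVR follows a $\nu$-formulation of this kind), one can observe that $\nu$ roughly lower-bounds the fraction of SVs while the tube width is chosen automatically:

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X).ravel() + 0.1 * rng.standard_normal(200)

for nu in (0.1, 0.3, 0.6):
    model = NuSVR(nu=nu, C=1.0, kernel="rbf", gamma=1.0).fit(X, y)
    frac_sv = len(model.support_) / len(X)
    print(f"nu={nu:.1f}  fraction of SVs={frac_sv:.2f}")

# Asymptotically, nu lower-bounds the fraction of SVs and upper-bounds the
# fraction of training points lying outside the automatically chosen eps-tube.
```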
1.6  Empirical Results, Implementations, and Further Developments

Having described the basics of SV machines, we now summarize empirical findings and theoretical developments which were to follow. We cannot report all contributions that have advanced the state of the art in SV learning since the time the algorithm was first proposed; not even the present book can do this job, let alone a single section. Presently, we merely give a concise overview.

By the use of kernels, the optimal margin classifier was turned into a classifier which became a serious competitor of high-performance classifiers. Surprisingly, it was noticed that when different kernel functions are used in SV machines (specifically, (1.29), (1.30), and (1.31)), they lead to very similar classification accuracies and SV sets (Schölkopf et al., 1995). In this sense, the SV set seems to characterize (or compress) the given task in a manner which up to a certain degree is independent of the type of kernel (i.e. the type of classifier) used.

Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SV machines on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs, and the Reduced Set method for speeding up classification. This way, SV machines became competitive with the best available classifiers on both OCR and object recognition tasks (Schölkopf et al., 1996a; Burges, 1996; Burges and Schölkopf, 1997; Schölkopf, 1997). Two years later, the above are still topics of ongoing research, as shown by chapter 16 and (Schölkopf et al., 1998b), proposing alternative Reduced Set methods, as well as by chapter 7 and (Schölkopf et al., 1998c), constructing kernel functions which incorporate prior knowledge about a given problem.

Another initial weakness of SV machines, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem scaled with the number of Support Vectors. This was due to the fact that in (1.33), the quadratic part contained at least all SVs; the common practice was to extract the SVs by going through the training data in chunks, while regularly testing for the possibility that some of the patterns that were initially not identified as SVs turn out to become SVs at a later stage (note that without chunking, the size of the matrix would be $\ell \times \ell$, where $\ell$ is the number of all training examples). What happens if we have a high-noise problem? In this case, many of the slack variables $\xi_i$ will become nonzero, and all the corresponding examples will become SVs. For this case, a decomposition algorithm was proposed (Osuna et al., 1997a), which is based on the observation that not only can we leave out the non-SV examples (i.e. the $x_i$ with $\alpha_i = 0$) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (i.e. $\alpha_i = C$). In fact, one can use chunks which do not even contain all SVs, and maximize over the corresponding sub-problems. Chapter 12 explores an extreme case, where the sub-problems are chosen so small that one can solve them analytically.
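To make the chunking strategy described above concrete, here is a rough, illustrative sketch for the pattern-recognition dual (1.33) with constraints (1.38). It is not the decomposition algorithm of Osuna et al.; it simply keeps the current SVs in the working chunk, adds a batch of KKT violators, and re-solves the restricted QP, with scipy's SLSQP routine standing in for a dedicated QP solver. All function and parameter names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def chunking_svm(X, y, C=1.0, kernel=lambda A, B: A @ B.T,
                 chunk_add=20, max_rounds=20, tol=1e-3):
    """Illustrative chunking for the dual (1.33)/(1.38): only the alphas in the
    current chunk are optimized; every pattern outside the chunk keeps alpha = 0."""
    n = len(y)
    K = kernel(X, X)
    alpha = np.zeros(n)
    b = 0.0
    chunk = np.arange(min(chunk_add, n))          # arbitrary initial chunk

    def solve_chunk(work):
        Q = K[np.ix_(work, work)] * np.outer(y[work], y[work])
        def neg_dual(a):                          # -W(alpha) restricted to the chunk
            return -(a.sum() - 0.5 * a @ Q @ a)
        res = minimize(neg_dual, alpha[work], method="SLSQP",
                       bounds=[(0.0, C)] * len(work),
                       constraints={"type": "eq", "fun": lambda a: a @ y[work]})
        alpha[work] = res.x

    for _ in range(max_rounds):
        solve_chunk(chunk)
        alpha[alpha <= tol] = 0.0                 # patterns outside the chunk stay at 0
        sv = np.where(alpha > 0)[0]
        free = sv[alpha[sv] < C - tol]            # SVs with 0 < alpha_i < C
        if len(free):                             # threshold b as in (1.39)
            i = free[0]
            b = y[i] - (alpha * y) @ K[:, i]
        f = (alpha * y) @ K + b
        violators = np.where((alpha == 0) & (y * f < 1 - tol))[0]
        if len(violators) == 0:
            break                                 # all left-out points satisfy the KKT conditions
        chunk = np.concatenate([sv, violators[:chunk_add]])
    return alpha, b
```

On a low-noise problem such a loop terminates with a small chunk; on a high-noise problem the chunk grows toward the full SV set, which is exactly the limitation that motivated the decomposition approach discussed above.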