Preface

    Statistical Learning Theory now plays a more active role: after the general analysis of learning processes, the research in the area of synthesis of optimal algorithms was started. These studies, however, do not belong to history yet. They are a subject of today's research activities.
                                                            Vladimir Vapnik (1995)

The Support Vector Machine has recently been introduced as a new technique for solving a variety of learning and function estimation problems. During a workshop at the annual Neural Information Processing Systems (NIPS) conference, held in Breckenridge, Colorado, in December 1997, a snapshot of the state of the art in Support Vector learning was recorded. A variety of people helped in this, among them our co-organizer Léon Bottou, the NIPS workshop chairs Steve Nowlan and Rich Zemel, and all the workshop speakers and attendees who contributed to lively discussions.

After the workshop, we decided that it would be worthwhile to invest some time to have the snapshot printed. We invited all the speakers as well as other researchers to submit papers for this collection, and integrated the results into the present book. We believe that it covers the full range of current Support Vector research at an early point in time. This is possible for two reasons. First, the field of SV learning is in its early (and thus exciting) days. Second, this book gathers expertise from all contributors, whom we wholeheartedly thank for all the work they have put into our joint effort. Any single person trying to accomplish this task would most likely have failed: either by writing a book which is less comprehensive, or by taking more time to complete the book.

It is our hope that this outweighs the shortcomings of the book, most notably the fact that a collection of chapters can never be as homogeneous as a book conceived by a single person. We have tried to compensate for this by the selection and refereeing process of the submissions. In addition, we have written an introductory chapter describing the SV algorithm in some detail (chapter 1), and added a roadmap (chapter 2) which describes the actual contributions to follow in chapters 3 through 20.

Bernhard Schölkopf, Christopher J. C. Burges, Alexander J. Smola
Berlin, Holmdel, July 1998
1  Introduction to Support Vector Learning

The goal of this chapter, which describes the central ideas of SV learning, is twofold. First, we want to provide an introduction for readers unfamiliar with this field. Second, this introduction serves as a source of the basic equations for the chapters of this book. For more exhaustive treatments, we refer the interested reader to Vapnik (1995); Schölkopf (1997); Burges (1998).

1.1  Learning Pattern Recognition from Examples

Let us start with the problem of learning how to recognize patterns. Formally, we want to estimate a function f: ℝ^N → {±1} using input-output training data

    (x_1, y_1), \ldots, (x_\ell, y_\ell) \in \mathbb{R}^N \times \{\pm 1\},    (1.1)

such that f will correctly classify unseen examples (x, y), i.e. f(x) = y for examples (x, y) that were generated from the same underlying probability distribution P(x, y) as the training data. If we put no restriction on the class of functions that we choose our estimate f from, however, even a function which does well on the training data, e.g. by satisfying f(x_i) = y_i for all i = 1, ..., ℓ, need not generalize well to unseen examples. To see this, note that for each function f and any test set (x̄_1, ȳ_1), ..., (x̄_m, ȳ_m) ∈ ℝ^N × {±1} satisfying {x̄_1, ..., x̄_m} ∩ {x_1, ..., x_ℓ} = ∅, there exists another function f* such that f*(x_i) = f(x_i) for all i = 1, ..., ℓ, yet f*(x̄_i) ≠ f(x̄_i) for all i = 1, ..., m. As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test outputs) is preferable. Hence, only minimizing the training error (or empirical risk),

    R_{\mathrm{emp}}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{2} \left| f(x_i) - y_i \right|,    (1.2)

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

    R[f] = \int \frac{1}{2} \left| f(x) - y \right| \, dP(x, y).    (1.3)
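As a minimal sketch in Python, the empirical risk (1.2) for labels in {±1} can be computed as follows; the threshold classifier and the toy sample are hypothetical, chosen only for illustration.

    import numpy as np

    def empirical_risk(f, X, y):
        """Empirical risk (1.2): (1/ell) * sum_i (1/2)|f(x_i) - y_i| for y_i in {-1, +1}."""
        predictions = np.array([f(x) for x in X])
        return np.mean(0.5 * np.abs(predictions - y))

    # Toy training data (hypothetical): points in R^2 with labels in {-1, +1}.
    X_train = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
    y_train = np.array([-1, -1, +1, +1])

    # A hypothetical classifier: thresholding the first coordinate.
    f = lambda x: 1 if x[0] > 1.5 else -1

    print(empirical_risk(f, X_train, y_train))  # 0.0 on this toy sample

Since |f(x_i) - y_i| is either 0 or 2 for ±1-valued labels, the factor 1/2 makes (1.2) the fraction of misclassified training examples.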
Statistical learning theory (Vapnik and Chervonenkis, 1974; Vapnik, 1979), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that f is chosen from to one which has a capacity that is suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization (Vapnik, 1979). The best-known capacity concept of VC theory is the VC dimension, defined as the largest number h of points that can be separated in all possible ways using functions of the given class (cf. chapter 4). An example of a VC bound is the following: if h < ℓ is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 - η, the bound

    R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \phi\left(\frac{h}{\ell}, \frac{\log(\eta)}{\ell}\right)    (1.4)

holds, where the confidence term φ is defined as

    \phi\left(\frac{h}{\ell}, \frac{\log(\eta)}{\ell}\right) = \sqrt{\frac{h \left( \log\frac{2\ell}{h} + 1 \right) - \log(\eta/4)}{\ell}}.    (1.5)

Tighter bounds can be formulated in terms of other concepts, such as the annealed VC entropy or the growth function. These are usually considered to be harder to evaluate (cf., however, chapter 9), but they play a fundamental role in the conceptual part of VC theory (Vapnik, 1995). Alternative capacity concepts that can be used to formulate bounds include the fat shattering dimension, cf. chapter 4.

The bound (1.4) deserves some further explanatory remarks. Suppose we wanted to learn a "dependency" where P(x, y) = P(x) · P(y), i.e. where the pattern x contains no information about the label y, with uniform P(y). Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other). However, in order to reproduce the random labellings, this machine will necessarily require a large VC dimension h. Thus, the confidence term (1.5), increasing monotonically with h, will be large, and the bound (1.4) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (1.4) can hold independent of assumptions about the underlying distribution P(x, y): it always holds (provided that h < ℓ), but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (1.4), the function space must be restricted such that the capacity (e.g. VC dimension) is small enough (in relation to the available amount of data).
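As a rough numerical sketch of how (1.4) and (1.5) behave, the confidence term can be evaluated directly; the sample size, VC dimensions, and confidence level below are arbitrary example values.

    import numpy as np

    def vc_confidence(h, ell, eta):
        """Confidence term (1.5) for VC dimension h, sample size ell, confidence 1 - eta."""
        return np.sqrt((h * (np.log(2.0 * ell / h) + 1.0) - np.log(eta / 4.0)) / ell)

    def vc_bound(r_emp, h, ell, eta):
        """Right-hand side of the bound (1.4): empirical risk plus confidence term."""
        return r_emp + vc_confidence(h, ell, eta)

    # Example values (hypothetical). The bound is only informative when h is
    # small relative to ell; for h close to ell it exceeds 1 and becomes void.
    ell, eta, r_emp = 10000, 0.05, 0.1
    for h in (10, 100, 5000):
        print(h, round(vc_bound(r_emp, h, ell, eta), 3))

With these numbers the right-hand side is about 0.20 for h = 10, about 0.35 for h = 100, and above 1 for h = 5000, illustrating how the bound becomes void once the capacity is too large for the available data.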
1.2  Hyperplane Classifiers

To design learning algorithms, one thus needs to come up with a class of functions whose capacity can be computed. Vapnik and Lerner (1963) and Vapnik and Chervonenkis (1964) considered the class of hyperplanes

    (w \cdot x) + b = 0, \qquad w \in \mathbb{R}^N, \; b \in \mathbb{R},    (1.6)

corresponding to decision functions

    f(x) = \mathrm{sgn}\left( (w \cdot x) + b \right),    (1.7)

and proposed a learning algorithm for separable problems, termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts. First, among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes,

    \max_{w, b} \; \min \left\{ \| x - x_i \| : x \in \mathbb{R}^N, \; (w \cdot x) + b = 0, \; i = 1, \ldots, \ell \right\}.    (1.8)

Second, the capacity decreases with increasing margin.

[Figure: a toy data set with two classes, the separating hyperplane {x | (w · x) + b = 0}, and the two margin hyperplanes {x | (w · x) + b = +1} and {x | (w · x) + b = -1}; the inset notes that (w · x_1) + b = +1 and (w · x_2) + b = -1 imply (w · (x_1 - x_2)) = 2 and hence (w/||w|| · (x_1 - x_2)) = 2/||w||.]

Figure 1.1  A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem being separable, there exists a weight vector w and a threshold b such that y_i((w · x_i) + b) > 0, i = 1, ..., ℓ. Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w · x_i) + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i((w · x_i) + b) ≥ 1. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/||w||. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, i.e. (w · x_1) + b = 1, (w · x_2) + b = -1, and projecting them onto the hyperplane normal vector w/||w||.
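The decision function (1.7) and the margin of a canonical hyperplane can be checked numerically; the following sketch uses a hypothetical weight vector, threshold, and pair of points mirroring the construction in figure 1.1.

    import numpy as np

    def decision(w, b, x):
        """Decision function (1.7): f(x) = sgn((w . x) + b)."""
        return np.sign(np.dot(w, x) + b)

    # Hypothetical canonical hyperplane in R^2: the closest points on either
    # side satisfy (w . x) + b = +1 and (w . x) + b = -1, respectively.
    w = np.array([2.0, 0.0])
    b = -3.0
    x1 = np.array([2.0, 1.0])   # (w . x1) + b = +1
    x2 = np.array([1.0, 1.0])   # (w . x2) + b = -1

    print(decision(w, b, x1), decision(w, b, x2))   # 1.0 -1.0

    # Projecting x1 - x2 onto the unit normal w/||w|| recovers the margin 2/||w||.
    margin = np.dot(w / np.linalg.norm(w), x1 - x2)
    print(margin, 2.0 / np.linalg.norm(w))          # 1.0 1.0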
To construct this Optimal Hyperplane (cf. figure 1.1), one solves the following optimization problem:

    minimize    \tau(w) = \frac{1}{2} \| w \|^2    (1.9)
    subject to  y_i \left( (w \cdot x_i) + b \right) \ge 1, \qquad i = 1, \ldots, \ell.    (1.10)

This constrained optimization problem is dealt with by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian

    L(w, b, \alpha) = \frac{1}{2} \| w \|^2 - \sum_{i=1}^{\ell} \alpha_i \left( y_i \left( (x_i \cdot w) + b \right) - 1 \right).    (1.11)

The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i (i.e. a saddle point has to be found). Let us try to get some intuition for this. If a constraint (1.10) is violated, then y_i((w · x_i) + b) - 1 < 0, in which case L can be increased by increasing the corresponding α_i. At the same time, w and b will have to change such that L decreases. To prevent -α_i (y_i((w · x_i) + b) - 1) from becoming arbitrarily large, the change in w and b will ensure that, provided the problem is separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities, i.e. for which y_i((w · x_i) + b) - 1 > 0, the corresponding α_i must be 0: this is the value of α_i that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker complementarity conditions of optimization theory (Karush, 1939; Kuhn and Tucker, 1951; Bertsekas, 1995).

The condition that at the saddle point, the derivatives of L with respect to the primal variables must vanish,

    \frac{\partial}{\partial b} L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \alpha) = 0,    (1.12)

leads to

    \sum_{i=1}^{\ell} \alpha_i y_i = 0    (1.13)

and

    w = \sum_{i=1}^{\ell} \alpha_i y_i x_i.    (1.14)

The solution vector thus has an expansion in terms of a subset of the training patterns, namely those patterns whose α_i is non-zero, called Support Vectors. By the Karush-Kuhn-Tucker complementarity conditions,

    \alpha_i \cdot \left[ y_i \left( (x_i \cdot w) + b \right) - 1 \right] = 0, \qquad i = 1, \ldots, \ell,    (1.15)

the Support Vectors lie on the margin (cf. figure 1.1). All remaining examples of the training set are irrelevant: their constraint (1.10) does not play a role in the optimization, and they do not appear in the expansion (1.14). This nicely captures our intuition of the problem: the hyperplane is completely determined by the training patterns closest to it, so the solution does not depend on the remaining examples.
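The chapter derives the solution analytically; as a purely illustrative sketch, one can also approximate the hard-margin Optimal Hyperplane numerically, here with scikit-learn's linear SVC and a very large C (a hypothetical choice), and then verify the expansion (1.14), the constraint (1.13), and the complementarity conditions (1.15) on a toy data set.

    import numpy as np
    from sklearn.svm import SVC

    # Toy separable data in R^2 (hypothetical), labels in {-1, +1}.
    X = np.array([[0.0, 0.0], [1.0, 0.5], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 2.5], [3.5, 4.0]])
    y = np.array([-1, -1, -1, +1, +1, +1])

    # A very large C approximates the hard-margin problem (1.9)-(1.10).
    clf = SVC(kernel="linear", C=1e10).fit(X, y)

    alpha_y = clf.dual_coef_.ravel()      # y_i * alpha_i for the support vectors
    sv = clf.support_vectors_

    # (1.14): w = sum_i alpha_i y_i x_i (only support vectors contribute).
    w = alpha_y @ sv
    print(np.allclose(w, clf.coef_.ravel()))            # True

    # (1.13): sum_i alpha_i y_i = 0.
    print(np.isclose(alpha_y.sum(), 0.0))               # True

    # (1.15): support vectors lie on the margin, y_i((w . x_i) + b) = 1.
    b = clf.intercept_[0]
    print(np.round(y[clf.support_] * (sv @ w + b), 3))  # all approximately 1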