TRENDS & CONTROVERSIES

Support vector machines

By Marti A. Hearst, University of California, Berkeley
hearst@sims.berkeley.edu

My first exposure to Support Vector Machines came this spring when I heard Sue Dumais present impressive results on text categorization using this analysis technique. This issue's collection of essays should help familiarize our readers with this interesting new racehorse in the Machine Learning stable. Bernhard Schölkopf, in an introductory overview, points out that a particular advantage of SVMs over other learning algorithms is that they can be analyzed theoretically using concepts from computational learning theory, and at the same time can achieve good performance when applied to real problems. Examples of these real-world applications are provided by Sue Dumais, who describes the aforementioned text-categorization problem, yielding the best results to date on the Reuters collection, and Edgar Osuna, who presents strong results on application to face detection. Our fourth author, John Platt, gives us a practical guide and a new technique for implementing the algorithm efficiently.

–Marti Hearst
SVMs—a practical consequence of learning theory
Bernhard Schölkopf, GMD First

Is there anything worthwhile to learn about the new SVM algorithm, or does it fall into the category of "yet-another-algorithm," in which case readers should stop here and save their time for something more useful? In this short overview, I will try to argue that studying support-vector learning is very useful in two respects. First, it is quite satisfying from a theoretical point of view: SV learning is based on some beautifully simple ideas and provides a clear intuition of what learning from examples is about. Second, it can lead to high performance in practical applications.

In the following sense, the SV algorithm can be considered as lying at the intersection of learning theory and practice: for certain simple types of algorithms, statistical learning theory can identify rather precisely the factors that need to be taken into account to learn successfully. Real-world applications, however, often mandate the use of more complex models and algorithms—such as neural networks—that are much harder to analyze theoretically. The SV algorithm achieves both. It constructs models that are complex enough: it contains a large class of neural nets, radial basis function (RBF) nets, and polynomial classifiers as special cases. Yet it is simple enough to be analyzed mathematically, because it can be shown to correspond to a linear method in a high-dimensional feature space nonlinearly related to input space. Moreover, even though we can think of it as a linear algorithm in a high-dimensional space, in practice, it does not involve any computations in that high-dimensional space. By the use of kernels, all necessary computations are performed directly in input space. This is the characteristic twist of SV methods—we are dealing with complex algorithms for nonlinear pattern recognition,1 regression,2 or feature extraction,3 but for the sake of analysis and algorithmics, we can pretend that we are working with a simple linear algorithm.

I will explain the gist of SV methods by describing their roots in learning theory, the optimal hyperplane algorithm, the kernel trick, and SV function estimation. For details and further references, see Vladimir Vapnik's authoritative treatment,2 the collection my colleagues and I have put together,4 and the SV Web page at http://svm.first.gmd.de.

Learning pattern recognition from examples

For pattern recognition, we try to estimate a function $f: \mathbb{R}^N \to \{\pm 1\}$ using training data—that is, N-dimensional patterns $\mathbf{x}_i$ and class labels $y_i$,

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell) \in \mathbb{R}^N \times \{\pm 1\}$,   (1)

such that f will correctly classify new examples $(\mathbf{x}, y)$—that is, $f(\mathbf{x}) = y$ for examples $(\mathbf{x}, y)$ that were generated from the same underlying probability distribution $P(\mathbf{x}, y)$ as the training data. If we put no restriction on the class of functions from which we choose our estimate f, however, even a function that does well on the training data—for example, by satisfying $f(\mathbf{x}_i) = y_i$ (here and below, the index i is understood to run over $1, \ldots, \ell$)—need not generalize well to unseen examples. Suppose we know nothing additional about f (for example, about its smoothness). Then the values on the training patterns carry no information whatsoever about values on novel patterns. Hence learning is impossible, and minimizing the training error does not imply a small expected test error.
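To make the last point concrete, here is a minimal sketch (my own illustration, not part of the essay; the toy data and all names are invented) of a classifier that simply memorizes the training set: it reaches zero training error yet performs at chance level on new patterns drawn from the same distribution.

```python
# Sketch: zero training error without generalization (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))          # N-dimensional patterns x_i
y_train = np.sign(X_train[:, 0])            # labels in {-1, +1}
X_test = rng.normal(size=(50, 2))           # fresh patterns from the same distribution
y_test = np.sign(X_test[:, 0])

# A "memorizer": stores the training pairs and answers arbitrarily elsewhere.
table = {tuple(x): y for x, y in zip(X_train, y_train)}
def memorizer(x):
    return table.get(tuple(x), 1.0)          # unseen patterns: constant guess

train_acc = np.mean([memorizer(x) == y for x, y in zip(X_train, y_train)])
test_acc = np.mean([memorizer(x) == y for x, y in zip(X_test, y_test)])
print(train_acc, test_acc)                   # 1.0 on training data, about 0.5 on test
```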
Statistical learning theory,2 or VC (Vapnik-Chervonenkis) theory, shows that it is crucial to restrict the class of functions that the learning machine can implement to one with a capacity that is suitable for the amount of available training data.

Hyperplane classifiers

To design learning algorithms, we thus must come up with a class of functions whose capacity can be computed. SV classifiers are based on the class of hyperplanes

$(\mathbf{w} \cdot \mathbf{x}) + b = 0$, $\mathbf{w} \in \mathbb{R}^N$, $b \in \mathbb{R}$,   (2)

corresponding to decision functions

$f(\mathbf{x}) = \mathrm{sign}\left((\mathbf{w} \cdot \mathbf{x}) + b\right)$.   (3)

We can show that the optimal hyperplane, defined as the one with the maximal margin of separation between the two classes (see Figure 1), has the lowest capacity.
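As a concrete illustration of the maximal-margin hyperplane (a sketch of my own, not the authors' code; it assumes NumPy and SciPy and uses an invented toy data set), one can recover w and b by minimizing ||w||^2 / 2 subject to y_i((w · x_i) + b) >= 1 with a general-purpose solver:

```python
# Minimal sketch: maximal-margin hyperplane for a toy separable problem.
# Solves  minimize 1/2 ||w||^2  subject to  y_i ((w . x_i) + b) >= 1
# with SciPy's general-purpose SLSQP solver (data and names are invented).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1 ("balls")
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])   # class -1 ("diamonds")
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

def objective(p):                     # p = (w1, w2, b)
    w = p[:2]
    return 0.5 * np.dot(w, w)

constraints = [{"type": "ineq",       # y_i ((w . x_i) + b) - 1 >= 0
                "fun": lambda p, i=i: y[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.array([1.0, 1.0, -3.0]),  # a feasible initial guess
               constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
# Training points with y_i ((w . x_i) + b) equal to 1 lie on the margin:
# these are the support vectors.
```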
The optimal hyperplane can be uniquely constructed by solving a constrained quadratic optimization problem whose solution $\mathbf{w}$ has an expansion $\mathbf{w} = \sum_i v_i \mathbf{x}_i$ in terms of a subset of training patterns that lie on the margin (see Figure 1). These training patterns, called support vectors, carry all relevant information about the classification problem.

Omitting the details of the calculations, there is just one crucial property of the algorithm that we need to emphasize: both the quadratic programming problem and the final decision function $f(\mathbf{x}) = \mathrm{sign}\left(\sum_i v_i (\mathbf{x} \cdot \mathbf{x}_i) + b\right)$ depend only on dot products between patterns. This is precisely what lets us generalize to the nonlinear case.

Feature spaces and kernels

Figure 2 shows the basic idea of SV machines, which is to map the data into some other dot product space (called the feature space) F via a nonlinear map

$\Phi: \mathbb{R}^N \to F$,   (4)

and perform the above linear algorithm in F. As I've noted, this only requires the evaluation of dot products,

$k(\mathbf{x}, \mathbf{y}) := (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}))$.   (5)

Clearly, if F is high-dimensional, the right-hand side of Equation 5 will be very expensive to compute. In some cases, however, there is a simple kernel k that can be evaluated efficiently. For instance, the polynomial kernel

$k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^d$   (6)

can be shown to correspond to a map $\Phi$ into the space spanned by all products of exactly d dimensions of $\mathbb{R}^N$. For d = 2 and $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$, for example, we have

$(\mathbf{x} \cdot \mathbf{y})^2 = \left((x_1, x_2) \cdot (y_1, y_2)\right)^2 = \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right) \cdot \left(y_1^2, \sqrt{2}\,y_1 y_2, y_2^2\right) = (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}))$,   (7)

defining $\Phi(\mathbf{x}) = \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right)$. More generally, we can prove that for every kernel that gives rise to a positive matrix $(k(\mathbf{x}_i, \mathbf{x}_j))_{ij}$, we can construct a map $\Phi$ such that Equation 5 holds.

Besides Equation 6, SV practitioners use radial basis function (RBF) kernels such as

$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)\right)$,   (8)

and sigmoid kernels (with gain $\kappa$ and offset $\Theta$)

$k(\mathbf{x}, \mathbf{y}) = \tanh(\kappa(\mathbf{x} \cdot \mathbf{y}) + \Theta)$.   (9)

SVMs

We now have all the tools to construct nonlinear classifiers (see Figure 2). To this end, we substitute $\Phi(\mathbf{x}_i)$ for each training example $\mathbf{x}_i$, and perform the optimal hyperplane algorithm in F. Because we are using kernels, we will thus end up with a nonlinear decision function of the form

$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{\ell} v_i\, k(\mathbf{x}, \mathbf{x}_i) + b\right)$.   (10)

The parameters $v_i$ are computed as the solution of a quadratic programming problem.

In input space, the hyperplane corresponds to a nonlinear decision function whose form is determined by the kernel (see Figures 3 and 4).

The algorithm I've described thus far has a number of astonishing properties:

• It is based on statistical learning theory,
• It is practical (as it reduces to a quadratic programming problem with a unique solution), and
• It contains a number of more or less heuristic algorithms as special cases: by the choice of different kernel functions, we obtain different architectures (Figure 4), such as polynomial classifiers (Equation 6), RBF classifiers (Equation 8 and Figure 3), and three-layer neural nets (Equation 9).

The most important restriction up to now has been that we were only considering the case of classification. However, a generalization to regression estimation, that is, to $y \in \mathbb{R}$, can be given. In this case, the algorithm tries to construct a linear function in the feature space such that the training points lie within a distance $\varepsilon > 0$. Similar to the pattern-recognition case, we can write this as a quadratic programming problem in terms of kernels. The nonlinear regression estimate takes the form

$f(\mathbf{x}) = \sum_{i=1}^{\ell} v_i\, k(\mathbf{x}_i, \mathbf{x}) + b$.   (11)
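As a quick numerical check of the kernel trick in Equations 5 through 7 (my own sketch, assuming NumPy; it is not code from the essay), the polynomial kernel with d = 2 returns exactly the dot product of the explicitly mapped feature vectors:

```python
# Sketch: verify (x . y)^2 == Phi(x) . Phi(y) for Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(v):                                   # explicit feature map (d = 2, N = 2)
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, y, d=2):                   # Equation 6
    return np.dot(x, y) ** d

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)
print(poly_kernel(x, y))                      # kernel evaluated in input space
print(np.dot(phi(x), phi(y)))                 # dot product in feature space: same value
```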
Figure 1. A separable classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half way. There is a weight vector $\mathbf{w}$ and a threshold b such that $y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) > 0$. Rescaling $\mathbf{w}$ and b such that the point(s) closest to the hyperplane satisfy $|(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1$, we obtain a form $(\mathbf{w}, b)$ of the hyperplane with $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$. Note that the margin, measured perpendicularly to the hyperplane, equals $2/\|\mathbf{w}\|$. To maximize the margin, we thus have to minimize $\|\mathbf{w}\|$ subject to $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$.

Figure 2. The idea of SV machines: map the training data nonlinearly into a higher-dimensional feature space via $\Phi$ (the figure shows input space on the left and feature space on the right), and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
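To make the picture in Figure 2 concrete, the following sketch (my own illustration using NumPy and SciPy, not the authors' implementation; the toy data, kernel width, and solver choice are assumptions) trains a hard-margin SV classifier by solving the dual quadratic program with an RBF kernel, producing a decision function of the form of Equation 10:

```python
# Sketch: hard-margin SV classifier in the kernel (dual) formulation.
# Data, sigma, and the general-purpose solver are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def rbf(x, y, sigma=1.0):                       # Equation 8
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4],   # class -1
              [1.2, 1.0], [1.0, 1.3], [1.4, 1.2]])  # class +1
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
n = len(y)
K = np.array([[rbf(X[i], X[j]) for j in range(n)] for i in range(n)])

def neg_dual(a):                                 # negative of the dual objective
    return 0.5 * a @ (np.outer(y, y) * K) @ a - a.sum()

res = minimize(neg_dual, x0=np.full(n, 0.1),
               bounds=[(0.0, None)] * n,         # alpha_i >= 0 (hard margin)
               constraints=[{"type": "eq", "fun": lambda a: np.dot(a, y)}])
alpha = res.x
v = alpha * y                                    # the v_i of Equation 10
sv = int(np.argmax(alpha))                       # any support vector (alpha_i > 0)
b = y[sv] - np.dot(v, K[:, sv])

def f(x):                                        # decision function, Equation 10
    return np.sign(np.dot(v, [rbf(xi, x) for xi in X]) + b)

print([f(x) for x in X])                         # reproduces the training labels
```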
To apply the regression algorithm (Equation 11), we either specify $\varepsilon$ a priori, or we specify an upper bound on the fraction of training points allowed to lie outside of a distance $\varepsilon$ from the regression estimate (asymptotically, the number of SVs) and the corresponding $\varepsilon$ is computed automatically.5

Current developments and open issues

Chances are that those readers who are still with me might be interested to hear how researchers have built on the above, applied the algorithm to real-world problems, and developed extensions. In this respect, several fields have emerged.

• Training methods for speeding up the quadratic program, such as the one described later in this installment of Trends & Controversies by John Platt.
• Speeding up the evaluation of the decision function is of interest in a variety of applications, such as optical-character recognition.6
• The choice of kernel functions, and hence of the feature space to work in, is of both theoretical and practical interest. It determines both the functional form of the estimate and, via the objective function of the quadratic program, the type of regularization that is used to constrain the estimate.7,8 However, even though different kernels lead to different types of learning machines, the choice of kernel seems to be less crucial than it may appear at first sight. In OCR applications, the kernels (Equations 6, 9, and 8) lead to very similar performance and to strongly overlapping sets of support vectors.
• Although the use of SV methods in applications has only recently begun, application developers have already reported state-of-the-art performances in a variety of applications in pattern recognition, regression estimation, and time series prediction. However, it is probably fair to say that we are still missing an application where SV methods significantly outperform any other available algorithm or solve a problem that has so far been impossible to tackle. For the latter, SV methods for solving inverse problems are a promising candidate.9 Sue Dumais and Edgar Osuna describe promising applications in this discussion.
• Using kernels for other algorithms emerges as an exciting opportunity for developing new learning techniques. The kernel method for computing dot products in feature spaces is not restricted to SV machines. We can use it to derive nonlinear generalizations of any algorithm that can be cast in terms of dot products. As a mere start, we decided to apply this idea to one of the most widely used algorithms for data analysis, principal component analysis. This leads to kernel PCA,3 an algorithm that performs nonlinear PCA by carrying out linear PCA in feature space.

Figure 3. Example of an SV classifier found by using a radial basis function kernel (Equation 8). Circles and disks are two classes of training examples; the solid line is the decision surface; the support vectors found by the algorithm lie on, or between, the dashed lines. Colors code the modulus of the argument $\sum_i v_i\, k(\mathbf{x}, \mathbf{x}_i) + b$ of the decision function in Equation 10.
Figure 4. Architecture of SV methods. The input $\mathbf{x}$ and the support vectors $\mathbf{x}_i$ (in this example: digits) are nonlinearly mapped (by $\Phi$) into a feature space F, where dot products are computed. By the use of the kernel k, these two layers are in practice computed in one single step. The results are linearly combined by weights $v_i$, found by solving a quadratic program (in pattern recognition, $v_i = y_i \alpha_i$; in regression estimation, $v_i = \alpha_i^* - \alpha_i$)2 or an eigenvalue problem (in kernel PCA3). The linear combination is fed into the function $\sigma$ (in pattern recognition, $\sigma(x) = \mathrm{sign}(x + b)$; in regression estimation, $\sigma(x) = x + b$; in kernel PCA, $\sigma(x) = x$).
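The kernel-PCA idea mentioned in the bullet list and in the Figure 4 caption, and described further in the next paragraph, can be sketched in a few lines of NumPy. This is my own illustrative sketch under simplifying assumptions (an RBF kernel, projections of the training points only, invented names), not the reference implementation:

```python
# Sketch of kernel PCA: linear PCA carried out in the kernel-induced feature space.
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))                   # toy data: 30 points in R^2

K = rbf_kernel_matrix(X)
n = K.shape[0]
one = np.full((n, n), 1.0 / n)
Kc = K - one @ K - K @ one + one @ K @ one     # center the data in feature space

eigvals, eigvecs = np.linalg.eigh(Kc)          # eigenvalue problem on the kernel matrix
order = np.argsort(eigvals)[::-1]              # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                          # number of nonlinear components to keep
alphas = eigvecs[:, :k] / np.sqrt(eigvals[:k]) # normalize the expansion coefficients
projections = Kc @ alphas                      # nonlinear principal components of X
print(projections.shape)                       # (30, 2)
```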
The kernel-PCA method consists of solving a linear eigenvalue problem for a matrix whose elements are computed using the kernel function. The resulting feature extractors have the same architecture as SV machines (see Figure 4). A number of researchers have since started to "kernelize" various other linear algorithms.

References
1. B.E. Boser, I.M. Guyon, and V.N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. Fifth Ann. Workshop Computational Learning Theory, ACM Press, New York, 1992, pp. 144–152.
2. V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
3. B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, Vol. 10, 1998, pp. 1299–1319.
4. B. Schölkopf, C.J.C. Burges, and A.J. Smola, Advances in Kernel Methods—Support Vector Learning, to appear, MIT Press, Cambridge, Mass., 1998.
5. B. Schölkopf et al., "Support Vector Regression with Automatic Accuracy Control," to be published in Proc. Eighth Int'l Conf. Artificial Neural Networks, Perspectives in Neural Computing, Springer-Verlag, Berlin, 1998.
6. C.J.C. Burges, "Simplified Support Vector Decision Rules," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 71–77.
7. A. Smola and B. Schölkopf, "From Regularization Operators to Support Vector Kernels," Advances in Neural Information Processing Systems 10, M. Jordan, M. Kearns, and S. Solla, eds., MIT Press, 1998.
8. F. Girosi, An Equivalence between Sparse Approximation and Support Vector Machines, AI Memo No. 1606, MIT, Cambridge, Mass., 1997.
9. J. Weston et al., Density Estimation Using Support Vector Machines, Tech. Report CSD-TR-97-23, Royal Holloway, Univ. of London, 1997.

Using SVMs for text categorization
Susan Dumais, Decision Theory and Adaptive Systems Group, Microsoft Research

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization—the assignment of natural-language texts to one or more predefined categories based on their content—is an important component in many information organization and management tasks. Machine-learning methods, including SVMs, have tremendous potential for helping people more effectively organize electronic resources.

Today, most text categorization is done by people. We all save hundreds of files, e-mail messages, and URLs in folders every day. We are often asked to choose keywords from an approved set of indexing terms for describing our technical publications. On a much larger scale, trained specialists assign new items to categories in large taxonomies such as the Dewey Decimal or Library of Congress subject headings, Medical Subject Headings (MeSH), or Yahoo!'s Internet directory. Between these two extremes, people organize objects into categories to support a wide variety of information-management tasks, including information routing/filtering/push, identification of objectionable materials or junk mail, structured search and browsing, and topic identification for topic-specific processing operations.

Human categorization is very time-consuming and costly, thus limiting its applicability—especially for large or rapidly changing collections. Consequently, interest is growing in developing technologies for (semi)automatic text categorization.
Rule-based approaches similar to those employed in expert systems have been used, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use inductive-learning techniques to automatically construct classifiers using labeled training data. Researchers have applied a growing number of learning techniques to text categorization, including multivariate regression, nearest-neighbor classifiers, probabilistic Bayesian models, decision trees, and neural networks.1,2 Recently, my colleagues and I and others have used SVMs for text categorization with very promising results.3,4 In this essay, I briefly describe the results of experiments in which we use SVMs to classify newswire stories from Reuters.4

About the authors

Susan T. Dumais is a senior researcher in the Decision Theory and Adaptive Systems Group at Microsoft Research. Her research interests include algorithms and interfaces for improved information retrieval and classification, human-computer interaction, combining search and navigation, user modeling, individual differences, collaborative filtering, and organizational impacts of new technology. She received a BA in mathematics and psychology from Bates College and a PhD in cognitive psychology from Indiana University. She is a member of the ACM, the ASIS, the Human Factors and Ergonomic Society, and the Psychonomic Society. Contact her at Microsoft Research, 1 Microsoft Way, Redmond, WA 98052; sdumais@microsoft.com; http://research.microsoft.com/~sdumais.

Edgar Osuna has just returned to his native Venezuela after receiving his PhD in operations research from the Massachusetts Institute of Technology. His research interests include the study of different aspects and properties of Vapnik's SVM. He received his BS in computer engineering from the Universidad Simon Bolivar, Caracas, Venezuela. Contact him at IESA, POBA International #646, PO Box 02-5255, Miami, FL 33102-5255; eosuna@usb.ve.

John Platt is a senior researcher in the Signal Processing Group at Microsoft Research. Recently, he has concentrated on fast general-purpose machine-learning algorithms for use in processing signals. More generally, his research interests include neural networks, machine learning, computer vision, and signal processing. He received his PhD in computer science at Caltech in 1989, where he studied computer graphics and neural networks. He received his BS in chemistry from California State University at Long Beach. Contact him at Microsoft Research, 1 Microsoft Way, Redmond, WA 98005; jplatt@microsoft.com; http://research.microsoft.com/~jplatt.

Bernhard Schölkopf is a researcher at GMD First, Berlin. His research interests are in machine learning and perception, in particular using SVMs and kernel PCA. He has an MSc in mathematics from the University of London, and a Diploma in physics and a PhD in computer science, both from the Max Planck Institute. Contact him at GMD First, Rm. 208, Rudower Chaussee 5, D-12489 Berlin; bs@first.gmd.de; http://www.first.gmd.de/~bs.
Learning text categorizers

The goal of automatic text-categorization systems is to assign new items to one or more of a set of predefined categories on the basis of their textual content. Optimal categorization functions can be learned from labeled training examples.

Inductive learning of classifiers. A classifier is a function that maps an input attribute vector, $\vec{x} = (x_1, x_2, x_3, \ldots, x_n)$, to the confidence that the input belongs to a class, that is, $f(\vec{x}) = \mathrm{confidence(class)}$. In the case of text classification, the attributes are words in the document and the classes are the categories of interest (for example, Reuters categories include "interest," "earnings," and "grain").

Example classifiers for the Reuters category interest are

• if (interest AND rate) OR (quarterly), then confidence("interest" category) = 0.9
• confidence("interest" category) = 0.3*interest + 0.4*rate + 0.7*quarterly

The key idea behind SVMs and other inductive-learning approaches is to use a training set of labeled instances to learn the classification function automatically. SVM classifiers resemble the second example above: a vector of learned feature weights. The resulting classifiers are easy to construct and update, depend only on information that is easy for people to provide (that is, examples of items that are in or out of categories), and allow users to smoothly trade off precision and recall depending on their task.

Text representation and feature selection. Each document is represented as a vector of words, as is typically done in information retrieval.5 For most text-retrieval applications, the entries in the vector are weighted to reflect the frequency of terms in documents and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (a word either occurs or does not occur in a document) are often used instead.

Text collections containing millions of unique terms are quite common. Thus, for both efficiency and efficacy, feature selection is widely used when applying machine-learning methods to text categorization. To reduce the number of features, we first remove features based on overall frequency counts, and then select a small number of features based on their fit to categories. We use the mutual information between each feature and a category to further reduce the feature space. These much smaller document descriptions then serve as input to the SVM; a short code sketch of this pipeline appears below.

Learning SVMs. We used simple linear SVMs because they provide good generalization accuracy and are fast to learn. Thorsten Joachims has explored two classes of nonlinear SVMs, polynomial classifiers and radial basis functions, and observed only small benefits compared to linear models.3 We used John Platt's Sequential Minimal Optimization method6 (described in a later essay) to learn the vector of feature weights, $\vec{w}$. Once the weights are learned, new items are classified by computing $\vec{x} \cdot \vec{w}$, where $\vec{w}$ is the vector of learned weights and $\vec{x}$ is the binary vector representing a new document. We also learned two parameters of a sigmoid function to transform the output of the SVM to probabilities.

An example—Reuters

The Reuters collection is a popular one for text-categorization research and is publicly available at http://www.research.att.com/~lewis/reuters21578.html. We used the 12,902 Reuters stories that have been classified into 118 categories.
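Before turning to the results, here is a small sketch of the kind of pipeline described above (my own illustration, not the authors' code; the four-document corpus, the number of selected terms, and the weights in the final decision rule are all invented): binary bag-of-words features, mutual-information term selection for one category, and a linear decision rule of the form x · w + b.

```python
# Sketch: binary term features, mutual-information feature selection,
# and a linear decision rule for one category (illustrative data and names).
import numpy as np

docs = ["rate rise lifts interest", "quarterly interest rate report",
        "wheat harvest up", "corn and wheat exports fall"]
in_category = np.array([1, 1, 0, 0])            # membership in the "interest" category

vocab = sorted(set(w for d in docs for w in d.split()))
X = np.array([[1 if w in d.split() else 0 for w in vocab] for d in docs])  # binary features

def mutual_information(feature, label):
    """MI (in nats) between a binary feature column and a binary label."""
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            p_fc = np.mean((feature == f) & (label == c))
            p_f, p_c = np.mean(feature == f), np.mean(label == c)
            if p_fc > 0:
                mi += p_fc * np.log(p_fc / (p_f * p_c))
    return mi

mi_scores = [mutual_information(X[:, j], in_category) for j in range(len(vocab))]
selected = np.argsort(mi_scores)[::-1][:3]      # keep the most informative terms
# Note: MI is symmetric, so terms that predict non-membership (e.g. "wheat")
# can score as highly as terms that predict membership.
print([vocab[j] for j in selected])

# A linear decision rule of the kind the SVM learns, x . w + b > 0, sketched with
# made-up weights over two hand-picked terms (real weights come from training, e.g. SMO).
idx = [vocab.index("interest"), vocab.index("rate")]
w, b = np.array([0.8, 0.6]), -0.5
print(X[:, idx] @ w + b > 0)                    # predicted membership in "interest"
```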
Following the ModApte split, we used 75% of the stories (9,603 stories) to build classifiers and the remaining 25% (3,299 stories) to test the accuracy of the resulting models in reproducing the manual category assignments. Stories can be assigned to more than one category.

Text files are automatically processed to produce a vector of words for each document. Eliminating words that appear in only one document and then selecting the 300 words with highest mutual information with each category reduces the number of features. These 300-element binary feature vectors serve as input to the SVM. A separate

Table 1. Break-even performance for five learning algorithms.

              FINDSIM   NAIVE BAYES   BAYESNETS   TREES   LINEARSVM
              (%)       (%)           (%)         (%)     (%)
earn          92.9      95.9          95.8        97.8    98.0
acq           64.7      87.8          88.3        89.7    93.6
money-fx      46.7      56.6          58.8        66.2    74.5
grain         67.5      78.8          81.4        85.0    94.6
crude         70.1      79.5          79.6        85.0    88.9
trade         65.1      63.9          69.0        72.5    75.9
interest      63.4      64.9          71.3        67.1    77.7
ship          49.2      85.4          84.4        74.2    85.6
wheat         68.9      69.7          82.7        92.5    91.8
corn          48.2      65.3          76.4        91.8    90.3
Avg. top 10   64.6      81.5          85.0        88.4    92.0
Avg. all      61.7      75.2          80.0        N/A     87.0

Figure 5. ROC curve. (Axes: recall versus precision; curves shown for Find Similar, Decision Tree, Linear SVM, and Naive Bayes.)
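As a closing illustration of how numbers like those in Table 1 and the precision-recall curves in Figure 5 are obtained (my own sketch with invented scores and labels, not the authors' evaluation code), precision and recall for one category can be computed by sweeping a threshold over the classifier's outputs; the break-even point is where the two measures are approximately equal:

```python
# Sketch: precision, recall, and a precision/recall break-even point for one category.
import numpy as np

# Invented classifier scores (e.g. SVM outputs) and true category labels for 10 test docs.
scores = np.array([2.1, 1.7, 1.2, 0.9, 0.4, 0.1, -0.3, -0.8, -1.5, -2.0])
truth  = np.array([1,   1,   0,   1,   1,   0,   0,    1,    0,    0])

def precision_recall(threshold):
    predicted = scores >= threshold
    tp = np.sum(predicted & (truth == 1))
    precision = tp / max(predicted.sum(), 1)
    recall = tp / truth.sum()
    return precision, recall

# Sweep the threshold over the observed scores; the break-even point is where
# precision and recall are closest to equal.
best = min((abs(p - r), (p + r) / 2.0)
           for p, r in (precision_recall(t) for t in scores))
print("approximate break-even:", best[1])
```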