7 Feature Selection and Extraction
  7.1 Selecting an Initial Set of Features
  7.2 Procedure for Feature Selection
  7.3 Feature Selection Using Support Vector Machines
    7.3.1 Backward or Forward Feature Selection
    7.3.2 Support Vector Machine-Based Feature Selection
    7.3.3 Feature Selection by Cross-Validation
  7.4 Feature Extraction
  References

8 Clustering
  8.1 Domain Description
  8.2 Extension to Clustering
  References

9 Maximum-Margin Multilayer Neural Networks
  9.1 Approach
  9.2 Three-Layer Neural Networks
  9.3 CARVE Algorithm
  9.4 Determination of Hidden-Layer Hyperplanes
    9.4.1 Rotation of Hyperplanes
    9.4.2 Training Algorithm
  9.5 Determination of Output-Layer Hyperplanes
  9.6 Determination of Parameter Values
  9.7 Performance Evaluation
  References

10 Maximum-Margin Fuzzy Classifiers
  10.1 Kernel Fuzzy Classifiers with Ellipsoidal Regions
    10.1.1 Conventional Fuzzy Classifiers with Ellipsoidal Regions
    10.1.2 Extension to a Feature Space
    10.1.3 Transductive Training
    10.1.4 Maximizing Margins
    10.1.5 Performance Evaluation
  10.2 Fuzzy Classifiers with Polyhedral Regions
    10.2.1 Training Methods
    10.2.2 Performance Evaluation
  References

11 Function Approximation
  11.1 Optimal Hyperplanes
  11.2 L1 Soft-Margin Support Vector Regressors
  11.3 L2 Soft-Margin Support Vector Regressors
  11.4 Model Selection
  11.5 Training Methods
    11.5.1 Overview
    11.5.2 Newton's Methods
    11.5.3 Active Set Training
  11.6 Variants of Support Vector Regressors
    11.6.1 Linear Programming Support Vector Regressors
    11.6.2 ν-Support Vector Regressors
    11.6.3 Least-Squares Support Vector Regressors
  11.7 Variable Selection
    11.7.1 Overview
    11.7.2 Variable Selection by Block Deletion
    11.7.3 Performance Evaluation
  References

A Conventional Classifiers
  A.1 Bayesian Classifiers
  A.2 Nearest-Neighbor Classifiers
  References

B Matrices
  B.1 Matrix Properties
  B.2 Least-Squares Methods and Singular Value Decomposition
  B.3 Covariance Matrices
  References

C Quadratic Programming
  C.1 Optimality Conditions
  C.2 Properties of Solutions

D Positive Semidefinite Kernels and Reproducing Kernel Hilbert Space
  D.1 Positive Semidefinite Kernels
  D.2 Reproducing Kernel Hilbert Space
  References

Index
Symbols

We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. The following list shows the symbols used in the book:

αi        Lagrange multiplier for xi
ξi        slack variable associated with xi
A⁻¹       inverse of matrix A
Aᵀ        transpose of matrix A
B         set of bounded support vector indices
bi        bias term of the ith hyperplane
C         margin parameter
d         degree of a polynomial kernel
φ(x)      mapping function from x to the feature space
γ         parameter for a radial basis function kernel
K(x, x′)  kernel
l         dimension of the feature space
M         number of training data
m         number of input variables
n         number of classes
S         set of support vector indices
U         set of unbounded support vector indices
‖x‖       Euclidean norm of vector x
wi        coefficient vector of the ith hyperplane
Xi        set of class i training data
|Xi|      number of data in the set Xi
xi        ith m-dimensional training data
yi        class label 1 or −1 for input xi for pattern classification and a scalar output for function approximation
Chapter 1
Introduction

Support vector machines and their variants and extensions, often called kernel-based methods (or simply kernel methods), have been studied extensively and applied to various pattern classification and function approximation problems. Pattern classification is the task of classifying an object into one of a given set of categories, called classes. For a specific pattern classification problem, a classifier, which is computer software, is developed so that objects are classified correctly with reasonably good accuracy. Inputs to the classifier are called features, because they are determined so that they represent each class well or so that data belonging to different classes are well separated in the input space.

In general there are two approaches to developing classifiers: a parametric approach [1], in which a priori knowledge of the data distributions is assumed, and a nonparametric approach, in which no a priori knowledge is assumed. Neural networks [2–4], fuzzy systems [5–7], and support vector machines [8, 9] are typical nonparametric classifiers. Through training using input–output pairs, classifiers acquire decision functions that classify an input into one of the given classes.

In this chapter we first classify decision functions for a two-class problem into direct and indirect decision functions. The class boundary given by a direct decision function corresponds to the curve where the function vanishes, while the class boundary given by two indirect decision functions corresponds to the curve where the two functions give the same values. Then we discuss how to define and determine direct decision functions for multiclass problems and list the benchmark data sets used in the book. Finally, we discuss some measures for evaluating the performance of classifiers and function approximators on a given data set.
1.1 Decision Functions

1.1.1 Decision Functions for Two-Class Problems

Consider classifying an m-dimensional vector x = (x1, ..., xm) into one of two classes. Suppose that we are given scalar functions g1(x) and g2(x) for Classes 1 and 2, respectively, and we classify x into Class 1 if

    g1(x) > 0,  g2(x) < 0,        (1.1)

and into Class 2 if

    g1(x) < 0,  g2(x) > 0.        (1.2)

We call these functions decision functions. By the preceding decision functions, if for x

    g1(x) g2(x) > 0               (1.3)

is satisfied, x is not classifiable (see the hatched regions in Fig. 1.1; the arrows show the positive sides of the functions).

Fig. 1.1 Decision functions in a two-dimensional space

To resolve unclassifiable regions, we may change (1.1) and (1.2) as follows. We classify x into Class 1 if

    g1(x) > g2(x)                 (1.4)
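As a concrete illustration of these rules, the following Python sketch (not from the book) implements both the strict rules (1.1) and (1.2) and the comparison rule (1.4), with Class 2 assigned when the comparison fails. The two linear decision functions g1 and g2 and their coefficients are hypothetical choices for a two-dimensional input, made up only for this example.

```python
import numpy as np

# Two hypothetical linear decision functions for a two-dimensional input x.
# The coefficients are arbitrary and chosen only for illustration.
def g1(x):
    return x[0] + x[1] - 1.0    # positive on the Class 1 side

def g2(x):
    return -x[0] + x[1] + 0.5   # positive on the Class 2 side

def classify_strict(x):
    """Rules (1.1) and (1.2): the region where g1(x) g2(x) > 0 is unclassifiable."""
    if g1(x) > 0 and g2(x) < 0:
        return 1
    if g1(x) < 0 and g2(x) > 0:
        return 2
    return None  # unclassifiable (or exactly on a boundary)

def classify_by_comparison(x):
    """Rule (1.4): compare the two decision functions, so every x gets a class."""
    return 1 if g1(x) > g2(x) else 2

x = np.array([1.0, 1.0])            # here g1(x) = 1 and g2(x) = 0.5, both positive
print(classify_strict(x))           # None: x lies in an unclassifiable region
print(classify_by_comparison(x))    # 1: resolved by comparing g1(x) and g2(x)
```

For x = (1, 1), both decision functions are positive, so the strict rules leave x unclassified, whereas the comparison rule assigns it to Class 1.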