Chapter 2
Two-Class Support Vector Machines

In training a classifier, we usually try to maximize the classification performance for the training data. But if the classifier fits the training data too closely, the classification ability for unknown data, i.e., the generalization ability, is degraded. This phenomenon is called overfitting; namely, there is a trade-off between the generalization ability and fitting to the training data. Various methods have been proposed to prevent overfitting [1, pp. 11–15], [2, pp. 86–91], [3].¹

¹ One of the main ideas is, like support vector machines, to add a regularization term, which controls the generalization ability, to the objective function.

For a two-class problem, a support vector machine is trained so that the direct decision function maximizes the generalization ability [4, pp. 127–151], [5, pp. 92–129]. Namely, the m-dimensional input space x is mapped into the l-dimensional (l ≥ m) feature space z. Then in z the quadratic programming problem is solved to separate the two classes by the optimal separating hyperplane.

In this chapter we discuss support vector machines for two-class problems. First, we discuss hard-margin support vector machines, in which the training data are linearly separable in the input space. Then we extend hard-margin support vector machines to the case where the training data are not linearly separable, mapping the input space into a high-dimensional feature space to enhance linear separability. The characteristics of support vector machines are then studied theoretically and by computer experiments.

2.1 Hard-Margin Support Vector Machines

Let the M m-dimensional training inputs x_i (i = 1, ..., M) belong to Class 1 or 2 and the associated labels be y_i = 1 for Class 1 and y_i = −1 for Class 2. If these data are linearly separable, we can determine the decision function
$$D(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b, \qquad (2.1)$$

where w is an m-dimensional vector, b is a bias term, and for i = 1, ..., M

$$\mathbf{w}^{\top}\mathbf{x}_i + b \;\begin{cases} > 0 & \text{for } y_i = 1, \\ < 0 & \text{for } y_i = -1. \end{cases} \qquad (2.2)$$

Because the training data are linearly separable, no training data satisfy $\mathbf{w}^{\top}\mathbf{x} + b = 0$. Thus, to control separability, instead of (2.2), we consider the following inequalities:

$$\mathbf{w}^{\top}\mathbf{x}_i + b \;\begin{cases} \ge 1 & \text{for } y_i = 1, \\ \le -1 & \text{for } y_i = -1. \end{cases} \qquad (2.3)$$

Here, 1 and −1 on the right-hand sides of the inequalities can be constants a (> 0) and −a, respectively. But by dividing both sides of the inequalities by a, (2.3) is obtained. Equation (2.3) is equivalent to

$$y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 \quad \text{for } i = 1, \ldots, M. \qquad (2.4)$$

The hyperplane

$$D(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b = c \quad \text{for } -1 < c < 1 \qquad (2.5)$$

forms a separating hyperplane that separates x_i (i = 1, ..., M). When c = 0, the separating hyperplane is in the middle of the two hyperplanes with c = 1 and c = −1. The distance between the separating hyperplane and the training data sample nearest to the hyperplane is called the margin. Assuming that the hyperplanes D(x) = 1 and D(x) = −1 include at least one training data sample, the hyperplane D(x) = 0 has the maximum margin for −1 < c < 1. The region {x | −1 ≤ D(x) ≤ 1} is the generalization region for the decision function.

Figure 2.1 shows two decision functions that satisfy (2.4). Thus there are an infinite number of decision functions that satisfy (2.4), which are separating hyperplanes. The generalization ability depends on the location of the separating hyperplane, and the hyperplane with the maximum margin is called the optimal separating hyperplane (see Fig. 2.1). Assume that no outliers are included in the training data and that unknown test data will obey the same distribution as that of the training data. Then it is intuitively clear that the generalization ability is maximized if the optimal separating hyperplane is selected as the separating hyperplane.

[Fig. 2.1 Optimal separating hyperplane in a two-dimensional space]
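As a concrete illustration of constraint (2.4) and of the fact that many separating hyperplanes exist but differ in margin, here is a minimal numerical sketch. The toy data set, the two candidate (w, b) pairs, and the helper name margin_if_separating are assumptions made only for this example; the distance $|D(\mathbf{x})|/\|\mathbf{w}\|$ used to measure the margin is the one derived in the next paragraph.

```python
import numpy as np

# Toy linearly separable training data (illustrative values only).
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.5],       # Class 1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, 0.0]])  # Class 2
y = np.array([1, 1, 1, -1, -1, -1])

def margin_if_separating(w, b):
    """Return the margin of w^T x + b = 0 if constraint (2.4) holds
    for all training data, and None otherwise."""
    D = X @ w + b                      # decision values D(x_i)
    if np.all(y * D >= 1.0):           # y_i (w^T x_i + b) >= 1
        # Distance of the nearest sample: min_i |D(x_i)| / ||w||.
        return np.min(np.abs(D)) / np.linalg.norm(w)
    return None

# Two candidate hyperplanes that both satisfy (2.4) but have different
# margins; the optimal separating hyperplane maximizes this value.
for w, b in [(np.array([1.0, 1.0]), 0.0),
             (np.array([2.0, 0.5]), 0.0)]:
    print("w =", w, "b =", b, "margin =", margin_if_separating(w, b))
```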
Now consider determining the optimal separating hyperplane. The Euclidean distance from a training data sample x to the separating hyperplane is given by $|D(\mathbf{x})|/\|\mathbf{w}\|$. This can be shown as follows. Because the vector w is orthogonal to the separating hyperplane, the line that goes through x and that is orthogonal to the hyperplane is given by $a\,\mathbf{w}/\|\mathbf{w}\| + \mathbf{x}$, where |a| is the Euclidean distance from x to the hyperplane. It crosses the hyperplane at the point where

$$D\!\left(\frac{a\,\mathbf{w}}{\|\mathbf{w}\|} + \mathbf{x}\right) = 0 \qquad (2.6)$$

is satisfied. Solving (2.6) for a, we obtain $a = -D(\mathbf{x})/\|\mathbf{w}\|$. Then all the training data must satisfy

$$\frac{y_k\,D(\mathbf{x}_k)}{\|\mathbf{w}\|} \ge \delta \quad \text{for } k = 1, \ldots, M, \qquad (2.7)$$

where δ is the margin.

Now if (w, b) is a solution, (aw, ab) is also a solution, where a is a scalar. Thus we impose the following constraint:

$$\delta\,\|\mathbf{w}\| = 1. \qquad (2.8)$$

From (2.7) and (2.8), to find the optimal separating hyperplane, we need to find w with the minimum Euclidean norm that satisfies (2.4). Therefore, the optimal separating hyperplane can be obtained by solving the following minimization problem for w and b:

$$\text{minimize} \quad Q(\mathbf{w}, b) = \tfrac{1}{2}\,\|\mathbf{w}\|^2 \qquad (2.9)$$

$$\text{subject to} \quad y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 \quad \text{for } i = 1, \ldots, M. \qquad (2.10)$$

Here, the square of the Euclidean norm $\|\mathbf{w}\|$ in (2.9) is to make the optimization problem quadratic programming.
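To show that (2.9)–(2.10) is an ordinary quadratic program, the sketch below solves it for a small linearly separable data set using the cvxpy modeling library; the data values and the choice of cvxpy are assumptions for illustration, not something prescribed by the text.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable training data (illustrative values only).
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.5],       # Class 1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, 0.0]])  # Class 2
y = np.array([1, 1, 1, -1, -1, -1])

M, m = X.shape
w = cp.Variable(m)   # weight vector
b = cp.Variable()    # bias term

# minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1, i = 1, ..., M
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin delta = 1 / ||w|| =", 1.0 / np.linalg.norm(w.value))
```

Any quadratic programming solver would serve equally well here; the printed margin equals $1/\|\mathbf{w}\|$, consistent with the normalization (2.8).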
The assumption of linear separability