The orange juice data are used to estimate the level of saccharose in orange juice from its observed near-infrared spectra [45]. The abalone data set predicts the age of abalone from physical measurements [36]. The Boston 5 and Boston 14 data sets [46, 47] use the 5th and 14th input variables of the Boston data set as outputs, respectively. The fifth variable is the NOX (nitric oxide) concentration and the 14th variable is the house price in the Boston area. For these data sets, only training data are provided.

Techniques for analyzing biological responses to chemical structures are called quantitative structure–activity relationships (QSARs). The Pyrimidines [36], Triazines [36], and Phenetylamines [48] data sets are well-known QSAR data sets. For these data sets, only the training data are given.

Table 1.4 Benchmark data specification for function approximation

Data                                 Inputs   Training data   Test data
Mackey–Glass                              4             500         500
Water purification (stationary)          10             241         237
Water purification (nonstationary)       10              45          40
Orange juice                            700             150          68
Abalone                                   8           4,177           —
Boston 5                                 13             506           —
Boston 14                                13             506           —
Pyrimidines                              27              74           —
Triazines                                60             186           —
Phenetylamines                          628              22           —

1.4 Classifier Evaluation

In developing a classifier for a given problem, we repeat the steps of determining input variables, namely features, gathering input–output pairs according to the determined features, training the classifier, and evaluating classifier performance. In training the classifier, special care must be taken so that no information on the test data set is used for training the classifier.2

2 It is my regret that I could not reevaluate the computer experiments, included in the book, that violate this rule.
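As an illustration of how test-set information can leak into training, consider feature scaling, which is not discussed in the text above but is a common case: if the scaling parameters are computed from the training and test data together, the classifier indirectly uses test-set statistics. The following minimal Python sketch (the data and variable names are hypothetical, added here only for illustration) shows the safe pattern of estimating preprocessing parameters from the training data alone.

```python
import numpy as np

# Hypothetical feature matrices; in practice these come from the gathered
# input-output pairs and from the test set reserved strictly for evaluation.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 5))
x_test = rng.normal(size=(50, 5))

# Correct: scaling parameters are estimated from the training data only ...
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train_scaled = (x_train - mean) / std
# ... and are merely applied, unchanged, to the test data at evaluation time.
x_test_scaled = (x_test - mean) / std

# Incorrect (test-set information leaks into training): statistics computed
# over the training and test data combined, e.g.,
# mean = np.vstack([x_train, x_test]).mean(axis=0)
```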
Assume that a classifier for an n-class problem is tested using M data samples. To evaluate the classifier for a test data set, we generate an n × n confusion matrix A, whose element a_{ij} is the number of class i data classified into class j. Then the recognition rate R, or recognition accuracy, in % is calculated by

R = \frac{\sum_{i=1}^{n} a_{ii}}{\sum_{i,j=1}^{n} a_{ij}} \times 100\ (\%).   (1.23)

Conversely, the error rate E is defined by

E = \frac{\sum_{i \neq j,\; i,j=1}^{n} a_{ij}}{\sum_{i,j=1}^{n} a_{ij}} \times 100\ (\%),   (1.24)

where R + E = 100%, assuming that there are no unclassified data. The recognition rate (error rate) gives the overall performance of a classifier and is used to compare classifiers. To improve reliability in comparing classifiers, we prepare several training data sets and their associated test data sets and check whether there is a statistical difference in the mean recognition rates and their standard deviations of the classifiers.

There may be cases where there are several classification problems, each with a single training data set and a single test data set. In such a situation it is not appropriate simply to compare the average recognition rates of the classifiers, because a difference of 1% for a difficult problem is treated equally with that for an easy problem. For discussions on how to statistically compare classifiers in such a situation, see [49, 50].

In diagnosis problems with negative (normal) and positive (abnormal) classes, data samples for the negative class are easily obtained but those for the positive class are difficult to obtain. In such problems with imbalanced training data, misclassification of positive data into the negative class is more fatal than misclassification of negative data into the positive class. The confusion matrix for this problem is as shown in Table 1.5, where TP (true positive) is the number of correctly classified positive data, FN (false negative) is the number of misclassified positive data, FP (false positive) is the number of misclassified negative data, and TN (true negative) is the number of correctly classified negative data.

Table 1.5 Confusion matrix

                    Assigned positive   Assigned negative
Actual positive            TP                  FN
Actual negative            FP                  TN
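As a concrete illustration (not part of the original text), the following Python sketch builds the confusion matrix from arrays of true and predicted class labels and computes the recognition and error rates of (1.23) and (1.24); the function and variable names are assumptions chosen for readability.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """a[i, j] = number of class-i samples classified into class j."""
    a = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        a[t, p] += 1
    return a

def recognition_and_error_rate(a):
    """Recognition rate (1.23) and error rate (1.24) in %,
    assuming there are no unclassified data."""
    total = a.sum()
    r = 100.0 * np.trace(a) / total            # diagonal: correctly classified
    e = 100.0 * (total - np.trace(a)) / total  # off-diagonal: misclassified
    return r, e

# Small three-class example with hypothetical labels.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])
A = confusion_matrix(y_true, y_pred, n_classes=3)
R, E = recognition_and_error_rate(A)  # R + E = 100
print(A, R, E)
```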
The widely used measures to evaluate classifier performance for diagnosis problems are the precision–recall and ROC (receiver operating characteristic) curves. Precision is defined by

Precision = \frac{TP}{TP + FP}   (1.25)

and recall is defined by

Recall = \frac{TP}{TP + FN}.   (1.26)

The precision–recall curve is plotted with precision on the y-axis and recall on the x-axis. A classifier with precision and recall values near 1 is preferable. The ROC curve is plotted with the true-positive rate defined by

True-positive rate = \frac{TP}{TP + FN}   (1.27)

on the y-axis and the false-positive rate defined by

False-positive rate = \frac{FP}{FP + TN}   (1.28)

on the x-axis. Recall is equivalent to the true-positive rate. The precision–recall and ROC curves are plotted by changing some parameter value of the classifier. The precision–recall curve is better suited than the ROC curve for heavily unbalanced data. The relations between the two types of curves are discussed in [51].

To see the difference between the measures, we used the thyroid data set shown in Table 1.3. We generated a two-class data set by deleting the data samples belonging to Class 2. We trained a support vector machine with RBF kernels with γ = 1 for different values of the margin parameter C. Tables 1.6 and 1.7 show the confusion matrices for C = 100 and C = 2,000, respectively, and Table 1.8 lists the recognition rate, precision, recall, and false-positive rate in % for the test data set. Although the recognition rate for C = 100 is higher, the recall value for C = 100 is smaller. The reverse is true for the precision values, which is less fatal. Therefore, for diagnosis problems, it is better to select the classifier with the higher recall value. Because the number of samples for the negative class is extremely large, comparison of the false-positive rates becomes meaningless.

Table 1.6 Confusion matrix for C = 100

                    Assigned positive   Assigned negative
Actual positive            56                  17
Actual negative            12               3,166

Table 1.7 Confusion matrix for C = 2,000

                    Assigned positive   Assigned negative
Actual positive            63                  10
Actual negative            20               3,158

Table 1.8 Classification performance for the thyroid data set (in %)

C        R. rate   Precision   Recall   False-positive rate
100       99.11      82.35     76.71           0.38
2,000     99.08      75.90     86.30           0.63
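The following short Python sketch, added here as an illustration and not part of the original book, recomputes the entries of Table 1.8 from the confusion matrices in Tables 1.6 and 1.7 using (1.23) and (1.25)–(1.28).

```python
def diagnosis_measures(tp, fn, fp, tn):
    """Recognition rate, precision, recall, and false-positive rate in %."""
    total = tp + fn + fp + tn
    return {
        "recognition rate": 100.0 * (tp + tn) / total,     # (1.23)
        "precision": 100.0 * tp / (tp + fp),               # (1.25)
        "recall": 100.0 * tp / (tp + fn),                  # (1.26), = true-positive rate (1.27)
        "false-positive rate": 100.0 * fp / (fp + tn),     # (1.28)
    }

# Table 1.6 (C = 100) and Table 1.7 (C = 2,000) for the thyroid data set.
print(diagnosis_measures(tp=56, fn=17, fp=12, tn=3166))
# -> about 99.11, 82.35, 76.71, 0.38, the first row of Table 1.8
print(diagnosis_measures(tp=63, fn=10, fp=20, tn=3158))
# -> about 99.08, 75.90, 86.30, 0.63, the second row of Table 1.8
```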
The often-used performance evaluation measures for regression problems are the mean average error (MAE), the root-mean-square error (RMSE), and the normalized root-mean-square error (NRMSE). Let the input–output pairs be (x_i, y_i) (i = 1, ..., M) and the regression function be f(x). Then the MAE, RMSE, and NRMSE are given, respectively, by

MAE = \frac{1}{M} \sum_{i=1}^{M} |y_i - f(x_i)|,   (1.29)

RMSE = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (y_i - f(x_i))^2},   (1.30)

NRMSE = \frac{1}{\sigma} \sqrt{\frac{1}{M} \sum_{i=1}^{M} (y_i - f(x_i))^2},   (1.31)

where σ is the standard deviation of the observed data samples.
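The following Python sketch, added for illustration with hypothetical sample data, computes the three regression measures of (1.29)–(1.31) from observed outputs and the predictions of a fitted regression function f.

```python
import numpy as np

def regression_errors(y, y_hat):
    """MAE (1.29), RMSE (1.30), and NRMSE (1.31) for observed targets y
    and predictions y_hat = f(x)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    mae = np.mean(np.abs(y - y_hat))
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    nrmse = rmse / np.std(y)  # normalized by the standard deviation of the observed data
    return mae, rmse, nrmse

# Hypothetical observed outputs and regression predictions.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.3]
print(regression_errors(y, y_hat))
```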
References

1. K. Fukunaga. Introduction to Statistical Pattern Recognition, Second Edition. Academic Press, San Diego, 1990.
2. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
3. S. Abe. Neural Networks and Fuzzy Systems: Theory and Applications. Kluwer Academic Publishers, Norwell, MA, 1997.
4. S. Haykin. Neural Networks: A Comprehensive Foundation, Second Edition. Prentice Hall, Upper Saddle River, NJ, 1999.
5. J. C. Bezdek, J. Keller, R. Krisnapuram, and N. R. Pal. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Kluwer Academic Publishers, Norwell, MA, 1999.
6. S. K. Pal and S. Mitra. Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing. John Wiley & Sons, New York, 1999.
7. S. Abe. Pattern Classification: Neuro-Fuzzy Methods and Their Comparison. Springer-Verlag, London, 2001.
8. V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
9. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, 2000.
10. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
11. U. H.-G. Kreßel. Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 255–268. MIT Press, Cambridge, MA, 1999.
12. T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
13. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.
14. J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, London, UK, 1998.
15. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 1999), pages 219–224, Bruges, Belgium, 1999.
16. K. P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 307–326. MIT Press, Cambridge, MA, 1999.
17. E. J. Bredensteiner and K. P. Bennett. Multicategory classification by support vector machines. Computational Optimization and Applications, 12(1–3):53–79, 1999.
18. Y. Guermeur, A. Elisseeff, and H. Paugam-Moisy. A new multi-class SVM based on a uniform convergence result. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), volume 4, pages 183–188, Como, Italy, 2000.
19. C. Angulo, X. Parra, and A. Català. An [sic] unified framework for ‘all data at once’ multi-class support vector machines. In Proceedings of the Tenth European Symposium on Artificial Neural Networks (ESANN 2002), pages 161–166, Bruges, Belgium, 2002.
20. D. Anguita, S. Ridella, and D. Sterpi. A new method for multiclass support vector machines. In Proceedings of International Joint Conference on Neural Networks (IJCNN 2004), volume 1, pages 407–412, Budapest, Hungary, 2004.
21. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.
22. G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
23. Intelligent Data Analysis Group. http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.
24. N. Pochet, F. De Smet, J. A. K. Suykens, and B. L. R. De Moor. http://homes.esat.kuleuven.be/npochet/bioinformatics/.
25. I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, B. Wilfond, G. Sauter, O.-P. Kallioniemi, A. Borg, and J. Trent. Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine, 344(8):539–548, 2001.
26. L. J. van’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.
27. U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12):6745–6750, 1999.
28. N. Iizuka, M. Oka, H. Yamada-Okabe, M. Nishida, Y. Maeda, N. Mori, T. Takao, T. Tamesa, A. Tangoku, H. Tabuchi, K. Hamada, H. Nakayama, H. Ishitsuka, T. Miyamoto, A. Hirabayashi, S. Uchimura, and Y. Hamamoto. Oligonucleotide