An example of class boundaries is shown in Fig. 1.6. Unlike one-against-all and pairwise formulations, there is no unclassifiable region.

Fig. 1.6 Class boundaries by all-at-once formulation (in the (x1, x2) plane, the regions where g1(x) > g2(x), g2(x) > g3(x), and g1(x) > g3(x) delineate Classes 1, 2, and 3)

1.2 Determination of Decision Functions

Determination of decision functions using input–output pairs is called training. In training a multilayer neural network for a two-class problem, we can determine a direct decision function if we set one output neuron instead of two. But because for an n-class problem we set n output neurons, with the ith neuron corresponding to the class i decision function, the obtained functions are indirect. Similarly, decision functions for fuzzy classifiers are indirect because membership functions are defined for each class.

Conventional training methods determine the indirect decision functions so that each training input is correctly classified into the class designated by the associated training output. Figure 1.7 shows an example of the decision functions obtained when the training data of two classes do not overlap. Assuming that the circles and squares are training data for Classes 1 and 2, respectively, even if the decision function g2(x) moves to the right as shown in the dotted curve, the training data are still correctly classified. Thus there are infinite possibilities of the positions of the decision functions that correctly classify
the training data. Although the generalization ability is directly affected by the positions, conventional training methods do not consider this.

Fig. 1.7 Class boundary when classes do not overlap (circles: Class 1; squares: Class 2; curves g1(x) = 0 and g2(x) = 0 in the (x1, x2) plane)

In a support vector machine, the direct decision function that maximizes the generalization ability is determined for a two-class problem. Assuming that the training data of different classes do not overlap, the decision function is determined so that the distance from the training data is maximized. We call this the optimal decision function. Because it is difficult to determine a nonlinear decision function, the original input space is mapped into a high-dimensional space called the feature space. And in the feature space, the optimal decision function, namely, the optimal hyperplane, is determined.

Support vector machines outperform conventional classifiers, especially when the number of training data is small and the number of input variables is large. This is because conventional classifiers do not have a mechanism to maximize the margins of class boundaries. Therefore, if we introduce some mechanism to maximize margins, the generalization ability is improved.

1.3 Data Sets Used in the Book

In this book we evaluate methods for pattern classification and function approximation using some benchmark data sets so that advantages and disadvantages of these methods are clarified. In the following we explain these data sets.

Table 1.1 lists the data sets for two-class classification problems [21–23]. For each problem the table lists the numbers of inputs, training data, test data, and data sets. Each problem has 100 or 20 training data sets and their corresponding test data sets and is used to compare statistical differences among some classifiers.

Table 1.1 Benchmark data sets for two-class problems

  Data           Inputs  Training data  Test data  Sets
  Banana              2            400      4,900   100
  Breast cancer       9            200         77   100
  Diabetes            8            468        300   100
  Flare-solar         9            666        400   100
  German             20            700        300   100
  Heart              13            170        100   100
  Image              18          1,300      1,010    20
  Ringnorm           20            400      7,000   100
  Splice             60          1,000      2,175    20
  Thyroid             5            140         75   100
  Titanic             3            150      2,051   100
  Twonorm            20            400      7,000   100
  Waveform           21            400      4,600   100

Pattern classification technology has been applied to DNA microarray data, which provide expression levels of thousands of genes, to classify cancerous/non-cancerous patients. Microarray data are characterized by a large number of input variables but a small number of training/test data. Thus the classification problems are linearly separable and overfitting occurs quite easily. Therefore, usually, feature selection or extraction is performed to improve the generalization ability. Table 1.2 lists the data sets [24] used in this book. For each problem there is one training data set and one test data set.

Table 1.3 shows the data sets for multiclass problems. Each problem has one training data set and the associated test data set.

The Fisher iris data [32, 33] are widely used for evaluating classification performance of classifiers. They consist of 150 data with four features and three classes; there are 50 data per class. We used the first 25 data of each class as the training data and the remaining 25 data of each class as the test data.

The numeral data [34] were collected to identify Japanese license plates of running cars.
They include numerals, hiragana, and kanji characters. The original image taken from a TV camera was preprocessed, and each numeral was transformed into 12 features, such as the number of holes and the curvature of a numeral at some point.
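Each two-class problem in Table 1.1 comes with 100 (or 20) pre-defined training/test set pairs, so classifiers are compared by accuracy averaged over those splits. The following minimal sketch shows this evaluation loop; the nearest-class-mean classifier and the toy data are hypothetical stand-ins, not the book's classifiers or data.

```python
import numpy as np

def evaluate_over_splits(train_and_test, splits):
    """Mean and standard deviation of accuracy over pre-defined splits.

    `splits` holds (X_train, y_train, X_test, y_test) tuples, one per
    training/test set pair (100 or 20 per problem in Table 1.1).
    """
    accs = np.array([train_and_test(*s) for s in splits])
    return accs.mean(), accs.std()

def nearest_mean(X_tr, y_tr, X_te, y_te):
    # Stand-in classifier: assign each test point to the class
    # whose training-data mean is closest.
    classes = np.unique(y_tr)
    means = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_te[:, None, :] - means[None, :, :], axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return float(np.mean(pred == y_te))

# Two tiny, well-separated synthetic splits (accuracy should be 1.0).
def toy_split():
    X = np.array([[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 10.0]])
    y = np.array([0, 0, 1, 1])
    return X, y, X, y

mean_acc, std_acc = evaluate_over_splits(nearest_mean, [toy_split(), toy_split()])
print(mean_acc, std_acc)  # 1.0 0.0
```

With the real benchmarks, the per-split accuracies would also feed a statistical test when comparing two classifiers.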
Table 1.2 Benchmark data sets for microarray problems

  Data                           Inputs  Training data  Test data  Classes
  Breast cancer (1) [25]          3,226             14          8        2
  Breast cancer (2) [25]          3,226             14          8        2
  Breast cancer (3) [26]         24,188             78         19        2
  Breast cancer (s) [25]          3,226             14          8        2
  Colon cancer [27]               2,000             40         20        2
  Hepatocellular carcinoma [28]   7,129             33         27        2
  High-grade glioma [29]         12,625             21         29        2
  Leukemia [30]                   7,129             38         34        2
  Prostate cancer [31]           12,600            102         34        2

The thyroid data [35, 36] include 15 digital features, and more than 92% of the data belong to one class. Thus a recognition rate lower than 92% is useless.

The blood cell classification [37] involves classifying optically screened white blood cells into 12 classes using 13 features. This is a very difficult problem; class boundaries for some classes are ambiguous because the classes are defined according to the growth stages of white blood cells.

The hiragana-50 and hiragana-105 data [38, 7] were gathered from Japanese license plates. The original grayscale images of hiragana characters were transformed into (5 × 10)-pixel and (7 × 15)-pixel images, respectively, with the grayscale range being from 0 to 255. Then, by performing grayscale shift, position shift, and random noise addition to the images, the training and test data were generated. Then, to reduce the number of input variables of the hiragana-105 data (7 × 15 = 105), the hiragana-13 data [38, 7] were generated by calculating the 13 central moments for the (7 × 15)-pixel images [39, 38].

Satimage data [36] have 36 inputs, 3 × 3 pixels each with four spectral values in a satellite image, and are to classify the center pixel into one of six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil.

USPS data [40] are handwritten numerals in (16 × 16)-pixel grayscale images. They are scanned from envelopes by the United States Postal Service.
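The central moments behind the hiragana-13 features above can be computed as in the following sketch. It is generic: mu_pq = sum over pixels of (x − xbar)^p (y − ybar)^q I(x, y) about the intensity centroid; which 13 moments were actually selected for the hiragana-13 data is not specified here, and the test image is hypothetical.

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of a grayscale image about its centroid.

    mu_pq = sum_x sum_y (x - xbar)^p (y - ybar)^q I(x, y),
    where (xbar, ybar) is the intensity centroid of the image.
    """
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]          # pixel coordinate grids
    total = img.sum()
    xbar = (x * img).sum() / total
    ybar = (y * img).sum() / total
    return ((x - xbar) ** p * (y - ybar) ** q * img).sum()

# Hypothetical (7 x 15)-pixel image with a single bright pixel.
img = np.zeros((15, 7))
img[4, 2] = 255.0

# mu_00 is the total intensity; mu_10 and mu_01 vanish by construction,
# since the centroid coincides with the only nonzero pixel.
print(central_moment(img, 0, 0))  # 255.0
print(central_moment(img, 1, 0))  # 0.0
print(central_moment(img, 0, 1))  # 0.0
```

Because the moments are taken about the centroid, they are invariant to the position shifts mentioned in the data-generation procedure.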
The MNIST data [41, 42] are handwritten numerals consisting of (28 × 28)-pixel inputs with 256 grayscale levels; they are often used to compare performance of support vector machines and other classifiers.

Table 1.4 lists the data sets for function approximation used in the book. For all the problems in the table, the number of outputs is 1.

The Mackey–Glass differential equation [43] generates time series data with a chaotic behavior and is given by

  dx(t)/dt = 0.2 x(t − τ) / (1 + x^10(t − τ)) − 0.1 x(t),   (1.22)
Table 1.3 Benchmark data specification for multiclass problems

  Data          Inputs  Classes  Training data  Test data
  Iris               4        3             75         75
  Numeral           12       10            810        820
  Thyroid           21        3          3,772      3,428
  Blood cell        13       12          3,097      3,100
  Hiragana-50       50       39          4,610      4,610
  Hiragana-105     105       38          8,375      8,356
  Hiragana-13       13       38          8,375      8,356
  Satimage          36        6          4,435      2,000
  USPS             256       10          7,291      2,007
  MNIST            784       10         60,000     10,000

where t and τ denote time and time delay, respectively. By integrating (1.22), we can obtain the time series data x(0), x(1), x(2), ..., x(t), .... Using x prior to time t, we predict x after time t. Setting τ = 17 and using the four inputs x(t − 18), x(t − 12), x(t − 6), x(t), we estimate x(t + 6). Of the time series data x(118), ..., x(1117), the first 500 are used to train function approximators, and the remaining 500 are used to test performance. This data set is often used as benchmark data for function approximation, and the normalized root-mean-square error (NRMSE), i.e., the root-mean-square error divided by the standard deviation of the time series data, is used to measure the performance.

In a water purification plant, to eliminate small particles floating in the water taken from a river, coagulant is added and the water is stirred while these small particles begin sticking to each other. As more particles stick together, they form flocs, which fall to the bottom of a holding tank. Potable water is obtained by removing the precipitated flocs and adding chlorine. Careful control of the coagulant injection is very important to obtain high-quality water. Usually an operator determines the amount of coagulant needed according to an analysis of the water qualities, observation of floc formation, and prior experience.
To automate this operation, as inputs for water quality, (1) turbidity, (2) temperature, (3) alkalinity, (4) pH, and (5) flow rate were used, and to replace the operator's observation of floc properties by image processing, (1) floc diameter, (2) number of flocs, (3) floc volume, (4) floc density, and (5) illumination intensity were used [44].

The 563 input–output data, which were gathered over a 1-year period, were divided into 478 stationary data and 95 nonstationary data according to whether turbidity values were smaller or larger than a specified value. Then each type of data was further divided into two groups to form a training data set and a test data set; the division was done in such a way that both sets had similar distributions in the output space.
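One simple way to divide data so that both groups have similar output distributions is to sort the samples by output value and assign them alternately. This is only a plausible stand-in scheme: the actual division procedure used in [44] is not specified in detail, and the data below are hypothetical.

```python
def split_by_output(data):
    """Split (x, y) pairs into two groups with similar output coverage.

    Sort by the output y and alternate assignment, so both groups
    span the output range evenly. (A stand-in sketch; the actual
    scheme used in [44] is not specified in detail.)
    """
    ordered = sorted(data, key=lambda xy: xy[1])
    return ordered[0::2], ordered[1::2]

# Hypothetical data: 8 samples with outputs 0..7 in scrambled order.
data = [([i], float(y)) for i, y in enumerate([3, 0, 7, 1, 5, 2, 6, 4])]
train, test = split_by_output(data)
print([y for _, y in train])  # [0.0, 2.0, 4.0, 6.0]
print([y for _, y in test])   # [1.0, 3.0, 5.0, 7.0]
```

The interleaved output values show that neither group is biased toward high or low outputs, mirroring the "similar distributions in the output space" requirement.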