An example of class boundaries is shown in Fig. 1.6. Unlike one-against-all and pairwise formulations, there is no unclassifiable region.

Fig. 1.6 Class boundaries by all-at-once formulation (in the (x1, x2) plane, the regions where g1(x) > g2(x), g2(x) > g3(x), and g1(x) > g3(x) delineate Classes 1, 2, and 3)

1.2 Determination of Decision Functions

Determination of decision functions using input–output pairs is called training. In training a multilayer neural network for a two-class problem, we can determine a direct decision function if we set one output neuron instead of two. But because for an n-class problem we set n output neurons, with the ith neuron corresponding to the class i decision function, the obtained functions are indirect. Similarly, decision functions for fuzzy classifiers are indirect because membership functions are defined for each class.

Conventional training methods determine the indirect decision functions so that each training input is correctly classified into the class designated by the associated training output. Figure 1.7 shows an example of the decision functions obtained when the training data of two classes do not overlap. Assuming that the circles and squares are training data for Classes 1 and 2, respectively, even if the decision function g2(x) moves to the right as shown in the dotted curve, the training data are still correctly classified. Thus there are infinite possibilities of the positions of the decision functions that correctly classify
the training data. Although the generalization ability is directly affected by the positions, conventional training methods do not consider this.

Fig. 1.7 Class boundary when classes do not overlap (circles: Class 1; squares: Class 2; curves g1(x) = 0 and g2(x) = 0 in the (x1, x2) plane)

In a support vector machine, the direct decision function that maximizes the generalization ability is determined for a two-class problem. Assuming that the training data of different classes do not overlap, the decision function is determined so that the distance from the training data is maximized. We call this the optimal decision function. Because it is difficult to determine a nonlinear decision function, the original input space is mapped into a high-dimensional space called the feature space. And in the feature space, the optimal decision function, namely, the optimal hyperplane, is determined.

Support vector machines outperform conventional classifiers, especially when the number of training data is small and the number of input variables is large. This is because conventional classifiers do not have a mechanism to maximize the margins of class boundaries. Therefore, if we introduce some mechanism to maximize margins, the generalization ability is improved.

1.3 Data Sets Used in the Book

In this book we evaluate methods for pattern classification and function approximation using some benchmark data sets so that advantages and disadvantages of these methods are clarified. In the following we explain these data sets.

Table 1.1 lists the data sets for two-class classification problems [21–23]. For each problem the table lists the numbers of inputs, training data, test data, and data sets. Each problem has 100 or 20 training data sets and their corresponding test data sets and is used to compare statistical differences among some classifiers.

Table 1.1 Benchmark data sets for two-class problems

  Data           Inputs  Training data  Test data  Sets
  Banana              2            400      4,900   100
  Breast cancer       9            200         77   100
  Diabetes            8            468        300   100
  Flare-solar         9            666        400   100
  German             20            700        300   100
  Heart              13            170        100   100
  Image              18          1,300      1,010    20
  Ringnorm           20            400      7,000   100
  Splice             60          1,000      2,175    20
  Thyroid             5            140         75   100
  Titanic             3            150      2,051   100
  Twonorm            20            400      7,000   100
  Waveform           21            400      4,600   100

Pattern classification technology has been applied to DNA microarray data, which provide expression levels of thousands of genes, to classify cancerous/non-cancerous patients. Microarray data are characterized by a large number of input variables but a small number of training/test data. Thus the classification problems are linearly separable and overfitting occurs quite easily. Therefore, usually, feature selection or extraction is performed to improve the generalization ability. Table 1.2 lists the data sets [24] used in this book. For each problem there is one training data set and one test data set.

Table 1.3 shows the data sets for multiclass problems. Each problem has one training data set and the associated test data set.

The Fisher iris data [32, 33] are widely used for evaluating classification performance of classifiers. They consist of 150 data with four features and three classes; there are 50 data per class. We used the first 25 data of each class as the training data and the remaining 25 data of each class as the test data.

The numeral data [34] were collected to identify Japanese license plates of running cars.
They include numerals, hiragana, and kanji characters. The original image taken from a TV camera was preprocessed, and each numeral was transformed into 12 features, such as the number of holes and the curvature of a numeral at some point.
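Each two-class problem in Table 1.1 comes with 100 (or 20) pre-defined training/test set pairs, so classifiers are compared by accuracy averaged over those splits. The following minimal sketch shows this evaluation loop; the nearest-class-mean classifier and the toy data are hypothetical stand-ins, not the book's classifiers or data.

```python
import numpy as np

def evaluate_over_splits(train_and_test, splits):
    """Mean and standard deviation of accuracy over pre-defined splits.

    `splits` holds (X_train, y_train, X_test, y_test) tuples, one per
    training/test set pair (100 or 20 per problem in Table 1.1).
    """
    accs = np.array([train_and_test(*s) for s in splits])
    return accs.mean(), accs.std()

def nearest_mean(X_tr, y_tr, X_te, y_te):
    # Stand-in classifier: assign each test point to the class
    # whose training-data mean is closest.
    classes = np.unique(y_tr)
    means = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_te[:, None, :] - means[None, :, :], axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return float(np.mean(pred == y_te))

# Two tiny, well-separated synthetic splits (accuracy should be 1.0).
def toy_split():
    X = np.array([[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 10.0]])
    y = np.array([0, 0, 1, 1])
    return X, y, X, y

mean_acc, std_acc = evaluate_over_splits(nearest_mean, [toy_split(), toy_split()])
print(mean_acc, std_acc)  # 1.0 0.0
```

With the real benchmarks, the per-split accuracies would also feed a statistical test when comparing two classifiers.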
Table 1.2 Benchmark data sets for microarray problems

  Data                           Inputs  Training data  Test data  Classes
  Breast cancer (1) [25]          3,226             14          8        2
  Breast cancer (2) [25]          3,226             14          8        2
  Breast cancer (3) [26]         24,188             78         19        2
  Breast cancer (s) [25]          3,226             14          8        2
  Colon cancer [27]               2,000             40         20        2
  Hepatocellular carcinoma [28]   7,129             33         27        2
  High-grade glioma [29]         12,625             21         29        2
  Leukemia [30]                   7,129             38         34        2
  Prostate cancer [31]           12,600            102         34        2

The thyroid data [35, 36] include 15 digital features, and more than 92% of the data belong to one class. Thus a recognition rate lower than 92% is useless.

The blood cell classification [37] involves classifying optically screened white blood cells into 12 classes using 13 features. This is a very difficult problem; class boundaries for some classes are ambiguous because the classes are defined according to the growth stages of white blood cells.

The hiragana-50 and hiragana-105 data [38, 7] were gathered from Japanese license plates. The original grayscale images of hiragana characters were transformed into (5 × 10)-pixel and (7 × 15)-pixel images, respectively, with the grayscale range being from 0 to 255. Then, by performing grayscale shift, position shift, and random noise addition to the images, the training and test data were generated. Then, to reduce the number of input variables of the hiragana-105 data (7 × 15 = 105), the hiragana-13 data [38, 7] were generated by calculating the 13 central moments for the (7 × 15)-pixel images [39, 38].

Satimage data [36] have 36 inputs, 3 × 3 pixels each with four spectral values in a satellite image, and are to classify the center pixel into one of six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil.

USPS data [40] are handwritten numerals in (16 × 16)-pixel grayscale images. They are scanned from envelopes by the United States Postal Service.
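The central moments behind the hiragana-13 features above can be computed as in the following sketch. It is generic: mu_pq = sum over pixels of (x − xbar)^p (y − ybar)^q I(x, y) about the intensity centroid; which 13 moments were actually selected for the hiragana-13 data is not specified here, and the test image is hypothetical.

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of a grayscale image about its centroid.

    mu_pq = sum_x sum_y (x - xbar)^p (y - ybar)^q I(x, y),
    where (xbar, ybar) is the intensity centroid of the image.
    """
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]          # pixel coordinate grids
    total = img.sum()
    xbar = (x * img).sum() / total
    ybar = (y * img).sum() / total
    return ((x - xbar) ** p * (y - ybar) ** q * img).sum()

# Hypothetical (7 x 15)-pixel image with a single bright pixel.
img = np.zeros((15, 7))
img[4, 2] = 255.0

# mu_00 is the total intensity; mu_10 and mu_01 vanish by construction,
# since the centroid coincides with the only nonzero pixel.
print(central_moment(img, 0, 0))  # 255.0
print(central_moment(img, 1, 0))  # 0.0
print(central_moment(img, 0, 1))  # 0.0
```

Because the moments are taken about the centroid, they are invariant to the position shifts mentioned in the data-generation procedure.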
The MNIST data [41, 42] are handwritten numerals consisting of (28 × 28)-pixel inputs with 256 grayscale levels; they are often used to compare performance of support vector machines and other classifiers.

Table 1.4 lists the data sets for function approximation used in the book. For all the problems in the table, the number of outputs is 1.

The Mackey–Glass differential equation [43] generates time series data with a chaotic behavior and is given by

  dx(t)/dt = 0.2 x(t − τ) / (1 + x^10(t − τ)) − 0.1 x(t),   (1.22)
Table 1.3 Benchmark data specification for multiclass problems

  Data          Inputs  Classes  Training data  Test data
  Iris               4        3             75         75
  Numeral           12       10            810        820
  Thyroid           21        3          3,772      3,428
  Blood cell        13       12          3,097      3,100
  Hiragana-50       50       39          4,610      4,610
  Hiragana-105     105       38          8,375      8,356
  Hiragana-13       13       38          8,375      8,356
  Satimage          36        6          4,435      2,000
  USPS             256       10          7,291      2,007
  MNIST            784       10         60,000     10,000

where t and τ denote time and time delay, respectively. By integrating (1.22), we can obtain the time series data x(0), x(1), x(2), ..., x(t), .... Using x prior to time t, we predict x after time t. Setting τ = 17 and using the four inputs x(t − 18), x(t − 12), x(t − 6), x(t), we estimate x(t + 6). Of the time series data x(118), ..., x(1117), the first 500 are used to train function approximators, and the remaining 500 are used to test performance. This data set is often used as benchmark data for function approximation, and the normalized root-mean-square error (NRMSE), i.e., the root-mean-square error divided by the standard deviation of the time series data, is used to measure the performance.

In a water purification plant, to eliminate small particles floating in the water taken from a river, coagulant is added and the water is stirred while these small particles begin sticking to each other. As more particles stick together, they form flocs, which fall to the bottom of a holding tank. Potable water is obtained by removing the precipitated flocs and adding chlorine. Careful control of the coagulant injection is very important to obtain high-quality water. Usually an operator determines the amount of coagulant needed according to an analysis of the water qualities, observation of floc formation, and prior experience.
To automate this operation, as inputs for water quality, (1) turbidity, (2) temperature, (3) alkalinity, (4) pH, and (5) flow rate were used, and to replace the operator's observation of floc properties by image processing, (1) floc diameter, (2) number of flocs, (3) floc volume, (4) floc density, and (5) illumination intensity were used [44].

The 563 input–output data, which were gathered over a 1-year period, were divided into 478 stationary data and 95 nonstationary data according to whether turbidity values were smaller or larger than a specified value. Then each type of data was further divided into two groups to form a training data set and a test data set; the division was done in such a way that both sets had similar distributions in the output space.
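One simple way to divide data so that both groups have similar output distributions is to sort the samples by output value and assign them alternately. This is only a plausible stand-in scheme: the actual division procedure used in [44] is not specified in detail, and the data below are hypothetical.

```python
def split_by_output(data):
    """Split (x, y) pairs into two groups with similar output coverage.

    Sort by the output y and alternate assignment, so both groups
    span the output range evenly. (A stand-in sketch; the actual
    scheme used in [44] is not specified in detail.)
    """
    ordered = sorted(data, key=lambda xy: xy[1])
    return ordered[0::2], ordered[1::2]

# Hypothetical data: 8 samples with outputs 0..7 in scrambled order.
data = [([i], float(y)) for i, y in enumerate([3, 0, 7, 1, 5, 2, 6, 4])]
train, test = split_by_output(data)
print([y for _, y in train])  # [0.0, 2.0, 4.0, 6.0]
print([y for _, y in test])   # [1.0, 3.0, 5.0, 7.0]
```

The interleaved output values show that neither group is biased toward high or low outputs, mirroring the "similar distributions in the output space" requirement.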