Contents

1 Introduction  1
  1.1 Overview  1
  1.2 Classification  3
  1.3 Organization of the Book  6
  1.4 Exercises  6
  References  7

2 Classification  9
  2.1 The Classification Process  9
  2.2 Features  11
  2.3 Training and Learning  16
  2.4 Supervised Learning and Algorithm Selection  17
  2.5 Approaches to Classification  18
  2.6 Examples  21
    2.6.1 Classification by Shape  21
    2.6.2 Classification by Size  22
    2.6.3 More Examples  23
    2.6.4 Classification of Letters  25
  2.7 Exercises  25
  References  26

3 Nonmetric Methods  27
  3.1 Introduction  27
  3.2 Decision Tree Classifier  27
    3.2.1 Information, Entropy, and Impurity  29
    3.2.2 Information Gain  31
    3.2.3 Decision Tree Issues  35
    3.2.4 Strengths and Weaknesses  38
  3.3 Rule-Based Classifier  39
  3.4 Other Methods  39
  3.5 Exercises  40
  References  41
4 Statistical Pattern Recognition  43
  4.1 Measured Data and Measurement Errors  43
  4.2 Probability Theory  43
    4.2.1 Simple Probability Theory  43
    4.2.2 Conditional Probability and Bayes' Rule  46
    4.2.3 Naïve Bayes Classifier  53
  4.3 Continuous Random Variables  54
    4.3.1 The Multivariate Gaussian  57
    4.3.2 The Covariance Matrix  59
    4.3.3 The Mahalanobis Distance  69
  4.4 Exercises  72
  References  74

5 Supervised Learning  75
  5.1 Parametric and Non-parametric Learning  75
  5.2 Parametric Learning  75
    5.2.1 Bayesian Decision Theory  75
    5.2.2 Discriminant Functions and Decision Boundaries  87
    5.2.3 MAP (Maximum A Posteriori) Estimator  94
  5.3 Exercises  96
  References  98

6 Nonparametric Learning  99
  6.1 Histogram Estimator and Parzen Windows  99
  6.2 k-Nearest Neighbor (k-NN) Classification  100
  6.3 Artificial Neural Networks  104
  6.4 Kernel Machines  117
  6.5 Exercises  120
  References  121

7 Feature Extraction and Selection  123
  7.1 Reducing Dimensionality  123
    7.1.1 Preprocessing  124
  7.2 Feature Selection  124
    7.2.1 Inter/Intraclass Distance  124
    7.2.2 Subset Selection  126
  7.3 Feature Extraction  127
    7.3.1 Principal Component Analysis  127
    7.3.2 Linear Discriminant Analysis  135
  7.4 Exercises  140
  References  141
8 Unsupervised Learning  143
  8.1 Clustering  143
  8.2 k-Means Clustering  145
    8.2.1 Fuzzy c-Means Clustering  148
  8.3 (Agglomerative) Hierarchical Clustering  150
  8.4 Exercises  154
  References  155

9 Estimating and Comparing Classifiers  157
  9.1 Comparing Classifiers and the No Free Lunch Theorem  157
    9.1.1 Bias and Variance  159
  9.2 Cross-Validation and Resampling Methods  160
    9.2.1 The Holdout Method  161
    9.2.2 k-Fold Cross-Validation  162
    9.2.3 Bootstrap  163
  9.3 Measuring Classifier Performance  164
  9.4 Comparing Classifiers  169
    9.4.1 ROC Curves  169
    9.4.2 McNemar's Test  169
    9.4.3 Other Statistical Tests  169
    9.4.4 The Classification Toolbox  171
  9.5 Combining Classifiers  174
  References  176

10 Projects  177
  10.1 Retinal Tortuosity as an Indicator of Disease  177
  10.2 Segmentation by Texture  181
  10.3 Biometric Systems  183
    10.3.1 Fingerprint Recognition  184
    10.3.2 Face Recognition  187
  References  187

Index  189
Chapter 1
Introduction

1.1 Overview

Humans are good at recognizing objects (or patterns, to use the generic term). We are so good that we take this ability for granted, and find it difficult to analyze the steps in the process. It is generally easy to distinguish the sound of a human voice from that of a violin; a handwritten numeral "3" from an "8"; and the aroma of a rose from that of an onion. Every day, we recognize faces around us, but we do it unconsciously and, because we cannot explain our expertise, we find it difficult to write a computer program to do the same.

Each person's face is a pattern composed of a particular combination of structures (eyes, nose, mouth, ...) located in certain positions on the face. By analyzing sample images of faces, a program should be able to capture the pattern specific to a face and identify (or recognize) it as a face (as a member of a category or class we already know); this would be pattern recognition. There may be several categories (or classes) and we have to sort (or classify) a particular face into a certain category (or class); hence the term classification.

Note that in pattern recognition, the term pattern is interpreted widely and does not necessarily imply a repetition; it is used to include all objects that we might want to classify, e.g., apples (or oranges), speech waveforms, and fingerprints. A class is a collection of objects that are similar, but not necessarily identical, and which is distinguishable from other classes. Figure 1.1 illustrates the difference between classification where the classes are known beforehand and classification where classes are created after inspecting the objects.

Interest in pattern recognition and classification has grown due to emerging applications, which are not only challenging but also computationally demanding. These applications include:

• Data mining (sifting through a large volume of data to extract a small amount of relevant and useful information, e.g., fraud detection, financial forecasting, and credit scoring)
• Biometrics (personal identification based on physical attributes of the face, iris, fingerprints, etc.)
• Machine vision (e.g., automated visual inspection in an assembly line)
• Character recognition [e.g., automatic mail sorting by zip code, automated check scanners at ATMs (automated teller machines)]
• Document recognition (e.g., recognizing whether an e-mail is spam or not, based on the message header and content)
• Computer-aided diagnosis [e.g., helping doctors make diagnostic decisions based on interpreting medical data such as mammographic images, ultrasound images, electrocardiograms (ECGs), and electroencephalograms (EEGs)]
• Medical imaging [e.g., classifying cells as malignant or benign based on the results of magnetic resonance imaging (MRI) scans, or classifying different emotional and cognitive states from images of brain activity in functional MRI]
• Speech recognition (e.g., helping handicapped patients to control machines)
• Bioinformatics (e.g., DNA sequence analysis to detect genes related to particular diseases)
• Remote sensing (e.g., land use and crop yield)
• Astronomy (classifying galaxies based on their shapes, or automated searches such as the Search for Extra-Terrestrial Intelligence (SETI), which analyzes radio telescope data in an attempt to locate signals that might be artificial in origin)

The methods used have been developed in various fields, often independently. In statistics, going from particular observations to general descriptions is called inference, learning [i.e., using example (training) data] is called estimation, and classification is known as discriminant analysis (McLachlan 1992). In engineering, classification is called pattern recognition and the approach is nonparametric and much more empirical (Duda et al. 2001). Other approaches have their origins in machine learning (Alpaydin 2010), artificial intelligence (Russell and Norvig 2002), artificial neural networks (Bishop 2006), and data mining (Han and Kamber 2006). We will incorporate techniques from these different emphases to give a more unified treatment (Fig. 1.2).

Fig. 1.1 Classification when the classes are (a) known and (b) unknown beforehand
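As a concrete, if simplified, illustration of the distinction drawn in Fig. 1.1, the short sketch below contrasts the two settings in code. It is not taken from the book: it assumes Python with the scikit-learn library, and the two-feature values are made-up toy data chosen only for illustration.

```python
# Minimal sketch (assumed Python/scikit-learn, toy data) of the two settings in Fig. 1.1:
# (a) classes known beforehand, so a classifier learns from labeled examples;
# (b) classes unknown, so clustering forms the groups by inspecting the objects.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Six toy objects described by two features, e.g. (size, elongation)
X = np.array([[1.0, 0.2], [1.1, 0.3], [0.9, 0.25],   # small, roundish objects
              [3.0, 1.5], [3.2, 1.4], [2.9, 1.6]])   # large, elongated objects

# (a) Known classes: labels are supplied with the training data
y = np.array([0, 0, 0, 1, 1, 1])                     # class labels given in advance
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[1.05, 0.22]]))                   # a new object is assigned to class 0

# (b) Unknown classes: no labels; clustering creates the groups from the data alone
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                    # group membership discovered by the algorithm
```

In the first case the class memberships of the training examples are provided and the task is to assign new objects to those known classes (supervised learning, Chaps. 5 and 6); in the second, no labels are available and the groups themselves must be discovered (unsupervised learning, Chap. 8).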