x = (x1, x2, ..., xn)^T (2.4)

for a particular object. The feature vectors can be plotted as points in feature space (Fig. 2.5). For n features, the feature space is n-dimensional with each feature constituting a dimension. Objects from the same class should cluster together in feature space (reliability), and be well separated from different classes (discriminating). In classification, our goal is to assign each feature vector to one of the set of classes {ωi}. If the different features have different scales, it would be prudent to normalize each by its standard deviation (Fig. 2.6).

Fig. 2.5 Three-dimensional feature space containing two classes of features, class 1 (in gray) and class 2 (in black)

Fig. 2.6 Scaling of features
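The normalization by standard deviation illustrated in Fig. 2.6 can be sketched as follows (a minimal illustration; the helper name `standardize` is ours, not the book's):

```python
import math

def standardize(vectors):
    """Scale each feature to zero mean and unit standard deviation,
    so that var(x'_1) = var(x'_2) as in Fig. 2.6.
    Assumes no feature is constant across the objects."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[i] for v in vectors) / n for i in range(dim)]
    sds = [math.sqrt(sum((v[i] - means[i]) ** 2 for v in vectors) / n)
           for i in range(dim)]
    return [[(v[i] - means[i]) / sds[i] for i in range(dim)] for v in vectors]
```

After this rescaling, a feature measured in millimeters no longer dominates one measured in meters, so distances in feature space treat all features comparably.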
The classification stage assigns objects to certain categories (or classes) based on the feature information. How many features should we measure? And which are the best? The problem is that the more we measure, the higher the dimension of feature space, and the more complicated the classification will become (not to mention the added requirements for computation time and storage). This is referred to as the "curse of dimensionality." In our search for a simple, yet efficient, classifier we are frequently drawn to using the minimum number of "good" features that will be sufficient to do the classification adequately (for which we need a measure of the performance of the classifier) for a particular problem. This follows the heuristic principle known traditionally as Occam's razor (viz., the simplest solution is the best) or referred to as KISS (Keep It Simple, Stupid) in more contemporary language; while it may not be true in all situations, we will adopt a natural bias towards simplicity.

The prudent approach is to err on the side of measuring more features per object than might be necessary, and then reduce the number of features by either (1) feature selection, choosing the most informative subset of features and removing as many irrelevant and redundant features as possible (Yu and Liu 2004), or (2) feature extraction, combining the existing feature set into a smaller set of new, more informative features (Markovitch and Rosenstein 2002). The most well-known feature extraction method is Principal Component Analysis (PCA), which we will consider fully in Chap. 6.

One paradigm for classification is the learning from examples approach. If a sample of labeled objects (called the training set) is randomly selected, and their feature vectors plotted in feature space, then it may be possible to build a classifier which separates the two (or more) classes adequately using a decision boundary or decision surface.
A linear classifier results in a decision surface which is a hyperplane (Fig. 2.7). Again, the decision boundary should be as simple as possible, consistent with doing an adequate job of classifying. The use of a labeled training set, in which it is known to which class the sample objects belong, constitutes supervised learning.

Fig. 2.7 Linear classification, using labeled training sets and two features, results in a linear decision boundary
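A hyperplane decision surface can be sketched in a few lines (the weights and bias below are illustrative placeholders; in practice they would be determined by training on the labeled set):

```python
def linear_classify(x, w, b):
    """Assign a feature vector x to one of two classes according to which
    side of the hyperplane w . x + b = 0 it lies on (cf. Fig. 2.7).
    w and b are illustrative values; training would normally set them."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "class 1" if score >= 0 else "class 2"
```

For example, with w = (1, 1) and b = -5 the decision boundary is the line x1 + x2 = 5: points above it are assigned to class 1, points below it to class 2.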
2.3 Training and Learning

A typical classification problem comprises the following task: given example (or instance) objects typical of a number of classes (the training set), classify other objects (the test set) into one of these classes. Features need to be identified such that the in-class variations are less than the between-class variations.

If the classes to which the objects belong are known, the process is called supervised learning, and if they are unknown, in which case the most appropriate classes must be found, it is called unsupervised learning. With unsupervised learning, the hope is to discover the unknown, but useful, classes of items (Jain et al. 2000).

The process of using data to determine the best set of features for a classifier is known as training the classifier. The most effective methods for training classifiers involve learning from examples. A performance metric for a set of features, based on the classification errors it produces, should be calculated in order to evaluate the usefulness of the features.

Learning (aka machine learning or artificial intelligence) refers to some form of adaptation of the classification algorithm to achieve a better response, i.e., to reduce the classification error on a set of training data. This would involve feedback to earlier steps in the process in an iterative manner until some desired level of accuracy is achieved. Ideally this would result in a monotonically increasing performance (Fig. 2.8), although this is often difficult to achieve.

Fig. 2.8 Idealized learning curve

In reinforcement learning (Barto and Sutton 1997), the output of the system is a sequence of actions to best reach the goal. The machine learning program must discover the best sequence of actions to take to yield the best reward. A robot
navigating in an environment in search of a particular location is an example of reinforcement learning. After a number of trials, it should learn the correct sequence of moves to reach the location as quickly as possible without hitting any of the obstacles. A task may require multiple agents to interact to accomplish a common goal, such as with a team of robots playing soccer.

2.4 Supervised Learning and Algorithm Selection

Supervised learning is an inductive reasoning process, whereby a set of rules is learned from instances (examples in a training set) and a classifier algorithm is chosen or created that can apply these rules successfully to new instances. The process of applying supervised learning to a real-world problem is outlined in Fig. 2.9.

Fig. 2.9 The process of supervised learning
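The iterative, error-feedback view of learning described in Sect. 2.3 can be made concrete with a perceptron-style training loop (a sketch of one simple learner, not the only possibility): the weights of a linear decision boundary are adapted whenever a training object is misclassified, and the accuracy recorded after each pass over the training set is an empirical analogue of the idealized learning curve of Fig. 2.8.

```python
def train_perceptron(X, y, epochs=100, lr=0.1):
    """Learning from labeled examples: adapt the weights of a linear
    decision boundary whenever a training object is misclassified
    (error feedback), recording the accuracy after each pass.
    Labels in y are +1 or -1; epochs and lr are illustrative choices."""
    w, b = [0.0] * len(X[0]), 0.0
    curve = []
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(X, y):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * s <= 0:                        # misclassified: adapt
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
                mistakes += 1
        curve.append(1.0 - mistakes / len(y))     # accuracy on this pass
        if mistakes == 0:                         # converged on training set
            break
    return w, b, curve
```

On linearly separable training data this loop is guaranteed to converge, and `curve` typically rises toward 1.0, though (as the text notes) monotonic improvement is not guaranteed in general.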
The choice of which specific learning algorithm to use is a critical step. We should choose an algorithm, apply it to a training set, and then evaluate it before adopting it for general use. The evaluation is most often based on prediction accuracy, i.e., the number of correct predictions divided by the total number of predictions. There are at least three techniques which are used to calculate the accuracy of a classifier:

1. Split the training set by using two-thirds for training and the other third for estimating performance.
2. Divide the training set into mutually exclusive and equal-sized subsets and, for each subset, train the classifier on the union of all the other subsets. The average of the error rate of each subset is then an estimate of the error rate of the classifier. This is known as cross-validation.
3. Leave-one-out validation is a special case of cross-validation, with all test subsets consisting of a single instance. This type of validation is, of course, more expensive computationally, but useful when the most accurate estimate of a classifier's error rate is required.

We will consider ways to estimate the performance and accuracy of classifiers in Chap. 8.

2.5 Approaches to Classification

There are a variety of approaches to classification:

1. Statistical approaches (Chaps. 4 and 5) are characterized by their reliance on an explicit underlying probability model. The features are extracted from the input data (object) and are used to assign each object (described by a feature vector) to one of the labeled classes. The decision boundaries are determined by the probability distributions of the objects belonging to each class, which must either be specified or learned. A priori probabilities (i.e., probabilities before measurement, described by probability density functions) are converted into a posteriori (or class-/measurement-conditioned) probabilities (i.e., probabilities after measurement).
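This prior-to-posterior conversion is Bayes' rule, P(ωi | x) = p(x | ωi) P(ωi) / Σj p(x | ωj) P(ωj). A minimal numeric sketch (the probability values below are illustrative, not taken from the text):

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: convert a priori class probabilities P(w_i) into
    a posteriori probabilities P(w_i | x), given the class-conditional
    likelihoods p(x | w_i) evaluated at the measurement x."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # p(x|w_i) P(w_i)
    evidence = sum(joint)                                 # p(x)
    return [j / evidence for j in joint]
```

With equal priors the posteriors simply follow the likelihoods; a strong prior, on the other hand, can dominate an uninformative measurement.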
Bayesian networks (e.g., Jensen 1996) are the most well-known representative of statistical learning algorithms.

In a discriminant analysis-based approach, a parametric form of the decision boundary (e.g., linear or quadratic) is specified, and then the best decision boundary of this form is found based on the classification of training objects. Such boundaries can be constructed using, for example, a mean squared error criterion.

In maximum entropy techniques, the overriding principle is that when nothing is known, the distribution should be as uniform as possible, i.e., have maximal entropy. Labeled training data are used to derive a set of constraints for