The decision making process in statistical pattern recognition can be summarized as follows: A given pattern is to be assigned to one of $c$ categories $\omega_1, \omega_2, \ldots, \omega_c$ based on a vector of $d$ feature values $\mathbf{x} = (x_1, x_2, \ldots, x_d)$. The features are assumed to have a probability density or mass function (depending on whether the features are continuous or discrete) conditioned on the pattern class. Thus, a pattern vector $\mathbf{x}$ belonging to class $\omega_i$ is viewed as an observation drawn randomly from the class-conditional probability function $p(\mathbf{x} \mid \omega_i)$. A number of well-known decision rules, including the Bayes decision rule, the maximum likelihood rule (which can be viewed as a particular case of the Bayes rule), and the Neyman-Pearson rule are available to define the decision boundary. The "optimal" Bayes decision rule for minimizing the risk (expected value of the loss function) can be stated as follows: Assign input pattern $\mathbf{x}$ to class $\omega_i$ for which the conditional risk

$$R(\omega_i \mid \mathbf{x}) = \sum_{j=1}^{c} L(\omega_i, \omega_j) \, P(\omega_j \mid \mathbf{x}) \qquad (1)$$

is minimum, where $L(\omega_i, \omega_j)$ is the loss incurred in deciding $\omega_i$ when the true class is $\omega_j$ and $P(\omega_j \mid \mathbf{x})$ is the posterior probability [44]. In the case of the 0/1 loss function, as defined in (2), the conditional risk becomes the conditional probability of misclassification:

$$L(\omega_i, \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j. \end{cases} \qquad (2)$$

For this choice of loss function, the Bayes decision rule can be simplified as follows (also called the maximum a posteriori (MAP) rule): Assign input pattern $\mathbf{x}$ to class $\omega_i$ if

$$P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \quad \text{for all } j \neq i. \qquad (3)$$
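As a concrete illustration of the MAP rule in (3), the sketch below (Python/NumPy, not part of the original text) assumes two classes with completely known Gaussian class-conditional densities and known priors; the parameter values are hypothetical and chosen only for the example.

```python
import numpy as np

# A minimal sketch of the MAP rule in (3), assuming known Gaussian
# class-conditional densities p(x | w_i) and known priors P(w_i).
# The means, covariances, and priors below are illustrative only.

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at x."""
    d = len(mean)
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def map_classify(x, means, covs, priors):
    """Assign x to the class with the largest posterior P(w_i | x).

    Since p(x) is common to all classes, it suffices to compare the
    unnormalized posteriors p(x | w_i) * P(w_i).
    """
    scores = [gaussian_pdf(x, m, c) * p for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Example usage with two hypothetical classes:
means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs   = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print(map_classify(np.array([1.8, 1.5]), means, covs, priors))  # -> 1
```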
Various strategies are utilized to design a classifier in statistical pattern recognition, depending on the kind of information available about the class-conditional densities. If all of the class-conditional densities are completely specified, then the optimal Bayes decision rule can be used to design a classifier. However, the class-conditional densities are usually not known in practice and must be learned from the available training patterns. If the form of the class-conditional densities is known (e.g., multivariate Gaussian), but some of the parameters of the densities (e.g., mean vectors and covariance matrices) are unknown, then we have a parametric decision problem. A common strategy for this kind of problem is to replace the unknown parameters in the density functions by their estimated values, resulting in the so-called Bayes plug-in classifier. The optimal Bayesian strategy in this situation requires additional information in the form of a prior distribution on the unknown parameters. If the form of the class-conditional densities is not known, then we operate in a nonparametric mode. In this case, we must either estimate the density function (e.g., Parzen window approach) or directly construct the decision boundary based on the training data (e.g., k-nearest neighbor rule). In fact, the multilayer perceptron can also be viewed as a supervised nonparametric method which constructs a decision boundary.

Another dichotomy in statistical pattern recognition is that of supervised learning (labeled training samples) versus unsupervised learning (unlabeled training samples). The label of a training pattern represents the category to which that pattern belongs. In an unsupervised learning problem, sometimes the number of classes must be learned along with the structure of each class. The various dichotomies that appear in statistical pattern recognition are shown in the tree structure of Fig. 2.

Fig. 2. Various approaches in statistical pattern recognition.

As we traverse the tree from top to bottom and left to right, less information is available to the system designer and, as a result, the difficulty of classification problems increases. In some sense, most of the approaches in statistical pattern recognition (leaf nodes in the tree of Fig. 2) are attempting to implement the Bayes decision rule. The field of cluster analysis essentially deals with decision making problems in the nonparametric and unsupervised learning mode [81]. Further, in cluster analysis the number of categories or clusters may not even be specified; the task is to discover a reasonable categorization of the data (if one exists). Cluster analysis algorithms, along with various techniques for visualizing and projecting multidimensional data, are also referred to as exploratory data analysis methods.

Yet another dichotomy in statistical pattern recognition can be based on whether the decision boundaries are obtained directly (geometric approach) or indirectly (probabilistic density-based approach), as shown in Fig. 2. The probabilistic approach requires estimating the density functions first and then constructing the discriminant functions which specify the decision boundaries. On the other hand, the geometric approach often constructs the decision boundaries directly by optimizing certain cost functions. We should point out that, under certain assumptions on the density functions, the two approaches are equivalent. We will see examples of each category in Section 5.

No matter which classification or decision rule is used, it must be trained using the available training samples. As a result, the performance of a classifier depends on both the number of available training samples as well as the specific values of the samples. At the same time, the goal of designing a recognition system is to classify future test samples which are likely to be different from the training samples. Therefore, optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set. The generalization ability of a classifier refers to its performance in classifying test patterns which were not used during the training stage. A poor generalization ability of a classifier can be attributed to any one of the following factors: 1) the number of features is too large relative to the number of training samples (curse of dimensionality [80]), 2) the number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network), and 3) a classifier is too intensively optimized on the training set (overtrained); this is analogous to the phenomenon of overfitting in regression when there are too many free parameters.
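The gap between training-set and test-set performance can be made concrete with a small experiment; the sketch below (Python/NumPy, an illustration not taken from the paper) fits a 1-nearest-neighbor rule, whose apparent error is always zero, on a small sample from two overlapping Gaussian classes and evaluates it on an independent test set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points per class from two overlapping 2D Gaussians (illustrative)."""
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

def nn_predict(Xtrain, ytrain, Xtest):
    """1-nearest-neighbor rule: label of the closest training pattern."""
    d = np.linalg.norm(Xtest[:, None, :] - Xtrain[None, :, :], axis=2)
    return ytrain[np.argmin(d, axis=1)]

Xtr, ytr = sample(20)      # small training set
Xte, yte = sample(1000)    # large independent test set

train_err = np.mean(nn_predict(Xtr, ytr, Xtr) != ytr)   # apparent error: 0
test_err  = np.mean(nn_predict(Xtr, ytr, Xte) != yte)   # generalization error > 0
print(train_err, test_err)
```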
Overtraining has been investigated theoretically for classifiers that minimize the apparent error rate (the error on the training set). The classical studies by Cover [33] and Vapnik [162] on classifier capacity and complexity provide a good understanding of the mechanisms behind overtraining. Complex classifiers (e.g., those having many independent parameters) may have a large capacity, i.e., they are able to represent many dichotomies for a given dataset. A frequently used measure for the capacity is the Vapnik-Chervonenkis (VC) dimensionality [162]. These results can also be used to prove some interesting properties, for example, the consistency of certain classifiers (see Devroye et al. [40], [41]). The practical use of the results on classifier complexity was initially limited because the proposed bounds on the required number of (training) samples were too conservative. In the recent development of support vector machines [162], however, these results have proved to be quite useful. The pitfalls of overadaptation of estimators to the given training set are observed in several stages of a pattern recognition system, such as dimensionality reduction, density estimation, and classifier design. A sound solution is to always use an independent dataset (test set) for evaluation. In order to avoid the necessity of having several independent test sets, estimators are often based on rotated subsets of the data, preserving different parts of the data for optimization and evaluation [166]. Examples are the optimization of the covariance estimates for the Parzen kernel [76] and discriminant analysis [61], and the use of bootstrapping for designing classifiers [48] and for error estimation [82].

Throughout the paper, some of the classification methods will be illustrated by simple experiments on the following three data sets:

Dataset 1: An artificial dataset consisting of two classes with bivariate Gaussian densities with the following parameters:

$$\mathbf{m}_1 = (1, 1), \quad \mathbf{m}_2 = (2, 0), \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix} \quad \text{and} \quad \Sigma_2 = \begin{bmatrix} 0.8 & 0 \\ 0 & 1 \end{bmatrix}.$$

The intrinsic overlap between these two densities is 12.5 percent.

Dataset 2: The Iris dataset consists of 150 four-dimensional patterns in three classes (50 patterns each): Iris Setosa, Iris Versicolor, and Iris Virginica.

Dataset 3: The digit dataset consists of handwritten numerals ("0"-"9") extracted from a collection of Dutch utility maps. Two hundred patterns per class (for a total of 2,000 patterns) are available in the form of 30 × 48 binary images. These characters are represented in terms of the following six feature sets:

1. 76 Fourier coefficients of the character shapes;
2. 216 profile correlations;
3. 64 Karhunen-Loève coefficients;
4. 240 pixel averages in 2 × 3 windows;
5. 47 Zernike moments;
6. 6 morphological features.
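Dataset 1 above is fully specified by its mean vectors and covariance matrices, so a comparable sample can be drawn in a few lines; the sketch below is an illustration in Python/NumPy (the sample size and random seed are arbitrary choices, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(42)

# Parameters of Dataset 1 as given above (two bivariate Gaussian classes).
m1, S1 = np.array([1.0, 1.0]), np.array([[1.0, 0.0], [0.0, 0.25]])
m2, S2 = np.array([2.0, 0.0]), np.array([[0.8, 0.0], [0.0, 1.0]])

def draw_dataset1(n_per_class=250):
    """Draw n_per_class samples from each class; returns patterns X and labels y."""
    X1 = rng.multivariate_normal(m1, S1, size=n_per_class)
    X2 = rng.multivariate_normal(m2, S2, size=n_per_class)
    X = np.vstack([X1, X2])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X, y = draw_dataset1()
print(X.shape, y.shape)  # (500, 2) (500,)
```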
Details of this dataset are available in [160]. In our experiments we always used the same subset of 1,000 patterns for testing and various subsets of the remaining 1,000 patterns for training. (The dataset is available through the University of California, Irvine Machine Learning Repository, www.ics.uci.edu/~mlearn/MLRepository.html.) Throughout this paper, when we refer to "the digit dataset," just the Karhunen-Loève features (in item 3) are meant, unless stated otherwise.

3 THE CURSE OF DIMENSIONALITY AND PEAKING PHENOMENA

The performance of a classifier depends on the interrelationship between sample sizes, number of features, and classifier complexity. A naive table-lookup technique (partitioning the feature space into cells and associating a class label with each cell) requires the number of training data points to be an exponential function of the feature dimension [18]. This phenomenon is termed the "curse of dimensionality," which leads to the "peaking phenomenon" (see discussion below) in classifier design. It is well known that the probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known (or, equivalently, the number of training samples is arbitrarily large and representative of the underlying densities). However, it has often been observed in practice that the added features may actually degrade the performance of a classifier if the number of training samples that are used to design the classifier is small relative to the number of features. This paradoxical behavior is referred to as the peaking phenomenon [80], [131], [132]. (In the rest of this paper, we do not make a distinction between the curse of dimensionality and the peaking phenomenon.) A simple explanation for this phenomenon is as follows: The most commonly used parametric classifiers estimate the unknown parameters and plug them in for the true parameters in the class-conditional densities. For a fixed sample size, as the number of features is increased (with a corresponding increase in the number of unknown parameters), the reliability of the parameter estimates decreases. Consequently, the performance of the resulting plug-in classifiers, for a fixed sample size, may degrade with an increase in the number of features.

Trunk [157] provided a simple example to illustrate the curse of dimensionality, which we reproduce below. Consider the two-class classification problem with equal prior probabilities and a d-dimensional multivariate Gaussian distribution with the identity covariance matrix for each class. The mean vectors for the two classes have the following components:

$$\mathbf{m}_1 = \left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \ldots, \tfrac{1}{\sqrt{d}}\right) \quad \text{and} \quad \mathbf{m}_2 = \left(-1, -\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{3}}, \ldots, -\tfrac{1}{\sqrt{d}}\right).$$

Note that the features are statistically independent and the discriminating power of the successive features decreases monotonically, with the first feature providing the maximum discrimination between the two classes. The only parameter in the densities is the mean vector, $\mathbf{m} = \mathbf{m}_1 = -\mathbf{m}_2$. Trunk considered the following two cases:

1. The mean vector $\mathbf{m}$ is known. In this situation, we can use the optimal Bayes decision rule (with a 0/1 loss function) to construct the decision boundary. The probability of error as a function of d can be expressed as:

$$P_e(d) = \int_{\sqrt{\sum_{i=1}^{d} \frac{1}{i}}}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}z^2} \, dz. \qquad (4)$$

It is easy to verify that $\lim_{d \to \infty} P_e(d) = 0$. In other words, we can perfectly discriminate the two classes by arbitrarily increasing the number of features, d.

2. The mean vector $\mathbf{m}$ is unknown and n labeled training samples are available. Trunk found the maximum likelihood estimate $\hat{\mathbf{m}}$ of $\mathbf{m}$ and used the plug-in decision rule (substitute $\hat{\mathbf{m}}$ for $\mathbf{m}$ in the optimal Bayes decision rule). Now the probability of error, which is a function of both n and d, can be written as:

$$P_e(n, d) = \int_{\theta(d)}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}z^2} \, dz, \quad \text{where} \qquad (5)$$

$$\theta(d) = \frac{\sum_{i=1}^{d} \frac{1}{i}}{\sqrt{\left(1 + \frac{1}{n}\right) \sum_{i=1}^{d} \frac{1}{i} + \frac{d}{n}}}. \qquad (6)$$

Trunk showed that $\lim_{d \to \infty} P_e(n, d) = \frac{1}{2}$, which implies that the probability of error approaches the maximum possible value of 0.5 for this two-class problem. This demonstrates that, unlike case 1), we cannot arbitrarily increase the number of features when the parameters of the class-conditional densities are estimated from a finite number of training samples.
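The behavior of the two cases can be checked numerically from (4), (5), and (6); the sketch below is an illustration in Python (the choice n = 25 and the grid of d values are arbitrary, not taken from Trunk's presentation), evaluating both error probabilities through the standard normal tail.

```python
import numpy as np
from math import erf, sqrt

def q(t):
    """Upper tail of the standard normal: integral of N(0,1) from t to infinity."""
    return 0.5 * (1.0 - erf(t / sqrt(2.0)))

def pe_known(d):
    """Equation (4): error probability when the mean vector m is known."""
    s = np.sum(1.0 / np.arange(1, d + 1))
    return q(np.sqrt(s))

def pe_plugin(n, d):
    """Equations (5)-(6): error probability for the plug-in rule with n samples."""
    s = np.sum(1.0 / np.arange(1, d + 1))
    theta = s / np.sqrt((1.0 + 1.0 / n) * s + d / n)
    return q(theta)

for d in (1, 10, 100, 1000, 10000):
    print(d, round(pe_known(d), 4), round(pe_plugin(25, d), 4))
# Pe(d) keeps shrinking as d grows, while Pe(n, d) first improves and then
# drifts back toward 0.5 -- the peaking behavior described in the text.
```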
The practical implication of the curse of dimensionality is that a system designer should try to select only a small number of salient features when confronted with a limited training set.

All of the commonly used classifiers, including multilayer feed-forward networks, can suffer from the curse of dimensionality. While an exact relationship between the probability of misclassification, the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish, some guidelines have been suggested regarding the ratio of the sample size to dimensionality. It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is a good practice to follow in classifier design [80]. The more complex the classifier, the larger the ratio of sample size to dimensionality should be in order to avoid the curse of dimensionality.

4 DIMENSIONALITY REDUCTION

There are two main reasons to keep the dimensionality of the pattern representation (i.e., the number of features) as small as possible: measurement cost and classification
accuracy. A limited yet salient feature set simplifies both the pattern representation and the classifiers that are built on the selected representation. Consequently, the resulting classifier will be faster and will use less memory. Moreover, as stated earlier, a small number of features can alleviate the curse of dimensionality when the number of training samples is limited. On the other hand, a reduction in the number of features may lead to a loss in the discrimination power and thereby lower the accuracy of the resulting recognition system. Watanabe's ugly duckling theorem [163] also supports the need for a careful choice of the features, since it is possible to make two arbitrary patterns similar by encoding them with a sufficiently large number of redundant features.

It is important to make a distinction between feature selection and feature extraction. The term feature selection refers to algorithms that select the (hopefully) best subset of the input feature set. Methods that create new features based on transformations or combinations of the original feature set are called feature extraction algorithms. However, the terms feature selection and feature extraction are used interchangeably in the literature. Note that often feature extraction precedes feature selection; first, features are extracted from the sensed data (e.g., using principal component or discriminant analysis) and then some of the extracted features with low discrimination ability are discarded. The choice between feature selection and feature extraction depends on the application domain and the specific training data which is available. Feature selection leads to savings in measurement cost (since some of the features are discarded) and the selected features retain their original physical interpretation. In addition, the retained features may be important for understanding the physical process that generates the patterns. On the other hand, transformed features generated by feature extraction may provide a better discriminative ability than the best subset of given features, but these new features (a linear or a nonlinear combination of given features) may not have a clear physical meaning.

In many situations, it is useful to obtain a two- or three-dimensional projection of the given multivariate data (n × d pattern matrix) to permit a visual examination of the data. Several graphical techniques also exist for visually observing multivariate data, in which the objective is to exactly depict each pattern as a picture with d degrees of freedom, where d is the given number of features. For example, Chernoff [29] represents each pattern as a cartoon face whose facial characteristics, such as nose length, mouth curvature, and eye size, are made to correspond to individual features. Fig. 3 shows three faces corresponding to the mean vectors of the Iris Setosa, Iris Versicolor, and Iris Virginica classes in the Iris data (150 four-dimensional patterns; 50 patterns per class). Note that the face associated with Iris Setosa looks quite different from the other two faces, which implies that the Setosa category can be well separated from the remaining two categories in the four-dimensional feature space. (This is also evident in the two-dimensional plots of this data in Fig. 5.)

The main issue in dimensionality reduction is the choice of a criterion function. A commonly used criterion is the classification error of a feature subset.
But the classification error itself cannot be reliably estimated when the ratio of sample size to the number of features is small. In addition to the choice of a criterion function, we also need to determine the appropriate dimensionality of the reduced feature space. The answer to this question is embedded in the notion of the intrinsic dimensionality of data. Intrinsic dimensionality essentially determines whether the given d-dimensional patterns can be described adequately in a subspace of dimensionality less than d. For example, d-dimensional patterns along a reasonably smooth curve have an intrinsic dimensionality of one, irrespective of the value of d. Note that the intrinsic dimensionality is not the same as the linear dimensionality, which is a global property of the data involving the number of significant eigenvalues of the covariance matrix of the data. While several algorithms are available to estimate the intrinsic dimensionality [81], they do not indicate how a subspace of the identified dimensionality can be easily identified.

We now briefly discuss some of the commonly used methods for feature extraction and feature selection.

4.1 Feature Extraction

Feature extraction methods determine an appropriate subspace of dimensionality m (either in a linear or a nonlinear way) in the original feature space of dimensionality d (m ≤ d). Linear transforms, such as principal component analysis, factor analysis, linear discriminant analysis, and projection pursuit, have been widely used in pattern recognition for feature extraction and dimensionality reduction. The best known linear feature extractor is the principal component analysis (PCA) or Karhunen-Loève expansion, which computes the m largest eigenvectors of the d × d covariance matrix of the n d-dimensional patterns. The linear transformation is defined as

$$Y = XH, \qquad (7)$$

where X is the given n × d pattern matrix, Y is the derived n × m pattern matrix, and H is the d × m matrix of linear transformation whose columns are the eigenvectors. Since PCA uses the most expressive features (eigenvectors with the largest eigenvalues), it effectively approximates the data by a linear subspace using the mean squared error criterion.
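The transformation in (7) is simple to compute; the following sketch (Python/NumPy, a minimal illustration rather than a reference implementation) forms H from the m leading eigenvectors of the sample covariance matrix and projects the patterns onto them. The data are mean-centered first, as is customary in practice.

```python
import numpy as np

def pca_transform(X, m):
    """Project the n x d pattern matrix X onto its m largest principal components.

    Returns (Y, H): the derived n x m pattern matrix Y = Xc H as in (7), where Xc
    is the mean-centered data and H holds the m leading eigenvectors of the
    d x d covariance matrix as columns.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
    H = eigvecs[:, order]                    # d x m transformation matrix
    return Xc @ H, H

# Example: reduce hypothetical 4-dimensional patterns to m = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
Y, H = pca_transform(X, m=2)
print(Y.shape, H.shape)  # (150, 2) (4, 2)
```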
Other methods, like projection pursuit [53] and independent component analysis (ICA) [31], [11], [24], [96], are more appropriate for non-Gaussian distributions since they do not rely on the second-order property of the data. ICA has been successfully used for blind-source separation [78], extracting linear feature combinations that define independent sources. This demixing is possible if at most one of the sources has a Gaussian distribution.

Whereas PCA is an unsupervised linear feature extraction method, discriminant analysis uses the category information associated with each pattern for (linearly) extracting the most discriminatory features. In discriminant analysis, interclass separation is emphasized by replacing the total covariance matrix in PCA by a general separability measure like the Fisher criterion, which results in finding the eigenvectors of $S_w^{-1} S_b$ (the product of the inverse of the within-class scatter matrix, $S_w$, and the between-class scatter matrix, $S_b$) [58]. Another supervised criterion for non-Gaussian class-conditional densities is based on the Patrick-Fisher distance using Parzen density estimates [41].

There are several ways to define nonlinear feature extraction techniques. One such method which is directly related to PCA is called Kernel PCA [73], [145]. The basic idea of kernel PCA is to first map the input data into some new feature space F, typically via a nonlinear function $\Phi$ (e.g., a polynomial of degree p), and then perform a linear PCA in the mapped space. However, the F space often has a very high dimension. To avoid computing the mapping $\Phi$ explicitly, kernel PCA employs only Mercer kernels, which can be decomposed into a dot product,

$$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}).$$

As a result, the kernel space has a well-defined metric. Examples of Mercer kernels include the pth-order polynomial kernel $(\mathbf{x} \cdot \mathbf{y})^p$ and the Gaussian kernel $e^{-\|\mathbf{x}-\mathbf{y}\|^2/c}$.

Let X be the normalized n × d pattern matrix with zero mean, and $\Phi(X)$ be the pattern matrix in the F space. The linear PCA in the F space solves the eigenvectors of the correlation matrix $\Phi(X)\Phi(X)^T$, which is also called the kernel matrix K(X, X). In kernel PCA, the first m eigenvectors of K(X, X) are obtained to define a transformation matrix E. (E has size n × m, where m represents the desired number of features, m ≤ d.) New patterns x are mapped by K(x, X)E; they are then represented relative to the training set and not by their measured feature values. Note that for a complete representation, up to m eigenvectors in E may be needed (depending on the kernel function) by kernel PCA, while in linear PCA a set of d eigenvectors represents the original feature space. How the kernel function should be chosen for a given application is still an open issue.
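The mapping K(x, X)E can be written compactly in code. The sketch below (Python/NumPy) follows the simplified description given above, using the Gaussian kernel; practical kernel PCA implementations additionally center the kernel matrix in the feature space F and rescale the eigenvectors by their eigenvalues, steps omitted here for brevity. The data sizes and the kernel parameter c are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, c=1.0):
    """K(a, b) = exp(-||a - b||^2 / c) for every pair of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / c)

def kernel_pca_fit(X, m, c=1.0):
    """Return the n x m transformation matrix E built from the kernel matrix K(X, X)."""
    K = gaussian_kernel(X, X, c)
    eigvals, eigvecs = np.linalg.eigh(K)               # ascending eigenvalues
    E = eigvecs[:, np.argsort(eigvals)[::-1][:m]]      # m leading eigenvectors
    return E

def kernel_pca_map(x_new, X, E, c=1.0):
    """Map new patterns by K(x, X) E, i.e., relative to the training set."""
    return gaussian_kernel(np.atleast_2d(x_new), X, c) @ E

# Example with hypothetical data: 100 training patterns in 5 dimensions, m = 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
E = kernel_pca_fit(X, m=3)
print(kernel_pca_map(rng.normal(size=(2, 5)), X, E).shape)  # (2, 3)
```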
Multidimensional scaling (MDS) is another nonlinear feature extraction technique. It aims to represent a multidimensional dataset in two or three dimensions such that the distance matrix in the original d-dimensional feature space is preserved as faithfully as possible in the projected space. Various stress functions are used for measuring the performance of this mapping [20]; the most popular criterion is the stress function introduced by Sammon [141] and Niemann [114]. A problem with MDS is that it does not give an explicit mapping function, so it is not possible to place a new pattern in a map which has been computed for a given training set without repeating the mapping. Several techniques have been investigated to address this deficiency, ranging from linear interpolation to training a neural network [38]. It is also possible to redefine the MDS algorithm so that it directly produces a map that may be used for new test patterns [165].

A feed-forward neural network offers an integrated procedure for feature extraction and classification; the output of each hidden layer may be interpreted as a set of new, often nonlinear, features presented to the output layer for classification. In this sense, multilayer networks serve as feature extractors [100]. For example, the networks used by Fukushima et al. [62] and Le Cun et al. [95] have the so-called shared weight layers that are in fact filters for extracting features in two-dimensional images. During training, the filters are tuned to the data so as to maximize the classification performance.

Neural networks can also be used directly for feature extraction in an unsupervised mode. Fig. 4a shows the architecture of a network which is able to find the PCA subspace [117]. Instead of sigmoids, the neurons have linear transfer functions. This network has d inputs and d outputs, where d is the given number of features. The inputs are also used as targets, forcing the output layer to reconstruct the input space using only one hidden layer. The three nodes in the hidden layer capture the first three principal components [18]. If two nonlinear layers with sigmoidal hidden units are also included (see Fig. 4b), then a nonlinear subspace is found in the middle layer (also called the bottleneck layer). The nonlinearity is limited by the size of these additional layers. These so-called autoassociative, or nonlinear PCA, networks offer a powerful tool to train and describe nonlinear subspaces [98]. Oja [118] shows how autoassociative networks can be used for ICA.

The Self-Organizing Map (SOM), or Kohonen Map [92], can also be used for nonlinear feature extraction. In SOM, neurons are arranged in an m-dimensional grid, where m is usually 1, 2, or 3. Each neuron is connected to all the d input units. The weights on the connections for each neuron form a d-dimensional weight vector. During training, patterns are

Fig. 3. Chernoff Faces corresponding to the mean vectors of Iris Setosa, Iris Versicolor, and Iris Virginica.