Fig. 4. Autoassociative networks for finding a three-dimensional subspace. (a) Linear and (b) nonlinear (not all the connections are shown).

presented to the network in a random order. At each presentation, the winner, i.e., the neuron whose weight vector is closest to the input vector, is identified first. Then, all the neurons in the neighborhood (defined on the grid) of the winner are updated such that their weight vectors move towards the input vector. Consequently, after training is done, the weight vectors of neighboring neurons in the grid are likely to represent input patterns which are close in the original feature space. Thus, a "topology-preserving" map is formed. When the grid is plotted in the original space, the grid connections are more or less stressed according to the density of the training data. Thus, SOM offers an $m$-dimensional map with a spatial connectivity, which can be interpreted as feature extraction. SOM differs from learning vector quantization (LVQ) in that no neighborhood is defined in LVQ.
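To make the update rule above concrete, here is a minimal sketch of one SOM training loop. It is not the implementation used in the work cited here; the rectangular grid, the Gaussian neighborhood, and the exponentially decaying learning rate and radius are illustrative assumptions.

```python
# Minimal SOM training sketch (assumptions: rectangular grid, Gaussian
# neighborhood, exponentially decaying learning rate and radius).
import numpy as np

def train_som(X, grid_shape=(10, 10), n_iter=10_000, lr0=0.5, radius0=5.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    d = X.shape[1]
    # One weight vector per neuron on the two-dimensional grid.
    weights = rng.uniform(X.min(axis=0), X.max(axis=0), size=(rows, cols, d))
    # Grid coordinates of each neuron, used to define the neighborhood of the winner.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                   # present inputs in random order
        dists = np.linalg.norm(weights - x, axis=-1)  # distance of x to every weight vector
        winner = np.unravel_index(dists.argmin(), (rows, cols))
        lr = lr0 * np.exp(-t / n_iter)                # decaying learning rate (assumption)
        radius = radius0 * np.exp(-t / n_iter)        # shrinking neighborhood (assumption)
        # Gaussian neighborhood on the grid, centered at the winner.
        g = np.exp(-((grid - winner) ** 2).sum(axis=-1) / (2 * radius ** 2))
        # Move the winner and its grid neighbors towards the input vector.
        weights += lr * g[..., None] * (x - weights)
    return weights
```

Plotting the learned weight vectors with their grid connections in the input space reproduces the "more or less stressed" grid described above.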
Table 4 summarizes the feature extraction and projection methods discussed above. Note that the adjective "nonlinear" may apply both to the mapping (being a nonlinear function of the original features) and to the criterion function (for non-Gaussian data). Fig. 5 shows an example of four different two-dimensional projections of the four-dimensional Iris dataset. Fig. 5a and Fig. 5b show two linear mappings, while Fig. 5c and Fig. 5d depict two nonlinear mappings. Only the Fisher mapping (Fig. 5b) makes use of the category information, which is the main reason why this mapping exhibits the best separation between the three categories.

Fig. 5. Two-dimensional mappings of the Iris dataset (+: Iris Setosa; *: Iris Versicolor; o: Iris Virginica). (a) PCA, (b) Fisher Mapping, (c) Sammon Mapping, and (d) Kernel PCA with second order polynomial kernel.
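A comparison in the spirit of Fig. 5 can be reproduced with standard tools. The sketch below is only an assumption about how such projections might be computed today (scikit-learn is not used in the paper); the Sammon mapping of Fig. 5c has no off-the-shelf scikit-learn equivalent and is omitted.

```python
# Hedged sketch: three of the four projections of Fig. 5 (PCA, Fisher/LDA,
# and kernel PCA with a second-order polynomial kernel) applied to Iris.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 x 4 feature matrix, 3 classes

pca = PCA(n_components=2).fit_transform(X)                                   # (a) linear, unsupervised
fisher = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)      # (b) linear, uses labels
kpca = KernelPCA(n_components=2, kernel="poly", degree=2).fit_transform(X)   # (d) nonlinear, unsupervised

# Each result is a 150 x 2 array; plotting them side by side gives the
# qualitative picture of Fig. 5: only the Fisher mapping uses class labels.
```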
4.2 Feature Selection

The problem of feature selection is defined as follows: given a set of $d$ features, select a subset of size $m$ that leads to the smallest classification error. There has been a resurgence of interest in applying feature selection methods due to the large number of features encountered in the following situations: 1) multisensor fusion: features computed from different sensor modalities are concatenated to form a feature vector with a large number of components; 2) integration of multiple data models: sensor data can be modeled using different approaches, where the model parameters serve as features, and the parameters from different models can be pooled to yield a high-dimensional feature vector.

Let $Y$ be the given set of features, with cardinality $d$, and let $m$ represent the desired number of features in the selected subset $X$, $X \subset Y$. Let the feature selection criterion function for the set $X$ be represented by $J(X)$. Let us assume that a higher value of $J$ indicates a better feature subset; a natural choice for the criterion function is $J = (1 - P_e)$, where $P_e$ denotes the classification error. The use of $P_e$ in the criterion function makes feature selection procedures dependent on the specific classifier that is used and on the sizes of the training and test sets. The most straightforward approach to the feature selection problem would require 1) examining all $\binom{d}{m}$ possible subsets of size $m$ and 2) selecting the subset with the largest value of $J(\cdot)$. However, the number of possible subsets grows combinatorially, making this exhaustive search impractical even for moderate values of $m$ and $d$. Cover and Van Campenhout [35] showed that no nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset. They further showed that any ordering of the classification errors of the $2^d$ feature subsets is possible. Therefore, in order to guarantee the optimality of, say, a 12-dimensional feature subset out of 24 available features, approximately 2.7 million possible subsets must be evaluated.
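To make the brute-force approach and the numbers above concrete, here is a minimal sketch of exhaustive subset evaluation. The criterion `evaluate` is an assumed placeholder for $J(X) = 1 - P_e$ (e.g., the cross-validated accuracy of some classifier on the candidate subset); nothing in it is specific to the paper.

```python
# Hedged sketch of exhaustive feature-subset search: evaluate J(X) for every
# subset of size m and keep the best one. `evaluate` is an assumed placeholder
# for the criterion 1 - Pe.
from itertools import combinations
from math import comb

def exhaustive_selection(d, m, evaluate):
    best_subset, best_score = None, float("-inf")
    for subset in combinations(range(d), m):   # all C(d, m) subsets of size m
        score = evaluate(subset)               # J(X), e.g., 1 - Pe
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Why this is impractical: the number of candidate subsets explodes.
print(comb(24, 12))   # 2,704,156 -- the "approximately 2.7 million" above
```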
The only "optimal" (in terms of a class of monotonic criterion functions) feature selection method which avoids the exhaustive search is based on the branch-and-bound algorithm. This procedure avoids an exhaustive search by using intermediate results to obtain bounds on the final criterion value. The key to this algorithm is the monotonicity property of the criterion function $J(\cdot)$: given two feature subsets $X_1$ and $X_2$, if $X_1 \subset X_2$, then $J(X_1) < J(X_2)$. In other words, the performance of a feature subset should improve whenever a feature is added to it. Most commonly used criterion functions do not satisfy this monotonicity property.
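The pruning idea can be sketched as follows. This is a simplified recursive variant written under the stated monotonicity assumption, not the classical formulation; `J` is again an assumed placeholder criterion, and it must be monotonic for the pruning (and hence the optimality guarantee) to be valid.

```python
# Hedged sketch of branch-and-bound feature selection: start from the full
# feature set and delete features one at a time until size m. Because a
# monotonic J cannot increase when features are removed, any node whose
# score does not beat the best size-m subset found so far can be pruned.
def branch_and_bound(d, m, J):
    best = {"score": float("-inf"), "subset": None}

    def search(subset, start):
        score = J(subset)
        # Prune: with a monotonic J, no smaller subset of `subset`
        # can score higher than `subset` itself.
        if score <= best["score"]:
            return
        if len(subset) == m:
            best["score"], best["subset"] = score, subset
            return
        # Branch: remove one more feature; `start` enforces a canonical
        # removal order so each subset is visited exactly once.
        for i in range(start, len(subset)):
            search(subset[:i] + subset[i + 1:], i)

    search(tuple(range(d)), 0)
    return best["subset"], best["score"]
```

In the worst case this still visits exponentially many nodes, but with a tight bound and a good branch ordering it typically evaluates far fewer subsets than the exhaustive scan above.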
It has been argued that since feature selection is typically done in an off-line manner, the execution time of a particular algorithm is not as critical as the optimality of the feature subset it generates. While this is true for feature sets of moderate size, several recent applications, particularly those in data mining and document classification, involve thousands of features. In such cases, the computational requirement of a feature selection algorithm is extremely important. As the number of feature subset evaluations may easily become prohibitive for large feature sizes, a number of suboptimal selection techniques have been proposed which essentially trade off the optimality of the selected subset for computational efficiency.

Table 5 lists most of the well-known feature selection methods which have been proposed in the literature [85]. Only the first two methods in this table guarantee an optimal subset. All other strategies are suboptimal because the best pair of features need not contain the best single feature [34]. In general, good, larger feature sets do not necessarily include the good, small sets. As a result, the simple method of selecting just the best individual features may fail dramatically. It might still be useful, however, as a first step to select some individually good features when reducing very large feature sets (e.g., hundreds of features). Further selection has to be done by more advanced methods that take feature dependencies into account. These operate either by evaluating growing feature sets (forward selection) or by evaluating shrinking feature sets (backward selection). A simple sequential method like SFS (SBS) adds (deletes) one feature at a time. More sophisticated techniques are the "plus l-take away r" strategy and the Sequential Floating Search methods, SFFS and SBFS [126]. These methods backtrack as long as they find improvements compared to previous feature sets of the same size. In almost any large feature selection problem, these methods perform better than the straight sequential searches, SFS and SBS. SFFS and SBFS find "nested" sets of features that would otherwise remain hidden, but the number of feature set evaluations may easily increase by a factor of 2 to 10.
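The simplest of these sequential methods reduces the search to a greedy scan. Below is a minimal SFS sketch under the same assumption of a placeholder criterion `J`; SBS is the mirror image (start with all d features and greedily delete one at a time), and the floating methods add a backtracking step that keeps removing previously selected features as long as that improves on the best subset of the corresponding size.

```python
# Hedged sketch of sequential forward selection (SFS): starting from the
# empty set, greedily add the single feature that most improves J. The
# criterion `J` is an assumed placeholder (e.g., 1 - Pe estimated by
# cross-validating a classifier on the candidate subset).
def sequential_forward_selection(d, m, J):
    selected = []
    remaining = set(range(d))
    while len(selected) < m:
        # Evaluate J for each candidate obtained by adding one feature.
        best_feature = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```

Unlike the branch-and-bound sketch, this greedy procedure evaluates only O(d * m) subsets but carries no optimality guarantee, which is exactly the trade-off discussed above.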