4 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL.22,NO.1,JANUARY 2000 Statistical Pattern Recognition:A Review Anil K.Jain,Fellow,IEEE,Robert P.W.Duin,and Jianchang Mao,Senior Member,IEEE Abstract-The primary goal of pattern recognition is supervised or unsupervised classification.Among the various frameworks in which pattern recognition has been traditionally formulated,the statistical approach has been most intensively studied and used in practice.More recently,neural network techniques and methods imported from statistical leaming theory have been receiving increasing attention.The design of a recognition system requires careful attention to the following issues:definition of pattern classes, sensing environment,pattern representation,feature extraction and selection,cluster analysis,classifier design and leaming,selection of training and test samples,and performance evaluation.In spite of almost 50 years of research and development in this field,the general problem of recognizing complex patterns with arbitrary orientation,location,and scale remains unsolved.New and emerging applications,such as data mining,web searching,retrieval of multimedia data,face recognition,and cursive handwriting recognition, require robust and efficient pattem recognition techniques.The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field Index Terms-Statistical pattern recognition,classification,clustering,feature extraction,feature selection,error estimation,classifier combination,neural networks. ◆ 1 INTRODUCTION DY the time they are five years old,most children can depend in some way on pattern recognition...We need to Drecognize digits and letters.Small characters,large pay much more explicit attention to teaching pattern characters,handwritten,machine printed,or rotated-all recognition."Our goal here is to introduce pattern recogni- are easily recognized by the young.The characters may be tion as the best possible way of utilizing available sensors, written on a cluttered background,on crumpled paper or processors,and domain knowledge to make decisions may even be partially occluded.We take this ability for automatically. granted until we face the task of teaching a machine how to do the same.Pattern recognition is the study of how 1.1 What is Pattern Recognition? machines can observe the environment,learn to distinguish Automatic (machine)recognition,description,classifica- patterns of interest from their background,and make sound tion,and grouping of patterns are important problems in a and reasonable decisions about the categories of the variety of engineering and scientific disciplines such as patterns.In spite of almost 50 years of research,design of biology,psychology,medicine,marketing,computer vision, a general purpose machine pattern recognizer remains an artificial intelligence,and remote sensing.But what is a elusive goal. pattern?Watanabe [163]defines a pattern "as opposite of a The best pattern recognizers in most instances are chaos;it is an entity,vaguely defined,that could be given a humans,yet we do not understand how humans recognize name."For example,a pattern could be a fingerprint image, patterns.Ross [140]emphasizes the work of Nobel Laureate a handwritten cursive word,a human face,or a speech Herbert Simon whose central finding was that pattern signal.Given a pattern,its recognition/classification may recognition is critical in most human decision making tasks: consist of one of the following two tasks [163]:1)supervised "The more relevant patterns at your disposal,the better classification(e.g.,discriminant analysis)in which the input your decisions will be.This is hopeful news to proponents pattern is identified as a member of a predefined class, of artificial intelligence,since computers can surely be 2)unsupervised classification (e.g.,clustering)in which the taught to recognize patterns.Indeed,successful computer pattern is assigned to a hitherto unknown class.Note that programs that help banks score credit applicants,help the recognition problem here is being posed as a classifica- doctors diagnose disease and help pilots land airplanes tion or categorization task,where the classes are either defined by the system designer (in supervised classifica- tion)or are learned based on the similarity of patterns(in .A.K.Jain is with the Department of Computer Science and Engineering, unsupervised classification). Michigan State University,East Lansing,MI 48824. Interest in the area of pattern recognition has been E-mail:jain@cse.msu.edu. .R.P.W.Duin is with the Department of Applied Physics,Delft University renewed recently due to emerging applications which are of Techmnology,2600 GA Delft,the Netherlands. not only challenging but also computationally more E-mail:duin@ph.tn.tudelft.nl. demanding(see Table 1).These applications include data .J.Mao is with the IBM Almaden Research Center,650 Harry Road,San mining (identifying a "pattern,"e.g.,correlation,or an Jose,CA 95120.E-mail:mao@almaden.ibm.com. outlier in millions of multidimensional patterns),document Manuscript received 23 July 1999;accepted 12 Oct.1999. classification (efficiently searching text documents),finan- Recommended for acceptance by K.Bowyer. For information on obtaining reprints of this article,please send e-mail to: cial forecasting,organization and retrieval of multimedia fpami@computer.org,and reference IEEECS Log Number 110296. databases,and biometrics (personal identification based on 0162-8828/00s10.00020001EEE
Statistical Pattern Recognition: A Review Anil K. Jain, Fellow, IEEE, Robert P.W. Duin, and Jianchang Mao, Senior Member, IEEE AbstractÐThe primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field. Index TermsÐStatistical pattern recognition, classification, clustering, feature extraction, feature selection, error estimation, classifier combination, neural networks. æ 1 INTRODUCTION BY the time they are five years old, most children can recognize digits and letters. Small characters, large characters, handwritten, machine printed, or rotatedÐall are easily recognized by the young. The characters may be written on a cluttered background, on crumpled paper or may even be partially occluded. We take this ability for granted until we face the task of teaching a machine how to do the same. Pattern recognition is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about the categories of the patterns. In spite of almost 50 years of research, design of a general purpose machine pattern recognizer remains an elusive goal. The best pattern recognizers in most instances are humans, yet we do not understand how humans recognize patterns. Ross [140] emphasizes the work of Nobel Laureate Herbert Simon whose central finding was that pattern recognition is critical in most human decision making tasks: ªThe more relevant patterns at your disposal, the better your decisions will be. This is hopeful news to proponents of artificial intelligence, since computers can surely be taught to recognize patterns. Indeed, successful computer programs that help banks score credit applicants, help doctors diagnose disease and help pilots land airplanes depend in some way on pattern recognition... We need to pay much more explicit attention to teaching pattern recognition.º Our goal here is to introduce pattern recognition as the best possible way of utilizing available sensors, processors, and domain knowledge to make decisions automatically. 1.1 What is Pattern Recognition? Automatic (machine) recognition, description, classification, and grouping of patterns are important problems in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, and remote sensing. But what is a pattern? Watanabe [163] defines a pattern ªas opposite of a chaos; it is an entity, vaguely defined, that could be given a name.º For example, a pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal. Given a pattern, its recognition/classification may consist of one of the following two tasks [163]: 1) supervised classification (e.g., discriminant analysis) in which the input pattern is identified as a member of a predefined class, 2) unsupervised classification (e.g., clustering) in which the pattern is assigned to a hitherto unknown class. Note that the recognition problem here is being posed as a classification or categorization task, where the classes are either defined by the system designer (in supervised classification) or are learned based on the similarity of patterns (in unsupervised classification). Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally more demanding (see Table 1). These applications include data mining (identifying a ªpattern,º e.g., correlation, or an outlier in millions of multidimensional patterns), document classification (efficiently searching text documents), financial forecasting, organization and retrieval of multimedia databases, and biometrics (personal identification based on 4 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 22, NO. 1, JANUARY 2000 . A.K. Jain is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824. E-mail: jain@cse.msu.edu. . R.P.W. Duin is with the Department of Applied Physics, Delft University of Technology, 2600 GA Delft, the Netherlands. E-mail: duin@ph.tn.tudelft.nl. . J. Mao is with the IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120. E-mail: mao@almaden.ibm.com. Manuscript received 23 July 1999; accepted 12 Oct. 1999. Recommended for acceptance by K. Bowyer. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number 110296. 0162-8828/00/$10.00 ß 2000 IEEE
JAIN ET AL.:STATISTICAL PATTERN RECOGNITION:A REVIEW TABLE 1 Examples of Pattern Recognition Applications Problem Domain Application Input Pattern Pattern Classes Bioinformatics Sequence analysis DNA/Protein sequence Known types of genes/ patterns Data mining Searching for Points in multi- Compact and well- meaningful patterns dimensional space separated clusters Document Internet search Text document Semantic categories classification (e.g.,business,sports, etc.) Document image Reading machine for Document image Alphanumeric analysis the blind characters,words Industrial automation Printed circuit board Intensity or range Defective /non-defective inspection image nature of product Multimedia database Internet search Video clip Video genres (e.g., retrieval action,dialogue,etc.) Biometric recognition Personal identification Face,iris, Authorized users for fingerprint access control Remote sensing Forecasting crop yield Multispectral image Land use categories, growth pattern of crops Speech recognition Telephone directory Speech waveform Spoken words enquiry without operator assistance various physical attributes such as face and fingerprints). well-defined and sufficiently constrained recognition pro- Picard [125]has identified a novel application of pattern blem (small intraclass variations and large interclass recognition,called affective computing which will give a variations)will lead to a compact pattern representation computer the ability to recognize and express emotions,to and a simple decision making strategy.Learning from a set respond intelligently to human emotion,and to employ of examples (training set)is an important and desired mechanisms of emotion that contribute to rational decision attribute of most pattern recognition systems.The four best making.A common characteristic of a number of these known approaches for pattern recognition are:1)template applications is that the available features(typically,in the matching,2)statistical classification,3)syntactic or struc- thousands)are not usually suggested by domain experts, tural matching,and 4)neural networks.These models are but must be extracted and optimized by data-driven not necessarily independent and sometimes the same procedures. pattern recognition method exists with different interpreta- The rapidly growing and available computing power, tions.Attempts have been made to design hybrid systems while enabling faster processing of huge data sets,has also involving multiple models [57].A brief description and facilitated the use of elaborate and diverse methods for data comparison of these approaches is given below and analysis and classification.At the same time,demands on summarized in Table 2. automatic pattern recognition systems are rising enor- 1.2 Template Matching mously due to the availability of large databases and One of the simplest and earliest approaches to pattern stringent performance requirements(speed,accuracy,and recognition is based on template matching.Matching is a cost).In many of the emerging applications,it is clear that no single approach for classification is "optimal"and that generic operation in pattern recognition which is used to determine the similarity between two entities (points, multiple methods and approaches have to be used. curves,or shapes)of the same type.In template matching, Consequently,combining several sensing modalities and a template (typically,a 2D shape)or a prototype of the classifiers is now a commonly used practice in pattern pattern to be recognized is available.The pattern to be recognition. recognized is matched against the stored template while The design of a pattern recognition system essentially taking into account all allowable pose (translation and involves the following three aspects:1)data acquisition and rotation)and scale changes.The similarity measure,often a preprocessing,2)data representation,and 3)decision correlation,may be optimized based on the available making.The problem domain dictates the choice of training set.Often,the template itself is learned from the sensor(s),preprocessing technique,representation scheme, training set.Template matching is computationally de- and the decision making model.It is generally agreed that a manding,but the availability of faster processors has now
various physical attributes such as face and fingerprints). Picard [125] has identified a novel application of pattern recognition, called affective computing which will give a computer the ability to recognize and express emotions, to respond intelligently to human emotion, and to employ mechanisms of emotion that contribute to rational decision making. A common characteristic of a number of these applications is that the available features (typically, in the thousands) are not usually suggested by domain experts, but must be extracted and optimized by data-driven procedures. The rapidly growing and available computing power, while enabling faster processing of huge data sets, has also facilitated the use of elaborate and diverse methods for data analysis and classification. At the same time, demands on automatic pattern recognition systems are rising enormously due to the availability of large databases and stringent performance requirements (speed, accuracy, and cost). In many of the emerging applications, it is clear that no single approach for classification is ªoptimalº and that multiple methods and approaches have to be used. Consequently, combining several sensing modalities and classifiers is now a commonly used practice in pattern recognition. The design of a pattern recognition system essentially involves the following three aspects: 1) data acquisition and preprocessing, 2) data representation, and 3) decision making. The problem domain dictates the choice of sensor(s), preprocessing technique, representation scheme, and the decision making model. It is generally agreed that a well-defined and sufficiently constrained recognition problem (small intraclass variations and large interclass variations) will lead to a compact pattern representation and a simple decision making strategy. Learning from a set of examples (training set) is an important and desired attribute of most pattern recognition systems. The four best known approaches for pattern recognition are: 1) template matching, 2) statistical classification, 3) syntactic or structural matching, and 4) neural networks. These models are not necessarily independent and sometimes the same pattern recognition method exists with different interpretations. Attempts have been made to design hybrid systems involving multiple models [57]. A brief description and comparison of these approaches is given below and summarized in Table 2. 1.2 Template Matching One of the simplest and earliest approaches to pattern recognition is based on template matching. Matching is a generic operation in pattern recognition which is used to determine the similarity between two entities (points, curves, or shapes) of the same type. In template matching, a template (typically, a 2D shape) or a prototype of the pattern to be recognized is available. The pattern to be recognized is matched against the stored template while taking into account all allowable pose (translation and rotation) and scale changes. The similarity measure, often a correlation, may be optimized based on the available training set. Often, the template itself is learned from the training set. Template matching is computationally demanding, but the availability of faster processors has now JAIN ET AL.: STATISTICAL PATTERN RECOGNITION: A REVIEW 5 TABLE 1 Examples of Pattern Recognition Applications
6 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL 22,NO.1,JANUARY 2000 TABLE 2 Pattern Recognition Models Approach Representation Recognition Function Typical Criterion Template matching Samples,pixels,curves Correlation,distance measure Classification error Statistical Features Discriminant function Classification error Syntactic or structural Primitives Rules,grammar Acceptance error Neural networks Samples,pixels,features Network function Mean square error made this approach more feasible.The rigid template 1.4 Syntactic Approach matching mentioned above,while effective in some In many recognition problems involving complex patterns, application domains,has a number of disadvantages.For it is more appropriate to adopt a hierarchical perspective instance,it would fail if the patterns are distorted due to the where a pattern is viewed as being composed of simple imaging process,viewpoint change,or large intraclass subpatterns which are themselves built from yet simpler variations among the patterns.Deformable template models subpatterns [56],[121].The simplest/elementary subpat- [69]or rubber sheet deformations [9]can be used to match terns to be recognized are called primitives and the given patterns when the deformation cannot be easily explained complex pattern is represented in terms of the interrelation- or modeled directly. ships between these primitives.In syntactic pattern recog- nition,a formal analogy is drawn between the structure of 1.3 Statistical Approach patterns and the syntax of a language.The patterns are In the statistical approach,each pattern is represented in viewed as sentences belonging to a language,primitives are terms of d features or measurements and is viewed as a viewed as the alphabet of the language,and the sentences are generated according to a grammar.Thus,a large point in a d-dimensional space.The goal is to choose those collection of complex patterns can be described by a small features that allow pattern vectors belonging to different number of primitives and grammatical rules.The grammar categories to occupy compact and disjoint regions in a for each pattern class must be inferred from the available d-dimensional feature space.The effectiveness of the training samples. representation space (feature set)is determined by how Structural pattern recognition is intuitively appealing well patterns from different classes can be separated.Given because,in addition to classification,this approach also a set of training patterns from each class,the objective is to provides a description of how the given pattern is establish decision boundaries in the feature space which constructed from the primitives.This paradigm has been separate patterns belonging to different classes.In the used in situations where the patterns have a definite statistical decision theoretic approach,the decision bound- structure which can be captured in terms of a set of rules, aries are determined by the probability distributions of the such as EKG waveforms,textured images,and shape patterns belonging to each class,which must either be analysis of contours [56].The implementation of a syntactic specified or learned [41],[44]. approach,however,leads to many difficulties which primarily have to do with the segmentation of noisy One can also take a discriminant analysis-based ap- patterns (to detect the primitives)and the inference of the proach to classification:First a parametric form of the grammar from training data.Fu [56]introduced the notion decision boundary (e.g,linear or quadratic)is specified; of attributed grammars which unifies syntactic and statis- then the "best"decision boundary of the specified form is tical pattern recognition.The syntactic approach may yield found based on the classification of training patterns.Such a combinatorial explosion of possibilities to be investigated, boundaries can be constructed using,for example,a mean demanding large training sets and very large computational squared error criterion.The direct boundary construction efforts [122]. approaches are supported by Vapnik's philosophy [162]:"If 1.5 Neural Networks you possess a restricted amount of information for solving some problem,try to solve the problem directly and never Neural networks can be viewed as massively parallel solve a more general problem as an intermediate step.It is computing systems consisting of an extremely large number of simple processors with many interconnections. possible that the available information is sufficient for a Neural network models attempt to use some organiza- direct solution but is insufficient for solving a more general tional principles (such as learning,generalization,adap- intermediate problem." tivity,fault tolerance and distributed representation,and
made this approach more feasible. The rigid template matching mentioned above, while effective in some application domains, has a number of disadvantages. For instance, it would fail if the patterns are distorted due to the imaging process, viewpoint change, or large intraclass variations among the patterns. Deformable template models [69] or rubber sheet deformations [9] can be used to match patterns when the deformation cannot be easily explained or modeled directly. 1.3 Statistical Approach In the statistical approach, each pattern is represented in terms of d features or measurements and is viewed as a point in a d-dimensional space. The goal is to choose those features that allow pattern vectors belonging to different categories to occupy compact and disjoint regions in a d-dimensional feature space. The effectiveness of the representation space (feature set) is determined by how well patterns from different classes can be separated. Given a set of training patterns from each class, the objective is to establish decision boundaries in the feature space which separate patterns belonging to different classes. In the statistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the patterns belonging to each class, which must either be specified or learned [41], [44]. One can also take a discriminant analysis-based approach to classification: First a parametric form of the decision boundary (e.g., linear or quadratic) is specified; then the ªbestº decision boundary of the specified form is found based on the classification of training patterns. Such boundaries can be constructed using, for example, a mean squared error criterion. The direct boundary construction approaches are supported by Vapnik's philosophy [162]: ªIf you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.º 1.4 Syntactic Approach In many recognition problems involving complex patterns, it is more appropriate to adopt a hierarchical perspective where a pattern is viewed as being composed of simple subpatterns which are themselves built from yet simpler subpatterns [56], [121]. The simplest/elementary subpatterns to be recognized are called primitives and the given complex pattern is represented in terms of the interrelationships between these primitives. In syntactic pattern recognition, a formal analogy is drawn between the structure of patterns and the syntax of a language. The patterns are viewed as sentences belonging to a language, primitives are viewed as the alphabet of the language, and the sentences are generated according to a grammar. Thus, a large collection of complex patterns can be described by a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from the available training samples. Structural pattern recognition is intuitively appealing because, in addition to classification, this approach also provides a description of how the given pattern is constructed from the primitives. This paradigm has been used in situations where the patterns have a definite structure which can be captured in terms of a set of rules, such as EKG waveforms, textured images, and shape analysis of contours [56]. The implementation of a syntactic approach, however, leads to many difficulties which primarily have to do with the segmentation of noisy patterns (to detect the primitives) and the inference of the grammar from training data. Fu [56] introduced the notion of attributed grammars which unifies syntactic and statistical pattern recognition. The syntactic approach may yield a combinatorial explosion of possibilities to be investigated, demanding large training sets and very large computational efforts [122]. 1.5 Neural Networks Neural networks can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. Neural network models attempt to use some organizational principles (such as learning, generalization, adaptivity, fault tolerance and distributed representation, and 6 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 22, NO. 1, JANUARY 2000 TABLE 2 Pattern Recognition Models
JAIN ET AL.:STATISTICAL PATTERN RECOGNITION:A REVIEW 7 computation)in a network of weighted directed graphs and demonstrated to be useful in practical applications, in which the nodes are artificial neurons and directed along with the new trends and ideas edges (with weights)are connections between neuron The literature on pattern recognition is vast and outputs and neuron inputs.The main characteristics of scattered in numerous journals in several disciplines neural networks are that they have the ability to learn (e.g,applied statistics,machine learning,neural net- complex nonlinear input-output relationships,use se- works,and signal and image processing).A quick scan of quential training procedures,and adapt themselves to the table of contents of all the issues of the IEEE the data. Transactions on Pattern Analysis and Machine Intelligence, The most commonly used family of neural networks for since its first publication in January 1979,reveals that pattern classification tasks [83]is the feed-forward network, approximately 350 papers deal with pattern recognition. which includes multilayer perceptron and Radial-Basis Approximately 300 of these papers covered the statistical Function (RBF)networks.These networks are organized approach and can be broadly categorized into the into layers and have unidirectional connections between the following subtopics:curse of dimensionality (15),dimen- layers.Another popular network is the Self-Organizing sionality reduction (50),classifier design (175),classifier Map (SOM),or Kohonen-Network [92],which is mainly combination (10),error estimation(25)and unsupervised used for data clustering and feature mapping.The learning classification (50).In addition to the excellent textbooks process involves updating network architecture and con- by Duda and Hart [44],Fukunaga [58],Devijver and nection weights so that a network can efficiently perform a Kittler [39],Devroye et al.[41],Bishop [18],Ripley [137], specific classification/clustering task.The increasing popu- Schurmann [147],and McLachlan [105],we should also larity of neural network models to solve pattern recognition point out two excellent survey papers written by Nagy problems has been primarily due to their seemingly low [111]in 1968 and by Kanal [89]in 1974.Nagy described dependence on domain-specific knowledge (relative to the early roots of pattern recognition,which at that time model-based and rule-based approaches)and due to the was shared with researchers in artificial intelligence and availability of efficient learning algorithms for practitioners perception.A large part of Nagy's paper introduced a to use. number of potential applications of pattern recognition Neural networks provide a new suite of nonlinear and the interplay between feature definition and the algorithms for feature extraction (using hidden layers) application domain knowledge.He also emphasized the and classification(e.g.,multilayer perceptrons).In addition, linear classification methods;nonlinear techniques were existing feature extraction and classification algorithms can based on polynomial discriminant functions as well as on also be mapped on neural network architectures for potential functions (similar to what are now called the efficient (hardware)implementation.In spite of the see- kernel functions).By the time Kanal wrote his survey mingly different underlying principles,most of the well- paper,more than 500 papers and about half a dozen known neural network models are implicitly equivalent or books on pattern recognition were already published. similar to classical statistical pattern recognition methods Kanal placed less emphasis on applications,but more on (see Table 3).Ripley 136]and Anderson et al.5]also modeling and design of pattern recognition systems.The discuss this relationship between neural networks and discussion on automatic feature extraction in [89]was statistical pattern recognition.Anderson et al.point out that based on various distance measures between class- "neural networks are statistics for amateurs...Most NNs conditional probability density functions and the result- conceal the statistics from the user."Despite these simila- ing error bounds.Kanal's review also contained a large rities,neural networks do offer several advantages such as, section on structural methods and pattern grammars. unified approaches for feature extraction and classification In comparison to the state of the pattern recognition field and flexible procedures for finding good,moderately as described by Nagy and Kanal in the 1960s and 1970s, nonlinear solutions. today a number of commercial pattern recognition systems 1.6 Scope and Organization are available which even individuals can buy for personal use (e.g,machine printed character recognition and In the remainder of this paper we will primarily review isolated spoken word recognition).This has been made statistical methods for pattern representation and classifica- possible by various technological developments resulting in tion,emphasizing recent developments.Whenever appro- the availability of inexpensive sensors and powerful desk- priate,we will also discuss closely related algorithms from top computers.The field of pattern recognition has become the neural networks literature.We omit the whole body of so large that in this review we had to skip detailed literature on fuzzy classification and fuzzy clustering which descriptions of various applications,as well as almost all are in our opinion beyond the scope of this paper. the procedures which model domain-specific knowledge Interested readers can refer to the well-written books on (e.g.,structural pattern recognition,and rule-based sys- fuzzy pattern recognition by Bezdek [15]and [16].In most tems).The starting point of our review (Section 2)is the of the sections,the various approaches and methods are basic elements of statistical methods for pattern recognition. summarized in tables as an easy and quick reference for the It should be apparent that a feature vector is a representa- reader.Due to space constraints,we are not able to provide tion of real world objects;the choice of the representation many details and we have to omit some of the approaches strongly influences the classification results and the associated references.Our goal is to emphasize those approaches which have been extensively evaluated 1.Its second edition by Duda,Hart,and Stork [45]is in press
computation) in a network of weighted directed graphs in which the nodes are artificial neurons and directed edges (with weights) are connections between neuron outputs and neuron inputs. The main characteristics of neural networks are that they have the ability to learn complex nonlinear input-output relationships, use sequential training procedures, and adapt themselves to the data. The most commonly used family of neural networks for pattern classification tasks [83] is the feed-forward network, which includes multilayer perceptron and Radial-Basis Function (RBF) networks. These networks are organized into layers and have unidirectional connections between the layers. Another popular network is the Self-Organizing Map (SOM), or Kohonen-Network [92], which is mainly used for data clustering and feature mapping. The learning process involves updating network architecture and connection weights so that a network can efficiently perform a specific classification/clustering task. The increasing popularity of neural network models to solve pattern recognition problems has been primarily due to their seemingly low dependence on domain-specific knowledge (relative to model-based and rule-based approaches) and due to the availability of efficient learning algorithms for practitioners to use. Neural networks provide a new suite of nonlinear algorithms for feature extraction (using hidden layers) and classification (e.g., multilayer perceptrons). In addition, existing feature extraction and classification algorithms can also be mapped on neural network architectures for efficient (hardware) implementation. In spite of the seemingly different underlying principles, most of the wellknown neural network models are implicitly equivalent or similar to classical statistical pattern recognition methods (see Table 3). Ripley [136] and Anderson et al. [5] also discuss this relationship between neural networks and statistical pattern recognition. Anderson et al. point out that ªneural networks are statistics for amateurs... Most NNs conceal the statistics from the user.º Despite these similarities, neural networks do offer several advantages such as, unified approaches for feature extraction and classification and flexible procedures for finding good, moderately nonlinear solutions. 1.6 Scope and Organization In the remainder of this paper we will primarily review statistical methods for pattern representation and classification, emphasizing recent developments. Whenever appropriate, we will also discuss closely related algorithms from the neural networks literature. We omit the whole body of literature on fuzzy classification and fuzzy clustering which are in our opinion beyond the scope of this paper. Interested readers can refer to the well-written books on fuzzy pattern recognition by Bezdek [15] and [16]. In most of the sections, the various approaches and methods are summarized in tables as an easy and quick reference for the reader. Due to space constraints, we are not able to provide many details and we have to omit some of the approaches and the associated references. Our goal is to emphasize those approaches which have been extensively evaluated and demonstrated to be useful in practical applications, along with the new trends and ideas. The literature on pattern recognition is vast and scattered in numerous journals in several disciplines (e.g., applied statistics, machine learning, neural networks, and signal and image processing). A quick scan of the table of contents of all the issues of the IEEE Transactions on Pattern Analysis and Machine Intelligence, since its first publication in January 1979, reveals that approximately 350 papers deal with pattern recognition. Approximately 300 of these papers covered the statistical approach and can be broadly categorized into the following subtopics: curse of dimensionality (15), dimensionality reduction (50), classifier design (175), classifier combination (10), error estimation (25) and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44],1 Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schurmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 and by Kanal [89] in 1974. Nagy described the early roots of pattern recognition, which at that time was shared with researchers in artificial intelligence and perception. A large part of Nagy's paper introduced a number of potential applications of pattern recognition and the interplay between feature definition and the application domain knowledge. He also emphasized the linear classification methods; nonlinear techniques were based on polynomial discriminant functions as well as on potential functions (similar to what are now called the kernel functions). By the time Kanal wrote his survey paper, more than 500 papers and about half a dozen books on pattern recognition were already published. Kanal placed less emphasis on applications, but more on modeling and design of pattern recognition systems. The discussion on automatic feature extraction in [89] was based on various distance measures between classconditional probability density functions and the resulting error bounds. Kanal's review also contained a large section on structural methods and pattern grammars. In comparison to the state of the pattern recognition field as described by Nagy and Kanal in the 1960s and 1970s, today a number of commercial pattern recognition systems are available which even individuals can buy for personal use (e.g., machine printed character recognition and isolated spoken word recognition). This has been made possible by various technological developments resulting in the availability of inexpensive sensors and powerful desktop computers. The field of pattern recognition has become so large that in this review we had to skip detailed descriptions of various applications, as well as almost all the procedures which model domain-specific knowledge (e.g., structural pattern recognition, and rule-based systems). The starting point of our review (Section 2) is the basic elements of statistical methods for pattern recognition. It should be apparent that a feature vector is a representation of real world objects; the choice of the representation strongly influences the classification results. JAIN ET AL.: STATISTICAL PATTERN RECOGNITION: A REVIEW 7 1. Its second edition by Duda, Hart, and Stork [45] is in press
8 EEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,VOL 22,NO.1,JANUARY 2000 TABLE 3 Links Between Statistical and Neural Network Methods Statistical Pattern Recognition Artificial Neural Networks Linear Discriminant Function Perceptron Principal Component Analysis Auto-Associative Network,and various PCA networks A Posteriori Probability Estimation Multilayer Perceptron Nonlinear Discriminant Analysis Multilayer Perceptron Parzen Window Density-based Classifier Radial Basis Function Network Edited K-NN Rule Kohonen's LVQ The topic of probabilistic distance measures is cur- recognition.To this purpose,we have included a number rently not as important as 20 years ago,since it is very of examples to illustrate the performance of various difficult to estimate density functions in high dimensional algorithms.Nevertheless,we realize that,due to space feature spaces.Instead,the complexity of classification limitations,we have not been able to introduce all the procedures and the resulting accuracy have gained a concepts completely.At these places,we have to rely on large interest.The curse of dimensionality (Section 3)as the background knowledge which may be available only well as the danger of overtraining are some of the to the more experienced readers. consequences of a complex classifier.It is now under- stood that these problems can,to some extent,be circumvented using regularization,or can even be 2 STATISTICAL PATTERN RECOGNITION completely resolved by a proper design of classification Statistical pattern recognition has been used successfully to procedures.The study of support vector machines design a number of commercial recognition systems.In (SVMs),discussed in Section 5,has largely contributed statistical pattern recognition,a pattern is represented by a to this understanding.In many real world problems, set of d features,or attributes,viewed as a d-dimensional patterns are scattered in high-dimensional (often)non- feature vector.Well-known concepts from statistical linear subspaces.As a consequence,nonlinear procedures decision theory are utilized to establish decision boundaries and subspace approaches have become popular,both for dimensionality reduction (Section 4)and for building between pattern classes.The recognition system is operated classifiers (Section 5).Neural networks offer powerful in two modes:training (learning)and classification(testing) tools for these purposes.It is now widely accepted that (see Fig.1).The role of the preprocessing module is to no single procedure will completely solve a complex segment the pattern of interest from the background, classification problem.There are many admissible ap- remove noise,normalize the pattern,and any other proaches,each capable of discriminating patterns in operation which will contribute in defining a compact certain portions of the feature space.The combination of representation of the pattern.In the training mode,the classifiers has,therefore,become a heavily studied topic feature extraction/selection module finds the appropriate (Section 6).Various approaches to estimating the error features for representing the input patterns and the rate of a classifier are presented in Section 7.The topic of classifier is trained to partition the feature space.The unsupervised classification or clustering is covered in feedback path allows a designer to optimize the preproces- Section 8.Finally,Section 9 identifies the frontiers of sing and feature extraction/selection strategies.In the pattern recognition. classification mode,the trained classifier assigns the input It is our goal that most parts of the paper can be pattern to one of the pattern classes under consideration appreciated by a newcomer to the field of pattern based on the measured features. test Feature Preprocessing Measurement Classification pattern Classification Training Feature training Preprocessing Extraction/ Learning pattern Selection Fig.1.Model for statistical pattem recognition
The topic of probabilistic distance measures is currently not as important as 20 years ago, since it is very difficult to estimate density functions in high dimensional feature spaces. Instead, the complexity of classification procedures and the resulting accuracy have gained a large interest. The curse of dimensionality (Section 3) as well as the danger of overtraining are some of the consequences of a complex classifier. It is now understood that these problems can, to some extent, be circumvented using regularization, or can even be completely resolved by a proper design of classification procedures. The study of support vector machines (SVMs), discussed in Section 5, has largely contributed to this understanding. In many real world problems, patterns are scattered in high-dimensional (often) nonlinear subspaces. As a consequence, nonlinear procedures and subspace approaches have become popular, both for dimensionality reduction (Section 4) and for building classifiers (Section 5). Neural networks offer powerful tools for these purposes. It is now widely accepted that no single procedure will completely solve a complex classification problem. There are many admissible approaches, each capable of discriminating patterns in certain portions of the feature space. The combination of classifiers has, therefore, become a heavily studied topic (Section 6). Various approaches to estimating the error rate of a classifier are presented in Section 7. The topic of unsupervised classification or clustering is covered in Section 8. Finally, Section 9 identifies the frontiers of pattern recognition. It is our goal that most parts of the paper can be appreciated by a newcomer to the field of pattern recognition. To this purpose, we have included a number of examples to illustrate the performance of various algorithms. Nevertheless, we realize that, due to space limitations, we have not been able to introduce all the concepts completely. At these places, we have to rely on the background knowledge which may be available only to the more experienced readers. 2 STATISTICAL PATTERN RECOGNITION Statistical pattern recognition has been used successfully to design a number of commercial recognition systems. In statistical pattern recognition, a pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector. Well-known concepts from statistical decision theory are utilized to establish decision boundaries between pattern classes. The recognition system is operated in two modes: training (learning) and classification (testing) (see Fig. 1). The role of the preprocessing module is to segment the pattern of interest from the background, remove noise, normalize the pattern, and any other operation which will contribute in defining a compact representation of the pattern. In the training mode, the feature extraction/selection module finds the appropriate features for representing the input patterns and the classifier is trained to partition the feature space. The feedback path allows a designer to optimize the preprocessing and feature extraction/selection strategies. In the classification mode, the trained classifier assigns the input pattern to one of the pattern classes under consideration based on the measured features. 8 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 22, NO. 1, JANUARY 2000 TABLE 3 Links Between Statistical and Neural Network Methods Fig. 1. Model for statistical pattern recognition