
Pattern Recognition course teaching resources (books and literature): Data Clustering - A Review (A.K. JAIN, M.N. MURTY, P.J. FLYNN)

Data Clustering: A Review

A.K. JAIN
Michigan State University

M.N. MURTY
Indian Institute of Science

AND

P.J. FLYNN
The Ohio State University

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications - Computer vision; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering; I.2.6 [Artificial Intelligence]: Learning - Knowledge acquisition

General Terms: Algorithms

Additional Key Words and Phrases: Cluster analysis, clustering applications, exploratory data analysis, incremental clustering, similarity indices, unsupervised learning

Section 6.1 is based on the chapter “Image Segmentation Using Clustering” by A.K. Jain and P.J. Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.

Authors’ addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 2000 ACM 0360-0300/99/0900–0001 $5.00

ACM Computing Surveys, Vol. 31, No. 3, September 1999

CONTENTS

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User’s Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary

1. INTRODUCTION

1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

The term “clustering” is used in several research communities to describe

methods for grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities.

The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

1.2 Components of a Clustering Task

Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) pattern representation (optionally including feature extraction and/or selection),
(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.
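The first four steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the survey: the function names (`euclidean`, `group`, `centroids`), the sample patterns, and the threshold value 2.0 are all hypothetical, and the grouping rule (link any two patterns closer than a threshold, then take connected sets as clusters) is just one simple stand-in for step 3.

```python
import math

def euclidean(a, b):
    """Step 2: a pattern proximity measure (Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group(patterns, threshold):
    """Step 3: grouping. Patterns closer than `threshold` are linked, and
    each connected set of linked patterns becomes one cluster."""
    labels = [None] * len(patterns)
    current = 0
    for i in range(len(patterns)):
        if labels[i] is not None:
            continue  # already assigned to an earlier cluster
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(patterns)):
                if labels[k] is None and euclidean(patterns[j], patterns[k]) < threshold:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

def centroids(patterns, labels):
    """Step 4: data abstraction, here one centroid per cluster."""
    sums, counts = {}, {}
    for p, lab in zip(patterns, labels):
        acc = sums.setdefault(lab, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: tuple(v / counts[lab] for v in acc) for lab, acc in sums.items()}

# Step 1: pattern representation, here raw two-dimensional feature vectors
# (hypothetical data chosen so that two groups are clearly separated).
patterns = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5), (8.0, 8.0), (8.5, 8.0)]
labels = group(patterns, threshold=2.0)   # hypothetical threshold
summary = centroids(patterns, labels)
```

Changing the distance function or the threshold changes the clusters that come out, which is the kind of dependence between stages that the feedback path in Figure 2 refers to.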
[Figure 1. Data clustering. Panel (a): input patterns; panel (b): the same patterns labeled by cluster (labels 1 to 7). Axes: X and Y.]

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the

practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

[Figure 2. Stages in clustering: Patterns, Feature Selection/Extraction, Pattern Representations, Interpattern Similarity, Grouping, Clusters, with a feedback loop.]

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities [Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976]. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [Michalski and Stepp 1983]. Distance measures are discussed in Section 4.

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic [Brailovski 1991] and graph-theoretic [Zahn 1971] clustering methods. The variety of techniques for cluster formation is described in Section 5.

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976].

How is the output of a clustering algorithm evaluated? What characterizes a ‘good’ clustering result and a ‘poor’ one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data do contain clusters, some clustering algorithms may obtain ‘better’ clusters than others. The assessment of a clustering procedure’s output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. The interested reader is referred to Dubes [1987] and Cheng [1995] for information.

Cluster validity analysis, by contrast, is the assessment of a clustering procedure’s output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at

subjectively. Hence, little in the way of ‘gold standards’ exists in clustering except in well-prescribed subdomains. Validity assessments are objective [Dubes 1993] and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm. When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit. Indices used for this comparison are discussed in detail in Jain and Dubes [1988] and Dubes [1993], and are not discussed further in this paper.

1.3 The User’s Dilemma and the Role of Expertise

The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand. In Dubes and Jain [1976], a set of admissibility criteria defined by Fisher and Van Ness [1971] is used to compare clustering algorithms. These admissibility criteria are based on: (1) the manner in which clusters are formed, (2) the structure of the data, and (3) sensitivity of the clustering technique to changes that do not affect the structure of the data. However, there is no critical analysis of clustering algorithms dealing with important questions such as:

- How should the data be normalized?
- Which similarity measure is appropriate to use in a given situation?
- How should domain knowledge be utilized in a particular clustering problem?
- How can a very large data set (say, a million patterns) be clustered efficiently?
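To see why the first question in the list matters, consider a small Python sketch showing that the nearest neighbor of a pattern under Euclidean distance can change once the features are rescaled. Everything here is an illustrative assumption rather than material from the survey: the sample patterns, the helper names (`min_max_normalize`, `nearest`), and the choice of min-max scaling as the normalization.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_normalize(patterns):
    """Rescale every feature to [0, 1] (one possible normalization)."""
    lows = [min(col) for col in zip(*patterns)]
    highs = [max(col) for col in zip(*patterns)]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in patterns
    ]

def nearest(index, patterns):
    """Index of the pattern closest to patterns[index]."""
    others = [i for i in range(len(patterns)) if i != index]
    return min(others, key=lambda i: euclidean(patterns[index], patterns[i]))

# Hypothetical patterns: (height in metres, weight in grams).
raw = [(1.50, 60000.0), (1.90, 60500.0), (1.52, 62000.0), (1.70, 90000.0)]
scaled = min_max_normalize(raw)

nearest_raw = nearest(0, raw)        # weight dominates the raw distance
nearest_scaled = nearest(0, scaled)  # both features now contribute comparably
```

On the raw data the weight feature, measured in much larger units, decides which pattern is closest to the first one; after rescaling, a different pattern becomes its nearest neighbor. A clustering built on these distances would differ accordingly.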
These issues have motivated this survey, and its aim is to provide a perspective on the state of the art in clustering methodology and algorithms. With such a perspective, an informed practitioner should be able to confidently assess the tradeoffs of different techniques, and ultimately make a competent decision on a technique or suite of techniques to employ in a particular application.

There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. For example, consider the two-dimensional data set shown in Figure 1(a). Not all clustering techniques can uncover all the clusters present here with equal facility, because clustering algorithms often contain implicit assumptions about cluster shape or multiple-cluster configurations based on the similarity measures and grouping criteria used.

Humans perform competitively with automatic clustering procedures in two dimensions, but most real problems involve clustering in higher dimensions. It is difficult for humans to obtain an intuitive interpretation of data embedded in a high-dimensional space. In addition, data hardly follow the “ideal” structures (e.g., hyperspherical, linear) shown in Figure 1. This explains the large number of clustering algorithms which continue to appear in the literature; each new clustering algorithm performs slightly better than the existing ones on a specific distribution of patterns.

It is essential for the user of a clustering algorithm to not only have a thorough understanding of the particular technique being utilized, but also to know the details of the data gathering process and to have some domain expertise; the more information the user has about the data at hand, the more likely the user would be able to succeed in assessing its true class structure [Jain and Dubes 1988]. This domain informa-
