(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.

Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value.

Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA [Ball and Hall 1965] algorithm employs this technique of merging and splitting clusters. If ISODATA is given the "ellipse" partitioning shown in Figure 14 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster, because the distance between their centroids is small, and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.

Figure 14. The k-means algorithm is sensitive to the initial partition.

Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973], and Symon [1977] describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.

5.2.2 Graph-Theoretic Clustering. The best-known graph-theoretic divisive clustering algorithm is based on construction of the minimal spanning tree (MST) of the data [Zahn 1971], and then deleting the MST edges with the largest lengths to generate clusters. Figure 15 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units.

Figure 15. Using the minimal spanning tree to form clusters.

The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969], which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colorability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981].
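The edge-deletion procedure just described is easy to realize with standard tools. The following is a minimal sketch (not Zahn's original algorithm or the authors' code), assuming SciPy is available; the helper name mst_clusters and the six sample points are hypothetical choices for illustration, and ties among the longest edges are deleted together.

```python
# Minimal sketch: build the MST of the points, delete the (k - 1) longest MST
# edges, and report the resulting connected components as clusters.
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(points, num_clusters):
    """Split points into num_clusters groups by deleting the longest MST edges."""
    dists = distance_matrix(points, points)        # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists).toarray()   # nonzero entries are MST edge lengths
    if num_clusters > 1:
        edge_lengths = np.sort(mst[mst > 0])
        threshold = edge_lengths[-(num_clusters - 1)]
        mst[mst >= threshold] = 0                  # delete the longest edges (ties removed together)
    _, labels = connected_components(mst, directed=False)
    return labels

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
print(mst_clusters(points, 2))   # e.g. [0 0 0 1 1 1]
```

On the nine points of Figure 15, deleting the two longest edges (CD, of length 6, and EF, of length 4.5) in this way would reproduce the three clusters discussed above.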
A graph-oriented approach for non-hierarchical structures and overlapping clusters is presented in Ozawa [1985]. The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbors. The DG contains all the neighborhood information contained in the MST and the relative neighborhood graph (RNG) [Toussaint 1980].

5.3 Mixture-Resolving and Mode-Seeking Algorithms

The mixture-resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988].

More recently, the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977] for missing-data problems) has been applied to the problem of parameter estimation. A recent book [Mitchell 1997] provides an accessible description of the technique. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster.

Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.

5.4 Nearest Neighbor Clustering

Since proximity plays a key role in our intuitive notion of a cluster, nearest-neighbor distances can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978]; it assigns each unlabeled pattern to the cluster of its nearest labeled neighbor pattern, provided the distance to that labeled neighbor is below a threshold. The process continues until all patterns are labeled or no additional labelings occur. The mutual neighborhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbors.
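The labeling procedure described above translates directly into code. The following is a rough sketch in the spirit of Lu and Fu [1978], not their exact algorithm: the seed labels, distance threshold, and sample patterns are hypothetical, and Euclidean distance is assumed.

```python
# Minimal sketch: seed patterns carry cluster labels; each remaining pattern joins
# the cluster of its nearest labeled neighbor, provided that neighbor is closer
# than a threshold. Iterate until no additional labelings occur.
import numpy as np

def nearest_neighbor_clustering(patterns, seed_labels, threshold):
    """seed_labels: dict mapping pattern index -> cluster id for the labeled seeds."""
    labels = dict(seed_labels)
    changed = True
    while changed:                        # stop when a full pass adds no labels
        changed = False
        for i in range(len(patterns)):
            if i in labels:
                continue
            labeled = list(labels.keys())
            # distances from pattern i to every currently labeled pattern
            dists = np.linalg.norm(patterns[labeled] - patterns[i], axis=1)
            if dists.min() < threshold:   # join the nearest labeled neighbor's cluster
                labels[i] = labels[labeled[int(np.argmin(dists))]]
                changed = True
    return labels                         # patterns farther than the threshold stay unlabeled

patterns = np.array([[0.0, 0.0], [0.3, 0.1], [0.5, 0.4],
                     [4.0, 4.0], [4.2, 3.9], [9.0, 9.0]])
print(nearest_neighbor_clustering(patterns, seed_labels={0: 0, 3: 1}, threshold=1.0))
```

In this toy run the isolated pattern at (9, 9) remains unlabeled, which mirrors the threshold condition in the text.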
5.5 Fuzzy Clustering

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function [Zadeh 1965]. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.

Fuzzy Clustering Algorithm

(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the $N \times K$ membership matrix U. An element $u_{ij}$ of this matrix represents the grade of membership of object $x_i$ in cluster $c_j$. Typically, $u_{ij} \in [0,1]$.

(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

$$E^2(\mathcal{X}, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \, \lVert x_i - c_k \rVert^2,$$

where $c_k = \sum_{i=1}^{N} u_{ik} x_i$ is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U.

(3) Repeat step 2 until entries in U do not change significantly.

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 16 illustrates the idea. The rectangles enclose two "hard" clusters in the data: H1 = {1,2,3,4,5} and H2 = {6,7,8,9}. A fuzzy clustering algorithm might produce the two fuzzy clusters F1 and F2 depicted by ellipses. The patterns will have membership values in [0,1] for each cluster. For example, fuzzy cluster F1 could be compactly described as

{(1, 0.9), (2, 0.8), (3, 0.7), (4, 0.6), (5, 0.55), (6, 0.2), (7, 0.2), (8, 0.0), (9, 0.0)}

and F2 could be described as

{(1, 0.0), (2, 0.0), (3, 0.0), (4, 0.1), (5, 0.15), (6, 0.4), (7, 0.35), (8, 1.0), (9, 0.9)}.

The ordered pairs $(i, \mu_i)$ in each cluster represent the ith pattern and its membership value $\mu_i$ in the cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

Figure 16. Fuzzy clusters.

Fuzzy set theory was initially applied to clustering in Ruspini [1969]. The book by Bezdek [1981] is a good source for material on fuzzy clustering. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion.
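Since the text singles out fuzzy c-means, a compact sketch of its standard alternating updates may be helpful. Note that this follows the usual FCM formulation, with a fuzzifier m and centers normalized by the fuzzified memberships, which differs slightly from the simplified criterion given above; the sample data, fuzzifier, and tolerance are illustrative choices rather than values from the paper.

```python
# Minimal sketch of standard fuzzy c-means: alternately update cluster centers
# (membership-weighted means) and memberships (inverse-distance ratios) until
# the membership matrix U stops changing appreciably.
import numpy as np

def fuzzy_c_means(X, K, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Return (U, centers): U[i, k] is the membership of pattern i in cluster k."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, K))
    U /= U.sum(axis=1, keepdims=True)                  # each row of U sums to one
    for _ in range(max_iter):
        W = U ** m                                     # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted cluster centers
        # d[i, k] = distance from pattern i to center k (epsilon avoids division by zero)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # u[i, k] = 1 / sum_j (d[i, k] / d[i, j]) ** (2 / (m - 1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:              # U no longer changes significantly
            return U_new, centers
        U = U_new
    return U, centers

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]])
U, centers = fuzzy_c_means(X, K=2)
print(U.round(2))   # soft memberships in [0, 1]; thresholding them yields a hard clustering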
The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and on centroids of clusters. A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries were presented in Dave [1992].

5.6 Representation of Clusters

In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski et al. [1981]. They suggested the following representation schemes:

(1) Represent a cluster of points by their centroid or by a set of distant points in the cluster. Figure 17 depicts these two ideas.

(2) Represent clusters using nodes in a classification tree. This is illustrated in Figure 18.

(3) Represent clusters by using conjunctive logical expressions. For example, the expression $[X_1 > 3][X_2 < 2]$ in Figure 18 stands for the logical statement '$X_1$ is greater than 3' and '$X_2$ is less than 2'.

Figure 17. Representation of a cluster by points (by the centroid, or by three distant points).

Figure 18. Representation of clusters by a classification tree or by conjunctive statements.

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 18 are equivalent: every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.

Data abstraction is useful in decision making because of the following:

(1) It gives a simple and intuitive description of clusters which is easy for human comprehension. In both conceptual clustering [Michalski and Stepp 1983] and symbolic clustering [Gowda and Diday 1992] this representation is obtained without using an additional step; these algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers.

(2) It helps in achieving data compression that can be exploited further by a computer [Murty and Krishna 1980]. Figure 19(a) shows samples belonging to two chain-like clusters labeled 1 and 2. A partitional clustering algorithm like k-means cannot separate these two structures properly. The single-link algorithm works well on this data, but is computationally expensive. So a hybrid approach may be used to exploit the desirable properties of both these algorithms. We obtain 8 subclusters of the data using the (computationally efficient) k-means algorithm. Each of these subclusters can be represented by its centroid, as shown in Figure 19(a). Now the single-link algorithm can be applied on these centroids alone to cluster them into 2 groups. The resulting groups are shown in Figure 19(b). Here, a data reduction is achieved by representing the subclusters by their centroids (a sketch of this hybrid scheme is given after this list).

(3) It increases the efficiency of the decision making task. In a cluster-based document retrieval technique [Salton 1991], a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also, in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making [Dorai and Jain 1995].
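As promised in item (2), the following is a minimal sketch of that hybrid scheme under stated assumptions: scikit-learn's k-means and SciPy's single-link routines stand in for the algorithms named in the text, and the synthetic "two chains" data only mimics the flavor of Figure 19, which is not reproduced here.

```python
# Minimal sketch of the hybrid scheme: k-means produces many small subclusters
# cheaply; single-link clustering is then applied only to the subcluster centroids,
# and each point inherits the group of its subcluster.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 100)
chain1 = np.c_[t, np.sin(t)] + rng.normal(scale=0.05, size=(100, 2))        # chain-like cluster 1
chain2 = np.c_[t, np.sin(t) + 3.0] + rng.normal(scale=0.05, size=(100, 2))  # chain-like cluster 2
X = np.vstack([chain1, chain2])

# Step 1: cheap partitional clustering into 8 subclusters.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Step 2: single-link hierarchical clustering on the 8 subcluster centroids only.
centroid_groups = fcluster(linkage(km.cluster_centers_, method="single"),
                           t=2, criterion="maxclust")

# Each original point inherits the group of its subcluster's centroid.
point_groups = centroid_groups[km.labels_]
print(np.bincount(point_groups)[1:])   # sizes of the two recovered groups, e.g. [100 100]
```

The data reduction is exactly the one described above: the expensive single-link step sees only 8 centroids instead of 200 points.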
5.7 Artificial Neural Networks for Clustering

Artificial neural networks (ANNs) [Hertz et al. 1991] are motivated by biological neural networks. ANNs have been used extensively over the past three decades for both classification and clustering [Sethi and Jain 1991; Jain and Mao 1994]. Some of the features of the ANNs that are important in pattern clustering are: