discuss several pragmatic issues associated with its use in Section 5.

Figure 6. Conceptual similarity between points.

5. CLUSTERING TECHNIQUES

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 7 (other taxonometric representations of clustering methodology are possible; ours is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one).

The taxonomy shown in Figure 7 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.

—Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg [1973] considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 8. Here, the collection is divided into two groups using feature x_1; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature x_2, as depicted by the broken lines H_1 and H_2. The major problem with this algorithm is that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.

—Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.

—Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.

—Incremental vs. non-incremental: This issue arises when the pattern set
to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm's operations.

Figure 7. A taxonomy of clustering approaches: hierarchical (single link, complete link) and partitional (square error, including k-means; graph theoretic; mixture resolving, including expectation maximization; and mode seeking).

Figure 8. Monothetic partitional clustering.

A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.

5.1 Hierarchical Clustering Algorithms

The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 9. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 9 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 10. The dendrogram can be broken at different levels to yield different clusterings of the data.
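To make the idea of cutting a dendrogram concrete, the short sketch below builds a single-link hierarchy with SciPy and cuts it at two dissimilarity levels; the point coordinates and thresholds are hypothetical stand-ins, not the data of Figure 9.

```python
# Sketch: build a single-link dendrogram and cut it at two different
# dissimilarity levels to obtain two different partitions.
# Assumes NumPy and SciPy are available; the coordinates are made up.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Seven two-dimensional patterns (hypothetical values).
X = np.array([[0.0, 0.0], [0.3, 0.1], [0.2, 0.4],   # a tight group
              [2.0, 2.0], [2.2, 1.9],               # a second group
              [4.0, 0.2], [4.1, 0.0]])              # a third group

Z = linkage(X, method='single')   # single-link agglomerative clustering

# Cutting the dendrogram at a low dissimilarity level yields many clusters;
# cutting it at a higher level merges them into fewer clusters.
labels_fine = fcluster(Z, t=0.5, criterion='distance')
labels_coarse = fcluster(Z, t=3.0, criterion='distance')
print(labels_fine)    # three clusters at this level
print(labels_coarse)  # a single cluster at this coarser level
```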
Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, the two clusters with the smallest inter-cluster distance are merged to form a larger cluster.

The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]: it has a tendency to produce clusters that are straggly or elongated. Figures 12 and 13 show two clusters separated by a "bridge" of noisy patterns. The single-link algorithm produces the clusters shown in Figure 12, whereas the complete-link algorithm obtains the clustering shown in Figure 13. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled "*". On the other hand, the single-link algorithm is more versatile than the complete-link algorithm. For example, the single-link algorithm can extract the concentric clusters shown in Figure 11, but the complete-link algorithm cannot. However, from a pragmatic viewpoint, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].

Figure 9. Points falling in three clusters.

Figure 10. The dendrogram obtained using the single-link algorithm.

Figure 11. Two concentric clusters.

Agglomerative Single-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value d_k a graph on the patterns where pairs of patterns closer than d_k are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by simply connected components in the corresponding graph.
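A minimal sketch of this graph-based single-link procedure is given below; it uses NumPy and a small union-find in place of explicit graph construction, and the function name and example data are illustrative assumptions rather than the authors' code.

```python
# Sketch of the agglomerative single-link procedure above: sweep the sorted
# interpattern distances, connect each pair as its distance is reached, and
# record the partition given by the connected components after each merge.
# Pure NumPy; names and structure are illustrative, not from the paper.
import numpy as np

def single_link_hierarchy(X):
    n = len(X)
    # All distinct unordered pairs and their distances, sorted ascending.
    dists = [(np.linalg.norm(X[i] - X[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    dists.sort()

    parent = list(range(n))               # union-find over patterns
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    hierarchy = []                        # (level d_k, component labels)
    for d_k, i, j in dists:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj               # add edge: merge two components
            # labels[p] = representative pattern of p's connected component
            labels = np.array([find(p) for p in range(n)])
            hierarchy.append((d_k, labels))
            if len(np.unique(labels)) == 1:   # all patterns connected: stop
                break
    return hierarchy

# Usage: each recorded level is one clustering in the nested hierarchy.
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 2.2], [5.0, 5.0]])
for level, labels in single_link_hierarchy(X):
    print(round(level, 3), labels)
```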
Agglomerative Complete-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value d_k a graph on the patterns where pairs of patterns closer than d_k are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by completely connected components in the corresponding graph.

Figure 12. A single-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Figure 13. A complete-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.

Hierarchical Agglomerative Clustering Algorithm

(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.

(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.

Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed.

Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
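The sketch below illustrates the generic agglomerative procedure above under the two most common proximity-matrix update rules; the function name, the NumPy-based proximity matrix, and the example data are assumptions made for illustration, not the authors' implementation.

```python
# Sketch of the generic hierarchical agglomerative procedure: repeatedly find
# the closest pair of clusters, merge them, and update the proximity matrix.
# The update rule determines the algorithm:
#   single-link   -> take the minimum of the two merged rows
#   complete-link -> take the maximum of the two merged rows
import numpy as np

def agglomerate(X, linkage="single"):
    n = len(X)
    # Step 1: proximity matrix of all interpattern distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    clusters = {i: [i] for i in range(n)}   # cluster id -> member patterns
    merges = []                             # (members a, members b, distance)

    while len(clusters) > 1:
        # Step 2: most similar pair of still-active clusters.
        i, j = np.unravel_index(np.argmin(D), D.shape)
        merges.append((clusters[i][:], clusters[j][:], D[i, j]))
        clusters[i] += clusters.pop(j)

        # Update row/column i with the chosen linkage rule; retire j.
        combine = np.minimum if linkage == "single" else np.maximum
        new_row = combine(D[i, :], D[j, :])
        D[i, :] = new_row
        D[:, i] = new_row
        D[i, i] = np.inf
        D[j, :] = np.inf
        D[:, j] = np.inf
    return merges   # Step 3: the sequence of merges defines the hierarchy

# Usage: the two rules give different merge distances (and, in general,
# different hierarchies) on the same data.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0], [3.0, 0.5]])
print(agglomerate(X, "single"))
print(agglomerate(X, "complete"))
```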
5.2 Partitional Algorithms

A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure, such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key design decision. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.

5.2.1 Squared Error Algorithms. The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering L of a pattern set X (containing K clusters) is

e^2(X, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2,

where x_i^{(j)} is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster.

The k-means algorithm is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g., there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations). The k-means algorithm is popular because it is easy to implement, and its time complexity is O(n), where n is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 14 shows seven two-dimensional patterns. If we start with patterns A, B, and C as the initial means around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The squared error criterion value is much larger for this partition than for the best partition {{A, B, C}, {D, E}, {F, G}} shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means.

Squared Error Clustering Method

(1) Select an initial partition of the patterns with a fixed number of clusters and cluster centers.

(2) Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable.

(3) Merge and split clusters based on some heuristic information, optionally repeating step 2.
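As an illustration of the criterion defined above, the following sketch (hypothetical NumPy code and data, unrelated to Figure 14) evaluates e^2 for two different labelings of the same points and shows that a poorly chosen partition yields a larger criterion value.

```python
# Sketch: evaluate the squared-error criterion e^2(X, L) for a labeled
# pattern set, using cluster centroids computed from the labeling itself.
# Illustrative NumPy code with made-up data.
import numpy as np

def squared_error(X, labels):
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        centroid = members.mean(axis=0)                 # c_j
        total += np.sum((members - centroid) ** 2)      # sum of ||x - c_j||^2
    return total

X = np.array([[0.0, 0.0], [0.5, 0.0], [0.5, 0.5],
              [3.0, 3.0], [3.5, 3.0], [6.0, 0.0], [6.5, 0.0]])

good = np.array([0, 0, 0, 1, 1, 2, 2])   # compact, well-separated groups
poor = np.array([0, 1, 1, 2, 2, 2, 2])   # a badly chosen partition
print(squared_error(X, good))            # smaller criterion value
print(squared_error(X, poor))            # larger criterion value
```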
k-Means Clustering Algorithm

(1) Choose k cluster centers to coincide with k randomly-chosen patterns or k randomly defined points inside the hypervolume containing the pattern set.