Data Clustering, ACM Computing Surveys, Vol. 31, No. 3, September 1999
tion can also be used to improve the quality of feature extraction, similarity computation, grouping, and cluster representation [Murty and Jain 1995].

Appropriate constraints on the data source can be incorporated into a clustering procedure. One example of this is mixture resolving [Titterington et al. 1985], wherein it is assumed that the data are drawn from a mixture of an unknown number of densities (often assumed to be multivariate Gaussian). The clustering problem here is to identify the number of mixture components and the parameters of each component. The concept of density clustering and a methodology for decomposition of feature spaces [Bajcsy 1997] have also been incorporated into traditional clustering methodology, yielding a technique for extracting overlapping clusters.

1.4 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition [Anderberg 1973], image processing [Jain and Flynn 1996], and information retrieval [Rasmussen 1992; Salton 1991], clustering has a rich history in other disciplines [Jain and Dubes 1988] such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning [Jain and Dubes 1988], numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering are evident from its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980].
A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain et al. [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

1.5 Outline

This paper is organized as follows. Section 2 presents definitions of terms to be used throughout the paper. Section 3 summarizes pattern representation, feature extraction, and feature selection. Various approaches to the computation of proximity between patterns are discussed in Section 4. Section 5 presents a taxonomy of clustering approaches, describes the major techniques in use, and discusses emerging techniques for clustering incorporating non-numeric constraints and the clustering of large sets of patterns. Section 6 discusses applications of clustering methods to image analysis and data mining problems. Finally, Section 7 presents some concluding remarks.

2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.

- A pattern (or feature vector, observation, or datum) $\mathbf{x}$ is a single data item used by the clustering algorithm. It typically consists of a vector of $d$ measurements: $\mathbf{x} = (x_1, \ldots, x_d)$.
- The individual scalar components $x_i$ of a pattern $\mathbf{x}$ are called features (or attributes).
- $d$ is the dimensionality of the pattern or of the pattern space.
- A pattern set is denoted $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The $i$th pattern in $\mathcal{X}$ is denoted $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,d})$. In many cases a pattern set to be clustered is viewed as an $n \times d$ pattern matrix.
- A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.
- Hard clustering techniques assign a class label $l_i$ to each pattern $\mathbf{x}_i$, identifying its class. The set of all labels for a pattern set $\mathcal{X}$ is $\mathcal{L} = \{l_1, \ldots, l_n\}$, with $l_i \in \{1, \ldots, k\}$, where $k$ is the number of clusters.
- Fuzzy clustering procedures assign to each input pattern $\mathbf{x}_i$ a fractional degree of membership $f_{ij}$ in each output cluster $j$.
- A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns.

3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION

There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user's role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results.
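To illustrate how a simple transformation can change what a clustering algorithm sees, the following minimal sketch converts patterns from Cartesian to polar coordinates. The data are hypothetical (13 points on a half ring of radius about 5 around the origin, with a small alternating radial offset standing in for noise), and the helper name `to_polar` is an illustrative choice, not something defined in this paper.

```python
import math
from statistics import pstdev


def to_polar(x, y):
    """Map Cartesian (x, y) to polar (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)


# Hypothetical data: points near a half circle of radius 5 about the origin.
ring = []
for k in range(13):
    r = 5.0 + 0.1 * (-1) ** k      # small alternating radial offset
    t = k * math.pi / 12           # angles from 0 to pi
    ring.append((r * math.cos(t), r * math.sin(t)))

polar = [to_polar(x, y) for x, y in ring]

# Spread (population standard deviation) of each coordinate.
spread_x = pstdev([x for x, _ in ring])
spread_y = pstdev([y for _, y in ring])
spread_radius = pstdev([r for r, _ in polar])

print(f"spread of x: {spread_x:.3f}")
print(f"spread of y: {spread_y:.3f}")
print(f"spread of radius: {spread_radius:.3f}")
```

In this sketch the x and y features vary widely across the points, while the radius feature stays within a narrow band, so a method that looks for compact groups has a much easier task on the radius coordinate.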
A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern. Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained.

A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then (20, black) is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]:

(1) Quantitative features:
    (a) continuous values (e.g., weight);
    (b) discrete values (e.g., the number of computers);
    (c) interval values (e.g., the duration of an event).
(2) Qualitative features:
    (a) nominal or unordered (e.g., color);
    (b) ordinal (e.g., military rank or qualitative evaluations of temperature (“cool” or “hot”) or sound intensity (“quiet” or “loud”)).

Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales.

One can also use structured features [Michalski and Stepp 1983], which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node “vehicle” may be a generalization of children labeled “cars,” “buses,” “trucks,” and “motorcycles.” Further, the node “cars” could be a generalization of cars of the type “Toyota,” “Ford,” “Benz,” etc. A generalized representation of patterns, called symbolic objects, was proposed in Diday [1988]. Symbolic objects are defined by a logical conjunction of events. These events link values and features in which the features can take one or more values and all the objects need not be defined on the same set of features.

It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a well-explored topic in statistical pattern recognition [Duda and Hart 1973]; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trial-and-error process where various subsets of features are selected, the resulting patterns clustered, and the output evaluated using a validity index.
In contrast, some of the popular feature extraction processes (e.g., principal components analysis [Fukunaga 1990]) do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human.

Figure 3. A curvilinear cluster whose points are approximately equidistant from the origin. Different pattern representations (coordinate systems) would cause clustering algorithms to yield different results for this data (see text).

4. SIMILARITY MEASURES

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

The most popular metric for continuous features is the Euclidean distance

$$d_2(\mathbf{x}_i, \mathbf{x}_j) = \Bigl(\sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2\Bigr)^{1/2} = \|\mathbf{x}_i - \mathbf{x}_j\|_2,$$

which is a special case ($p = 2$) of the Minkowski metric
$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \Bigl(\sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p\Bigr)^{1/p} = \|\mathbf{x}_i - \mathbf{x}_j\|_p.$$

The Euclidean distance has an intuitive appeal as it is commonly used to evaluate the proximity of objects in two- or three-dimensional space. It works well when a data set has “compact” or “isolated” clusters [Mao and Jain 1996]. The drawback to direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)\,\Sigma^{-1}\,(\mathbf{x}_i - \mathbf{x}_j)^T,$$

where the patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ are assumed to be row vectors, and $\Sigma$ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; $d_M(\cdot, \cdot)$ assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to precompute all the $n(n-1)/2$ pairwise distance values for the $n$ patterns and store them in a (symmetric) matrix.
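The Minkowski metric, the normalization remedy for scale dominance, and the precomputed table of $n(n-1)/2$ pairwise distances can be sketched together in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names (`minkowski`, `standardize`, `pairwise`), the choice of zero-mean, unit-variance scaling as the normalization, the dictionary keyed by index pairs as the storage format, and the weight/height sample values are all hypothetical. The Mahalanobis distance is not covered here.

```python
from itertools import combinations
from statistics import mean, pstdev


def minkowski(x, y, p=2):
    """Minkowski distance d_p between two patterns; p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def standardize(patterns):
    """Rescale every feature to zero mean and unit variance."""
    columns = list(zip(*patterns))
    mus = [mean(c) for c in columns]
    # A constant feature has zero spread; divide by 1.0 in that case.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))
        for row in patterns
    ]


def pairwise(patterns, p=2):
    """Distances for all n(n-1)/2 unordered pairs, keyed by (i, j), i < j."""
    return {
        (i, j): minkowski(patterns[i], patterns[j], p)
        for i, j in combinations(range(len(patterns)), 2)
    }


# Hypothetical patterns: (weight in grams, height in metres).
patterns = [(1000.0, 1.0), (1010.0, 9.0), (1300.0, 1.1), (1310.0, 9.2)]

raw = pairwise(patterns)
scaled = pairwise(standardize(patterns))

for pair in raw:
    print(pair, f"raw={raw[pair]:.3f}", f"standardized={scaled[pair]:.3f}")
```

On the raw values the weight feature, which has the larger scale, accounts for nearly all of the distance between patterns 0 and 2; after standardization both features contribute on a comparable scale. Because the distance is symmetric and the diagonal is zero, storing only the pairs with i < j is enough to recover the full matrix.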
Computation of distances between patterns with some or all features being noncontinuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992]. A good summary of similarity measures between trees is given by Zhang [1995]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995], and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further in this paper.

There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in Michalski and Stepp [1983]. The similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$, given this context, is given by
$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{E}),$$

where $\mathcal{E}$ is the context (the set of surrounding points). One measure defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

$$\mathrm{MND}(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{NN}(\mathbf{x}_i, \mathbf{x}_j) + \mathrm{NN}(\mathbf{x}_j, \mathbf{x}_i),$$

where $\mathrm{NN}(\mathbf{x}_i, \mathbf{x}_j)$ is the neighbor number of $\mathbf{x}_j$ with respect to $\mathbf{x}_i$. Figures 4 and 5 give an example. In Figure 4, the nearest neighbor of A is B, and B's nearest neighbor is A. So, NN(A, B) = NN(B, A) = 1 and the MND between A and B is 2. However, NN(B, C) = 2 (C is the second neighbor of B, after A) but NN(C, B) = 1, and therefore MND(B, C) = 3. Figure 5 was obtained from Figure 4 by adding three new points D, E, and F. Now MND(B, C) = 3 (as before), but MND(A, B) = 5. The MND between A and B has increased by introducing additional points, even though A and B have not moved. The MND is not a metric (it does not satisfy the triangle inequality [Zhang 1995]). In spite of this, MND has been successfully applied in several clustering applications [Gowda and Diday 1992]. This observation supports the viewpoint that the dissimilarity does not need to be a metric.

Watanabe's theorem of the ugly duckling [Watanabe 1985] states:

    “Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects.”

This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. For example, in the case of conceptual clustering [Michalski and Stepp 1983], the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{C}, \mathcal{E}),$$

where $\mathcal{C}$ is a set of pre-defined concepts. This notion is illustrated with the help of Figure 6. Here, the Euclidean distance between points A and B is less than that between B and C.
However, B and C can be viewed as “more similar” than A and B because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure. We

Figure 4. A and B are more similar than A and C. (Points A, B, and C plotted against axes X1 and X2.)

Figure 5. After a change in context, B and C are more similar than B and A. (Points A through F plotted against axes X1 and X2.)
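The mutual neighbor distance computation behind Figures 4 and 5 can be sketched as follows. The coordinates are hypothetical: they are not read off the figures, only chosen so that the MND values quoted in the text (2 and 3 before the new points, 5 and 3 after) are reproduced. The function names and the tie rule (the neighbor number is one plus the count of other points that are strictly closer) are likewise assumptions of this sketch.

```python
import math


def neighbor_number(points, i, j):
    """Rank of points[j] among the neighbors of points[i] (1 = nearest).

    Assumption: ties are resolved by counting only strictly closer points.
    """
    d_ij = math.dist(points[i], points[j])
    closer = sum(
        1
        for k, p in enumerate(points)
        if k not in (i, j) and math.dist(points[i], p) < d_ij
    )
    return closer + 1


def mnd(layout, a, b):
    """Mutual neighbor distance between the points named a and b."""
    names = list(layout)
    points = [layout[name] for name in names]
    i, j = names.index(a), names.index(b)
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)


# Hypothetical layout in the spirit of Figure 4.
fig4 = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (2.5, 0.0)}

# Same layout plus three new points near A, in the spirit of Figure 5.
fig5 = dict(fig4, D=(-0.6, 0.0), E=(-0.5, 0.4), F=(-0.5, -0.4))

print(mnd(fig4, "A", "B"), mnd(fig4, "B", "C"))  # 2 3
print(mnd(fig5, "A", "B"), mnd(fig5, "B", "C"))  # 5 3
```

A and B have identical coordinates in both layouts, yet their MND rises from 2 to 5 once D, E, and F are added, which is the context dependence the text describes.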