Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching Andrew McCallum*t Kamal Nigamt Lyle H.Ungar* #WhizBang!Labs-Research tSchool of Computer Science .Computer and Info.Science 4616 Henry Street Carnegie Mellon University University of Pennsylvania Pittsburgh,PA USA Pittsburgh,PA USA Philadelphia,PA USA mccallum@cs.cmu.edu knigam@cs.cmu.edu ungar@cis.upenn.edu ABSTRACT 1.INTRODUCTION Many important problems involve clustering large datasets. Unsupervised clustering techniques have been applied to Although naive implementations of clustering are computa- many important problems.By clustering patient records, tionally expensive,there are established efficient techniques health care trends are discovered.By clustering address for clustering when the dataset has either(1)a limited num- lists.duplicate entries are eliminated.By clustering astro- ber of clusters,(2)a low feature dimensionality,or (3)a nomical data.new classes of stars are identified.By cluster- small number of data points.However,there has been much ing documents,hierarchical organizations of information are less work on methods of efficiently clustering datasets that derived.To address these applications,and many others,a are large in all three ways at once-for example,having variety of clustering algorithms have been developed. millions of data points that exist in many thousands of di- mensions representing many thousands of clusters. However.traditional clustering algorithms become compu- tationally expensive when the data set to be clustered is We present a new technique for clustering these large,high- large.There are three different ways in which the data set dimensional datasets.The key idea involves using a cheap, can be large:(1)there can be a large number of elements in approximate distance measure to efficiently divide the data the data set,(2)each element can have many features,and into overlapping subsets we call canopies.Then cluster- (3)there can be many clusters to discover.Recent advances ing is performed by measuring exact distances only between in clustering algorithms have addressed these efficiency is- points that occur in a common canopy.Using canopies,large sues,but only partially.For example,KD-trees [15]provide clustering problems that were formerly impossible become for efficient EM-style clustering of many elements,but re- practical.Under reasonable assumptions about the cheap quire that the dimensionality of each element be small.An- distance metric,this reduction in computational cost comes other algorithm 3 efficiently performs K-means clustering without any loss in clustering accuracy.Canopies can be by finding good initial starting points,but is not efficient applied to many domains and used with a variety of cluster- when the number of clusters is large.There has been al- ing approaches,including Greedy Agglomerative Clustering. most no work on algorithms that work efficiently when the K-means and Expectation-Maximization.We present ex- data set is large in all three senses at once-when there are perimental results on grouping bibliographic citations from millions of elements,many thousands of features,and many the reference sections of research papers.Here the canopy thousands of clusters. approach reduces computation time over a traditional clus- tering approach by more than an order of magnitude and This paper introduces a technique for clustering that is effi- decreases error in comparison to a previously used algorithm cient when the problem is large in all of these three ways at by25%. once.The key idea is to perform clustering in two stages. first a rough and quick stage that divides the data into over- Categories and Subject Descriptors lapping subsets we call "canopies,"then a more rigorous I.5.3 Pattern Recognition]:Clustering;H.3.3[Information final stage in which expensive distance measurements are Storage and Retrieval:Information Search and Retrieval- only made among points that occur in a common canopy Clustering This differs from previous clustering methods in that it uses two different distance metrics for the two stages,and forms overlapping regions. The first stage can make use of extremely inexpensive meth- ods for finding data elements near the center of a region Many proximity measurement methods,such as the inverted index commonly used in information retrieval systems,are very efficient in high dimensions and can find elements near the query by examining only a small fraction of a data set
Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching Andrew McCallum‡† ‡WhizBang! Labs - Research 4616 Henry Street Pittsburgh, PA USA mccallum@cs.cmu.edu Kamal Nigam† †School of Computer Science Carnegie Mellon University Pittsburgh, PA USA knigam@cs.cmu.edu Lyle H. Ungar∗ ∗Computer and Info. Science University of Pennsylvania Philadelphia, PA USA ungar@cis.upenn.edu ABSTRACT Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once—for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, highdimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%. Categories and Subject Descriptors I.5.3 [Pattern Recognition]: Clustering; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval— Clustering 1. INTRODUCTION Unsupervised clustering techniques have been applied to many important problems. By clustering patient records, health care trends are discovered. By clustering address lists, duplicate entries are eliminated. By clustering astronomical data, new classes of stars are identified. By clustering documents, hierarchical organizations of information are derived. To address these applications, and many others, a variety of clustering algorithms have been developed. However, traditional clustering algorithms become computationally expensive when the data set to be clustered is large. There are three different ways in which the data set can be large: (1) there can be a large number of elements in the data set, (2) each element can have many features, and (3) there can be many clusters to discover. Recent advances in clustering algorithms have addressed these efficiency issues, but only partially. For example, KD-trees [15] provide for efficient EM-style clustering of many elements, but require that the dimensionality of each element be small. Another algorithm [3] efficiently performs K-means clustering by finding good initial starting points, but is not efficient when the number of clusters is large. There has been almost no work on algorithms that work efficiently when the data set is large in all three senses at once—when there are millions of elements, many thousands of features, and many thousands of clusters. This paper introduces a technique for clustering that is effi- cient when the problem is large in all of these three ways at once. The key idea is to perform clustering in two stages, first a rough and quick stage that divides the data into overlapping subsets we call “canopies,” then a more rigorous final stage in which expensive distance measurements are only made among points that occur in a common canopy. This differs from previous clustering methods in that it uses two different distance metrics for the two stages, and forms overlapping regions. The first stage can make use of extremely inexpensive methods for finding data elements near the center of a region. Many proximity measurement methods, such as the inverted index commonly used in information retrieval systems, are very efficient in high dimensions and can find elements near the query by examining only a small fraction of a data set
Variants of the inverted index can also work for real-valued We divide the clustering process into two stages.In the first data. stage we use the cheap distance measure in order to cre- ate some number of overlapping subsets,called "canopies." Once the canopies are built using the approximate distance A canopy is simply a subset of the elements (i.e.data measure,the second stage completes the clustering by run- points or items)that,according to the approximate simi- ning a standard clustering algorithm using a rigorous,and larity measure,are within some distance threshold from a thus more expensive distance metric.However,significant central point.Significantly,an element may appear under computation is saved by eliminating all of the distance com- more than one canopy,and every element must appear in at parisons among points that do not fall within a common least one canopy.Canopies are created with the intention canopy.Unlike hard-partitioning schemes such as "block- that points not appearing in any common canopy are far ing"13 and KD-trees,the overall algorithm is tolerant to enough apart that they could not possibly be in the same inaccuracies in the approximate distance measure used to cluster.Since the distance measure used to create canopies create the canopies because the canopies may overlap with is approximate,there may not be a guarantee of this prop- each other. erty,but by allowing canopies to overlap with each other,by choosing a large enough distance threshold,and by under- From a theoretical standpoint,if we can guarantee certain standing the properties of the approximate distance mea- properties of the inexpensive distance metric,we can show sure,we can have a guarantee in some cases. that the original,accurate clustering solution can still be recovered with the canopies approach.In other words,if The circles with solid outlines in Figure 1 show an example the inexpensive clustering does not exclude the solution for of overlapping canopies that cover a data set.The method the expensive clustering,then we will not lose any clustering by which canopies such as these may be created is described accuracy.In practice,we have actually found small accuracy in the next subsection. increases due to the combined usage of two distance metrics. In the second stage,we execute some traditional cluster- Clustering based on canopies can be applied to many differ- ing algorithm,such as Greedy Agglomerative Clustering or ent underlying clustering algorithms,including Greedy Ag- K-means using the accurate distance measure,but with the glomerative Clustering,K-means,and Expectation-Maxim- restriction that we do not calculate the distance between two ization. points that never appear in the same canopy,i.e.we assume their distance to be infinite.For example,if all items are This paper presents experimental results in which we apply trivially placed into a single canopy,then the second round the canopies method with Greedy Agglomerative Clustering of clustering degenerates to traditional,unconstrained clus- to the problem of clustering bibliographic citations from the tering with an expensive distance metric.If,however,the reference sections of computer science research papers.The canopies are not too large and do not overlap too much. Cora research paper search engine [11]has gathered over a then a large number of expensive distance measurements million such references referring to about a hundred thou- will be avoided,and the amount of computation required sand unique papers;each reference is a text segment exist- for clustering will be greatly reduced.Furthermore,if the ing in a vocabulary of about hundred thousand dimensions. constraints on the clustering imposed by the canopies still in- Multiple citations to the same paper differ in many ways, clude the traditional clustering solution among the possibil- particularly in the way author.editor and journal names are ities,then the canopies procedure may not lose any cluster- abbreviated,formatted and ordered.Our goal is to cluster ing accuracy,while still increasing computational efficiency these citations into sets that each refer to the same article significantly. Measuring accuracy on a hand-clustered subset consisting of about 2000 references,we find that the canopies approach We can state more formally the conditions under which the speeds the clustering by an order of magnitude while also canopies procedure will perfectly preserve the results of tra- providing a modest improvement in clustering accuracy.On ditional clustering.If the underlying traditional clusterer is the full data set we expect the speedup to be five orders of K-means,Expectation-Maximization or Greedy Agglomer- magnitude. ative Clustering in which distance to a cluster is measured to the centroid of the cluster,then clustering accuracy will 2. EFFICIENT CLUSTERING be preserved exactly when: WITH CANOPIES The key idea of the canopy algorithm is that one can greatly For every traditional cluster,there exists a canopy reduce the number of distance computations required for such that all elements of the cluster are in the clustering by first cheaply partitioning the data into over- canopy. lapping subsets,and then only measuring distances among pairs of data points that belong to a common subset. If instead we perform Greedy Agglomerative Clustering clus- The canopies technique thus uses two different sources of tering in which distance to a cluster is measure to the closest information to cluster items:a cheap and approximate sim- point in the cluster,then clustering accuracy will be pre- served exactly when: ilarity measure (e.g.,for household address,the proportion of words in common between two address)and a more ex- pensive and accurate similarity measure(e.g.,detailed field- For every cluster,there exists a set of canopies by-field string edit distance measured with tuned transfor- such that the elements of the cluster "connect" mation costs and dynamic programming). the canopies
Variants of the inverted index can also work for real-valued data. Once the canopies are built using the approximate distance measure, the second stage completes the clustering by running a standard clustering algorithm using a rigorous, and thus more expensive distance metric. However, significant computation is saved by eliminating all of the distance comparisons among points that do not fall within a common canopy. Unlike hard-partitioning schemes such as “blocking” [13] and KD-trees, the overall algorithm is tolerant to inaccuracies in the approximate distance measure used to create the canopies because the canopies may overlap with each other. From a theoretical standpoint, if we can guarantee certain properties of the inexpensive distance metric, we can show that the original, accurate clustering solution can still be recovered with the canopies approach. In other words, if the inexpensive clustering does not exclude the solution for the expensive clustering, then we will not lose any clustering accuracy. In practice, we have actually found small accuracy increases due to the combined usage of two distance metrics. Clustering based on canopies can be applied to many different underlying clustering algorithms, including Greedy Agglomerative Clustering, K-means, and Expectation-Maximization. This paper presents experimental results in which we apply the canopies method with Greedy Agglomerative Clustering to the problem of clustering bibliographic citations from the reference sections of computer science research papers. The Cora research paper search engine [11] has gathered over a million such references referring to about a hundred thousand unique papers; each reference is a text segment existing in a vocabulary of about hundred thousand dimensions. Multiple citations to the same paper differ in many ways, particularly in the way author, editor and journal names are abbreviated, formatted and ordered. Our goal is to cluster these citations into sets that each refer to the same article. Measuring accuracy on a hand-clustered subset consisting of about 2000 references, we find that the canopies approach speeds the clustering by an order of magnitude while also providing a modest improvement in clustering accuracy. On the full data set we expect the speedup to be five orders of magnitude. 2. EFFICIENT CLUSTERING WITH CANOPIES The key idea of the canopy algorithm is that one can greatly reduce the number of distance computations required for clustering by first cheaply partitioning the data into overlapping subsets, and then only measuring distances among pairs of data points that belong to a common subset. The canopies technique thus uses two different sources of information to cluster items: a cheap and approximate similarity measure (e.g., for household address, the proportion of words in common between two address) and a more expensive and accurate similarity measure (e.g., detailed fieldby-field string edit distance measured with tuned transformation costs and dynamic programming). We divide the clustering process into two stages. In the first stage we use the cheap distance measure in order to create some number of overlapping subsets, called “canopies.” A canopy is simply a subset of the elements (i.e. data points or items) that, according to the approximate similarity measure, are within some distance threshold from a central point. Significantly, an element may appear under more than one canopy, and every element must appear in at least one canopy. Canopies are created with the intention that points not appearing in any common canopy are far enough apart that they could not possibly be in the same cluster. Since the distance measure used to create canopies is approximate, there may not be a guarantee of this property, but by allowing canopies to overlap with each other, by choosing a large enough distance threshold, and by understanding the properties of the approximate distance measure, we can have a guarantee in some cases. The circles with solid outlines in Figure 1 show an example of overlapping canopies that cover a data set. The method by which canopies such as these may be created is described in the next subsection. In the second stage, we execute some traditional clustering algorithm, such as Greedy Agglomerative Clustering or K-means using the accurate distance measure, but with the restriction that we do not calculate the distance between two points that never appear in the same canopy, i.e. we assume their distance to be infinite. For example, if all items are trivially placed into a single canopy, then the second round of clustering degenerates to traditional, unconstrained clustering with an expensive distance metric. If, however, the canopies are not too large and do not overlap too much, then a large number of expensive distance measurements will be avoided, and the amount of computation required for clustering will be greatly reduced. Furthermore, if the constraints on the clustering imposed by the canopies still include the traditional clustering solution among the possibilities, then the canopies procedure may not lose any clustering accuracy, while still increasing computational efficiency significantly. We can state more formally the conditions under which the canopies procedure will perfectly preserve the results of traditional clustering. If the underlying traditional clusterer is K-means, Expectation-Maximization or Greedy Agglomerative Clustering in which distance to a cluster is measured to the centroid of the cluster, then clustering accuracy will be preserved exactly when: For every traditional cluster, there exists a canopy such that all elements of the cluster are in the canopy. If instead we perform Greedy Agglomerative Clustering clustering in which distance to a cluster is measure to the closest point in the cluster, then clustering accuracy will be preserved exactly when: For every cluster, there exists a set of canopies such that the elements of the cluster “connect” the canopies
tations might be clustered with a cheap similarity metric which only looks at the last names of the authors and the year of publication,even though the whole text of the ref. erence and the article are available. At other times,especially when the individual features are noisy,one may still want to use all of the many features of an item.The following section describes a distance metric 00 and method of creating canopies that often provides good performance in such cases.For concreteness,we consider the case in which documents are the items and words in the document are the features:the method is also broadly applicable to other problems with similar structure. 2.1.1 A Cheap Distance Metric All the very fast distance metrics for text used by search engines are based on the inverted index.An inverted index is a sparse matrix representation in which,for each word, we can directly access the list of documents containing that word.When we want to find all documents close to a given Figure 1:An example of four data clusters and the query,we need not explicitly measure the distance to all canopies that cover them.Points belonging to the documents in the collection,but need only examine the list same cluster are colored in the same shade of gray. of documents associated with each word in the query.The The canopies were created by the method outlined great majority of the documents,which have no words in in section 2.1.Point A was selected at random common with the query,need never be considered.Thus we and forms a canopy consisting of all points within can use an inverted index to efficiently calculate a distance the outer (solid)threshold.Points inside the in- metric that is based on the number of words two documents ner (dashed)threshold are excluded from being the have in common. center of,and forming new canopies.Canopies for B,C,D,and E were formed similarly to A.Note Given the above distance metric,one can create canopies that the optimality condition holds:for each clus- as follows.Start with a list of the data points in any order, ter there exists at least one canopy that completely and with two distance thresholds,T and T2.where T1>T2 contains that cluster.Note also that while there is (These thresholds can be set by the user,or,as in our ex- some overlap,there are many points excluded by periments,selected by cross-validation.)Pick a point off each canopy.Expensive distance measurements will the list and approximately measure its distance to all other only be made between pairs of points in the same points.(This is extremely cheap with an inverted index.) canopies,far fewer than all possible pairs in the data Put all points that are within distance threshold T into a set. canopy.Remove from the list all points that are within dis- tance threshold T2.Repeat until the list is empty.Figure 1 We have found in practice that it is not difficult to create in- shows some canopies that were created by this procedure. expensive distance measures that nearly always satisfy these “canopy properties." The idea of an inverted index can also be applied to high- dimensional real-valued data.Each dimension would be dis- cretized into some number of bins,each containing a bal- 2.1 Creating Canopies anced number of data points.Then each data point is ef- In most cases,a user of the canopies technique will be able to fectively turned into a"document"containing "words"con- leverage domain-specific features in order to design a cheap sisting of the unique bin identifiers for each dimension of the distance metric and efficiently create canopies using the met- point.If one is worried about edge effects at the boundaries ric.For example,if the data consist of a large number of between bins,we can include in a data point's document the hospital patient records including diagnoses,treatments and identifiers not only of the bin in which the point is found, payment histories,a cheap measure of similarity between the but also the bins on either side.Then,as above,a cheap patients might be "1"if they have a diagnosis in common distance measure can be based on thethe number of bin and "0"if they do not.In this case canopy creation is triv- identifiers the two points have in common.A similar proce- ial:people with a common diagnosis fall in the same canopy. dure has been used previously with success [8]. (More sophisticated versions could take account of the hier- archical structure of diagnoses such as ICD9 codes,or could also include secondary diagnoses.)Note,however,that peo- 2.2 Canopies with Greedy Agglomerative ple with multiple diagnoses will fall into multiple canopies Clustering and thus the canopies will overlap. Greedy Agglomerative Clustering(GAC)is a common clus- tering technique used to group items together based on sim- Often one or a small number of features suffice to build ilarity.In standard greedy agglomerative clustering,we are canopies,even if the items being clustered(e.g.the patients) given as input a set of items and a means of computing the have thousands of features.For example,bibliographic ci- distance (or similarity)between any of the pairs of items
A C E B D Figure 1: An example of four data clusters and the canopies that cover them. Points belonging to the same cluster are colored in the same shade of gray. The canopies were created by the method outlined in section 2.1. Point A was selected at random and forms a canopy consisting of all points within the outer (solid) threshold. Points inside the inner (dashed) threshold are excluded from being the center of, and forming new canopies. Canopies for B, C, D, and E were formed similarly to A. Note that the optimality condition holds: for each cluster there exists at least one canopy that completely contains that cluster. Note also that while there is some overlap, there are many points excluded by each canopy. Expensive distance measurements will only be made between pairs of points in the same canopies, far fewer than all possible pairs in the data set. We have found in practice that it is not difficult to create inexpensive distance measures that nearly always satisfy these “canopy properties.” 2.1 Creating Canopies In most cases, a user of the canopies technique will be able to leverage domain-specific features in order to design a cheap distance metric and efficiently create canopies using the metric. For example, if the data consist of a large number of hospital patient records including diagnoses, treatments and payment histories, a cheap measure of similarity between the patients might be “1” if they have a diagnosis in common and “0” if they do not. In this case canopy creation is trivial: people with a common diagnosis fall in the same canopy. (More sophisticated versions could take account of the hierarchical structure of diagnoses such as ICD9 codes, or could also include secondary diagnoses.) Note, however, that people with multiple diagnoses will fall into multiple canopies and thus the canopies will overlap. Often one or a small number of features suffice to build canopies, even if the items being clustered (e.g. the patients) have thousands of features. For example, bibliographic citations might be clustered with a cheap similarity metric which only looks at the last names of the authors and the year of publication, even though the whole text of the reference and the article are available. At other times, especially when the individual features are noisy, one may still want to use all of the many features of an item. The following section describes a distance metric and method of creating canopies that often provides good performance in such cases. For concreteness, we consider the case in which documents are the items and words in the document are the features; the method is also broadly applicable to other problems with similar structure. 2.1.1 A Cheap Distance Metric All the very fast distance metrics for text used by search engines are based on the inverted index. An inverted index is a sparse matrix representation in which, for each word, we can directly access the list of documents containing that word. When we want to find all documents close to a given query, we need not explicitly measure the distance to all documents in the collection, but need only examine the list of documents associated with each word in the query. The great majority of the documents, which have no words in common with the query, need never be considered. Thus we can use an inverted index to efficiently calculate a distance metric that is based on the number of words two documents have in common. Given the above distance metric, one can create canopies as follows. Start with a list of the data points in any order, and with two distance thresholds, T1 and T2, where T1 > T2. (These thresholds can be set by the user, or, as in our experiments, selected by cross-validation.) Pick a point off the list and approximately measure its distance to all other points. (This is extremely cheap with an inverted index.) Put all points that are within distance threshold T1 into a canopy. Remove from the list all points that are within distance threshold T2. Repeat until the list is empty. Figure 1 shows some canopies that were created by this procedure. The idea of an inverted index can also be applied to highdimensional real-valued data. Each dimension would be discretized into some number of bins, each containing a balanced number of data points. Then each data point is effectively turned into a “document” containing “words” consisting of the unique bin identifiers for each dimension of the point. If one is worried about edge effects at the boundaries between bins, we can include in a data point’s document the identifiers not only of the bin in which the point is found, but also the bins on either side. Then, as above, a cheap distance measure can be based on the the number of bin identifiers the two points have in common. A similar procedure has been used previously with success [8]. 2.2 Canopies with Greedy Agglomerative Clustering Greedy Agglomerative Clustering (GAC) is a common clustering technique used to group items together based on similarity. In standard greedy agglomerative clustering, we are given as input a set of items and a means of computing the distance (or similarity) between any of the pairs of items
The items are then combined into clusters by successively Prototypes moved by Number of prototypes combining the two closest clusters until one has reduced the only points in same constant,initialized canopy as prototype per canopy number of clusters to a target number. points in same constant.initialized canopy as prototype over whole data set We use a standard implementation of greedy agglomerative plus all others summarized clustering (GAC):Initialize each element to be a cluster of by canopy centers size one,compute the distances between all pairs of clusters, only points in same initialized per canopy, sort the distances from smallest to largest,and then repeat- canopy as prototype but created and destroyed edly merge the two clusters which are closest together until dynamically one is left with the desired number of clusters. Table 1:A summary of the three different methods In the standard GAC implementation,we need to apply the of combining canopies and EM. distance function O(n)times to calculate all pair-wise dis- tances between items.A canopies-based implementation of GAC can drastically reduce this required number of com- parisons. canopies Using a cheap,approximate distance measure overlapping Note that by this procedure prototypes may move across canopy boundaries when canopies overlap.Prototypes may canopies are created.When the canopies property holds, move to cover the data in the overlapping region,and then we are guaranteed that any two points that do not share a canopy will not fall into the same cluster.Thus,we do move entirely into another canopy in order to cover data there not need to calculate the distances between these pairs of points.Equivalently,we can initialize all distances to in- finity and only replace pairwise distances when two items The canopy-modified EM algorithm behaves very similarly to traditional EM,with the slight difference that points out- fall into the same canopy.As discussed in Section 2.4,this vastly reduces the required number of distance calculations side the canopy have no influence on points in the canopy, rather than a minute influence.If the canopy property holds, for greedy agglomerative clustering. and points in the same cluster fall in the same canopy,then 2.3 Canopies with Expectation-Maximization the canopy-modified EM will almost always converge to the same maximum in likelihood as the traditional EM.In fact. Clustering the difference in each iterative step (apart from the enor- One can also use the canopies idea to speed up prototype- mous computational savings of computing fewer terms)will based clustering methods like K-means and Expectation- be negligible since points outside the canopy will have ex- Maximization (EM).In general,neither K-means nor EM ponentially small influence. specify how many clusters to use.The canopies technique does not help this choice. K-means gives not just similar results for canopies and the traditional setting,but exactly identical clusters.In K- As before,the canopies reduce the number of expensive dis- means each data point is assigned to a single prototype. tance comparisons that need to be performed.We create the As long as the cheap and expensive distance metrics are canopies as before.We now describe three different meth- sufficiently similar that the nearest prototype (as calculated ods of using canopies with prototype-based clustering tech- by the expensive distance metric)is within the boundaries niques. of the canopies that contain that data point (as calculated with the cheap distance metric),then the same prototype Method 1:In this approach,prototypes (our estimates of will always win. the cluster centroids)are associated with the canopies that contain them,and the prototypes are only influenced by Method 2:We might like to have a version of the canopies data that are inside their associated canopies. method that,instead of forcing us to pick a number of pro- totypes for each canopy separately,allows us to select the After creating the canopies,we decide how many prototypes number of prototypes that will cover the whole data set. will be created for each canopy.This could be done,for Method 2 makes this possible.As before,prototypes are as- example,using the number of data points in a canopy and sociated with canopies,and are only influenced by individual AIC or BIC 1-where points that occur in more than one data points that are inside their canopies.However,in this canopy are counted fractionally.Then we place prototypes method,prototypes are influenced by all the other data too, into each canopy.This initial placement can be random,as but data points outside the prototype's associated canopy long as it is within the canopy in question,as determined are represented simply by the mean of their canopies.In by the inexpensive distance metric. this respect,Method 2 is identical to Omohundro's balltrees [17].Method 2 differs,however,in that the canopies are cre- Then,instead of calculating the distance from each proto- ated highly efficiently,using a cheap distance metric.Not type to every point (as is traditional,a O(nk)operation),the only is it more computationally efficient to compute the dis- E-step instead calculates the distance from each prototype tance between two points using the cheap distance metric, to a much smaller number of points.For each prototype,we but the use of inverted indices avoids completely computing find the canopies that contain it (using the cheap distance the distance to many points. metric),and only calculate distances (using the expensive distance metric)from that prototype to points within those Method 3:There are many existing traditional methods
The items are then combined into clusters by successively combining the two closest clusters until one has reduced the number of clusters to a target number. We use a standard implementation of greedy agglomerative clustering (GAC): Initialize each element to be a cluster of size one, compute the distances between all pairs of clusters, sort the distances from smallest to largest, and then repeatedly merge the two clusters which are closest together until one is left with the desired number of clusters. In the standard GAC implementation, we need to apply the distance function O(n2) times to calculate all pair-wise distances between items. A canopies-based implementation of GAC can drastically reduce this required number of comparisons. Using a cheap, approximate distance measure overlapping canopies are created. When the canopies property holds, we are guaranteed that any two points that do not share a canopy will not fall into the same cluster. Thus, we do not need to calculate the distances between these pairs of points. Equivalently, we can initialize all distances to in- finity and only replace pairwise distances when two items fall into the same canopy. As discussed in Section 2.4, this vastly reduces the required number of distance calculations for greedy agglomerative clustering. 2.3 Canopies with Expectation-Maximization Clustering One can also use the canopies idea to speed up prototypebased clustering methods like K-means and ExpectationMaximization (EM). In general, neither K-means nor EM specify how many clusters to use. The canopies technique does not help this choice. As before, the canopies reduce the number of expensive distance comparisons that need to be performed. We create the canopies as before. We now describe three different methods of using canopies with prototype-based clustering techniques. Method 1: In this approach, prototypes (our estimates of the cluster centroids) are associated with the canopies that contain them, and the prototypes are only influenced by data that are inside their associated canopies. After creating the canopies, we decide how many prototypes will be created for each canopy. This could be done, for example, using the number of data points in a canopy and AIC or BIC [1]—where points that occur in more than one canopy are counted fractionally. Then we place prototypes into each canopy. This initial placement can be random, as long as it is within the canopy in question, as determined by the inexpensive distance metric. Then, instead of calculating the distance from each prototype to every point (as is traditional, a O(nk) operation), the E-step instead calculates the distance from each prototype to a much smaller number of points. For each prototype, we find the canopies that contain it (using the cheap distance metric), and only calculate distances (using the expensive distance metric) from that prototype to points within those Prototypes moved by Number of prototypes 1 only points in same constant, initialized canopy as prototype per canopy 2 points in same constant, initialized canopy as prototype over whole data set plus all others summarized by canopy centers 3 only points in same initialized per canopy, canopy as prototype but created and destroyed dynamically Table 1: A summary of the three different methods of combining canopies and EM. canopies. Note that by this procedure prototypes may move across canopy boundaries when canopies overlap. Prototypes may move to cover the data in the overlapping region, and then move entirely into another canopy in order to cover data there. The canopy-modified EM algorithm behaves very similarly to traditional EM, with the slight difference that points outside the canopy have no influence on points in the canopy, rather than a minute influence. If the canopy property holds, and points in the same cluster fall in the same canopy, then the canopy-modified EM will almost always converge to the same maximum in likelihood as the traditional EM. In fact, the difference in each iterative step (apart from the enormous computational savings of computing fewer terms) will be negligible since points outside the canopy will have exponentially small influence. K-means gives not just similar results for canopies and the traditional setting, but exactly identical clusters. In Kmeans each data point is assigned to a single prototype. As long as the cheap and expensive distance metrics are sufficiently similar that the nearest prototype (as calculated by the expensive distance metric) is within the boundaries of the canopies that contain that data point (as calculated with the cheap distance metric), then the same prototype will always win. Method 2: We might like to have a version of the canopies method that, instead of forcing us to pick a number of prototypes for each canopy separately, allows us to select the number of prototypes that will cover the whole data set. Method 2 makes this possible. As before, prototypes are associated with canopies, and are only influenced by individual data points that are inside their canopies. However, in this method, prototypes are influenced by all the other data too, but data points outside the prototype’s associated canopy are represented simply by the mean of their canopies. In this respect, Method 2 is identical to Omohundro’s balltrees [17]. Method 2 differs, however, in that the canopies are created highly efficiently, using a cheap distance metric. Not only is it more computationally efficient to compute the distance between two points using the cheap distance metric, but the use of inverted indices avoids completely computing the distance to many points. Method 3: There are many existing traditional methods
for dynamically determining the number of prototypes (e.g. Fahlman,Scott Lebiere,Christian (1989).The cascade- [18]).Techniques for creating and destroying prototypes are correlation learning architecture. In Touretzky,D.,editor, particularly attractive when thinking about Method 1.Here Advances in Neural Information Processing Systems (volume 2), we have the simplicity of completely ignoring all data points (pp.524-532),San Mateo,CA.Morgan Kaufmann. outside a prototype's associated cluster,with the problem, Fahlman.S.E.and Lebiere,C.."The Cascade Correlation however,that we may have too many or too few prototypes Learning Architecture,"NIPS,Vol.2,pp.524-532,Morgan within the canopy.In Method 3,as in Method 1,prototypes Kaufmann,1990. are associated with canopies and only see points within their canopy.Here,however,we use techniques that create (and Fahlmann,S.E.and Lebiere,C.(1989). The cascade- possibly destroy)prototypes dynamically during clustering. correlation learning architecture.In Advances in Neural In- formation Processing Systems 2 (NIPS-2),Denver,Colorado. We avoid creating multiple prototypes to cover points that fall into more than one canopy by invoking a"conservation Figure 2:Three sample citations to the same paper. of mass"principle for points.In practice,this means that Note the different layout formats,and the mistakes the contribution of each point is divided among the canopies in spelling that make it difficult to identify these as as falling out naturally from the normalization in the E-step: citations to the same paper. as is traditional,membership to a prototype is determined by dividing the inverse distance to the prototype by the sum of inverse distances to all the prototypes to which its overlap factor f as data points do.Then,each cluster needs distance was measured,even if some of those prototypes fall to compare itself to the fn/c points in f different canopies. in different canopies. For all clusters,this is O(nkf2/c)per iteration,yielding the same complexity reduction as seen for GAC 2.4 Computational Complexity We can formally quantify the computational savings of the In the experimental results described in a following section. canopies technique.The technique has two components:a c is on the order of 1,000,so the savings are significant.In relatively fast step where the canopies are created,followed an industrial sized merge-purge problem,far more canopies by a relatively slow clustering process.If we create canopies would be used,and full pair-wise distance calculations would using an inverted index,we do not need to perform even the not at all be feasible. complete pair-wise distance comparisons.If we assume that each document has w words,and these words are evenly distributed across the vocabulary V,then we can compare 3. CLUSTERING TEXTUAL a document to n other documents in O(nw2/VI).This BIBLIOGRAPHIC REFERENCES canopy creation cost is typically insignificant in comparison In this section we provide empirical evidence that using to the more detailed clustering phase canopies for clustering can increase computational efficiency by an order of magnitude without losing any clustering ac Assume that we have n data points that originated from curacy.We demonstrate these results in the domain of bib- k clusters.For concreteness,first consider the Greedy Ag- liographic references. glomerative Clustering algorithm.Clustering without canop- ies requires calculating the distance between all pairs of The Cora web site (urwnw.cora.whizbang.com)provides a search points,an O(n)operation.This is the dominating com- interface to over 50,000 computer science research papers plexity term in GAC clustering.Now consider how this [11].As part of the site's functionality,we provide an inter- is reduced by using canopies.Assume that there are c face for traversing the citation graph.That is,for a given canopies and that each data point on average falls into f paper,we provide links to all other papers it references,and canopies.This factor f estimates the amount to which the links to all other papers that reference it,in the respective canopies overlap with each other.Now,in an optimistic bibliography section of each paper.To provide this interface scenario where each canopy is of equal size,there will be to the data,it is necessary to recognize when two citations roughly fn/c data points per canopy.When clustering with from different papers are referencing the same third paper, canopies we need only calculate the distances between pairs even though the text of the citations may differ.For exam- of points within the same canopy.This requires at most ple,one paper may abbreviate first author names,while the O(c(fn/c))=O(f-n/c)distance measurements.(This is second may include them.Some typical citations are shown probably an over-estimate,as the same points tend to fall in Figure 2.Note that different styles of reference format- into multiple canopies together,and we only need to cal- ting and abbreviation,as well as outright citation mistakes, culate their distances once.)This represents a reduction in make this a difficult task to automate.We pose this as a complexity of f2/c.In general,c will be much larger than f. clustering problem,where the task is to take all the cita- Given a typical case in which n =1,000,000,k 10,000. tions from all papers in the collection,and cluster them so c=1,000,and f is a small constant,the canopies technique that each cluster contains all and only citations to a sin- reduces computation by a factor of 1,000. gle paper.Since the Cora collection contains over a million bibliographic entries,it is necessary to perform this cluster- In the case of K-means or Expectation-Maximization,clus- ing efficiently.Using straightforward GAC would take more tering without canopies requires O(nk)distance compar- than one CPU year,assuming unlimited memory.If we isons per iteration of clustering (finding the distance be- estimate the total number of canopies and average canopy tween each data point and each cluster prototype).Con- membership from labeled dataset used below,the canopies sider the EM method 1,where each cluster belongs to one approach will provide a speedup of five orders of magnitude or more canopy.Assume that clusters have roughly the same reducing the clustering time to a couple hours
for dynamically determining the number of prototypes (e.g. [18]). Techniques for creating and destroying prototypes are particularly attractive when thinking about Method 1. Here we have the simplicity of completely ignoring all data points outside a prototype’s associated cluster, with the problem, however, that we may have too many or too few prototypes within the canopy. In Method 3, as in Method 1, prototypes are associated with canopies and only see points within their canopy. Here, however, we use techniques that create (and possibly destroy) prototypes dynamically during clustering. We avoid creating multiple prototypes to cover points that fall into more than one canopy by invoking a “conservation of mass” principle for points. In practice, this means that the contribution of each point is divided among the canopies, as falling out naturally from the normalization in the E-step: as is traditional, membership to a prototype is determined by dividing the inverse distance to the prototype by the sum of inverse distances to all the prototypes to which its distance was measured, even if some of those prototypes fall in different canopies. 2.4 Computational Complexity We can formally quantify the computational savings of the canopies technique. The technique has two components: a relatively fast step where the canopies are created, followed by a relatively slow clustering process. If we create canopies using an inverted index, we do not need to perform even the complete pair-wise distance comparisons. If we assume that each document has w words, and these words are evenly distributed across the vocabulary V , then we can compare a document to n other documents in O(nw2/|V |). This canopy creation cost is typically insignificant in comparison to the more detailed clustering phase. Assume that we have n data points that originated from k clusters. For concreteness, first consider the Greedy Agglomerative Clustering algorithm. Clustering without canopies requires calculating the distance between all pairs of points, an O(n2) operation. This is the dominating complexity term in GAC clustering. Now consider how this is reduced by using canopies. Assume that there are c canopies and that each data point on average falls into f canopies. This factor f estimates the amount to which the canopies overlap with each other. Now, in an optimistic scenario where each canopy is of equal size, there will be roughly fn/c data points per canopy. When clustering with canopies we need only calculate the distances between pairs of points within the same canopy. This requires at most O(c(fn/c) 2) = O(f 2n2/c) distance measurements. (This is probably an over-estimate, as the same points tend to fall into multiple canopies together, and we only need to calculate their distances once.) This represents a reduction in complexity of f 2/c. In general, c will be much larger than f. Given a typical case in which n = 1, 000, 000, k = 10, 000, c = 1, 000, and f is a small constant, the canopies technique reduces computation by a factor of 1, 000. In the case of K-means or Expectation-Maximization, clustering without canopies requires O(nk) distance comparisons per iteration of clustering (finding the distance between each data point and each cluster prototype). Consider the EM method 1, where each cluster belongs to one or more canopy. Assume that clusters have roughly the same Fahlman, Scott & Lebiere, Christian (1989). The cascadecorrelation learning architecture. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2), (pp. 524-532), San Mateo, CA. Morgan Kaufmann. Fahlman, S.E. and Lebiere, C., “The Cascade Correlation Learning Architecture,” NIPS, Vol. 2, pp. 524-532, Morgan Kaufmann, 1990. Fahlmann, S. E. and Lebiere, C. (1989). The cascadecorrelation learning architecture. In Advances in Neural Information Processing Systems 2 (NIPS-2), Denver, Colorado. Figure 2: Three sample citations to the same paper. Note the different layout formats, and the mistakes in spelling that make it difficult to identify these as citations to the same paper. overlap factor f as data points do. Then, each cluster needs to compare itself to the fn/c points in f different canopies. For all clusters, this is O(nkf 2/c) per iteration, yielding the same complexity reduction as seen for GAC. In the experimental results described in a following section, c is on the order of 1,000, so the savings are significant. In an industrial sized merge-purge problem, far more canopies would be used, and full pair-wise distance calculations would not at all be feasible. 3. CLUSTERING TEXTUAL BIBLIOGRAPHIC REFERENCES In this section we provide empirical evidence that using canopies for clustering can increase computational efficiency by an order of magnitude without losing any clustering accuracy. We demonstrate these results in the domain of bibliographic references. The Cora web site (www.cora.whizbang.com) provides a search interface to over 50,000 computer science research papers [11]. As part of the site’s functionality, we provide an interface for traversing the citation graph. That is, for a given paper, we provide links to all other papers it references, and links to all other papers that reference it, in the respective bibliography section of each paper. To provide this interface to the data, it is necessary to recognize when two citations from different papers are referencing the same third paper, even though the text of the citations may differ. For example, one paper may abbreviate first author names, while the second may include them. Some typical citations are shown in Figure 2. Note that different styles of reference formatting and abbreviation, as well as outright citation mistakes, make this a difficult task to automate. We pose this as a clustering problem, where the task is to take all the citations from all papers in the collection, and cluster them so that each cluster contains all and only citations to a single paper. Since the Cora collection contains over a million bibliographic entries, it is necessary to perform this clustering efficiently. Using straightforward GAC would take more than one CPU year, assuming unlimited memory. If we estimate the total number of canopies and average canopy membership from labeled dataset used below, the canopies approach will provide a speedup of five orders of magnitude, reducing the clustering time to a couple hours