Exploiting Hierarchical Domain Structure to Compute Similarity PRASANNA GANESAN. HECTOR GARCIA-MOLINA and JENNIFER WIDOM Stanford University The notion of similarity between objects finds use in many contexts, for example, in search engines, collaborative filtering, and clustering Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We extend imilar to provide appropriate results in the presence of multisets(also handled unsatisfactorily by traditional measures), for example, to correctly compute the similarity between customers who buy several instances of the same product(say milk), or who buy several products in the same category(say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and report on a user study that evaluated how Categories and Subject Descriptors: H 3.1 [Information Storage and Retrieval: Content Analysis and Indexing General Terms: Algorithms Additional Key Words and Phrases: Similarity measures, hierarchy, collaborative filtering, data INTRODUCTION The notion of similarity is used in many contexts to identify objects having common"characteristics. For instance, a search engine finds documents that are similar to a query or to other documents. a clustering algorithm groups together gene sequences that have similar features. a collaborative filtering system looks for people sharing common interests [Goldberg et al. 1992] In many cases, the objects being compared are treated as sets or bags of elements drawn from a fat domain. Thus a document is a bag of words, a cus- tomer is a bag of purchases, and so on. The similarity between two objects is often determined by their bag intersection the more elements two customers This material is based upon work supported by the National Science Foundation under Grants IIS- 0085896, IIS-9817799, and IIS-9811947 Prasanna Ganesan is supported by a Stanford Graduate Fellowship. author prasannag@cs.stanford.edu. Permission to make digital/hard copy of all or part of this material is granted without fee for per onal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and otice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. o2003ACM10468188/03/0100-00648500 ACM Transactions on Information Systems, Vol 21, No 1, January 2003, Pages 64-93
Exploiting Hierarchical Domain Structure to Compute Similarity PRASANNA GANESAN, HECTOR GARCIA-MOLINA, and JENNIFER WIDOM Stanford University The notion of similarity between objects finds use in many contexts, for example, in search engines, collaborative filtering, and clustering. Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We extend our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), for example, to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and report on a user study that evaluated how well our measures match human intuition. Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing General Terms: Algorithms Additional Key Words and Phrases: Similarity measures, hierarchy, collaborative filtering, data mining 1. INTRODUCTION The notion of similarity is used in many contexts to identify objects having common “characteristics.” For instance, a search engine finds documents that are similar to a query or to other documents. A clustering algorithm groups together gene sequences that have similar features. A collaborative filtering system looks for people sharing common interests [Goldberg et al. 1992]. In many cases, the objects being compared are treated as sets or bags of elements drawn from a flat domain. Thus a document is a bag of words, a customer is a bag of purchases, and so on. The similarity between two objects is often determined by their bag intersection: the more elements two customers This material is based upon work supported by the National Science Foundation under Grants IIS- 0085896, IIS-9817799, and IIS-9811947. Prasanna Ganesan is supported by a Stanford Graduate Fellowship. Authors’ address: 353 Serra Mall, #432, Stanford University, Stanford, CA 94305-9025; email: prasannag@cs.stanford.edu. Permission to make digital/hard copy of all or part of this material is granted without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. °C 2003 ACM 1046-8188/03/0100-0064 $5.00 ACM Transactions on Information Systems, Vol. 21, No. 1, January 2003, Pages 64–93
Exploiting Hierarchical Domain Structure. 65 A:(b1 b2) B:(b3b4) D:(b1m1) m1 m2 purchase in common, the more similar they are considered. In other cases, the objects are treated as vectors in an n-dimensional space where n is the cardi nality of the element domain. The cosine of the angle between two objects is then used as a measure of their similarity [McGill 1983]. We propose enhancing these object models by adding a hierarchy describing the relationships among domain elements. The"semantic knowledge"in the hierarchy helps us iden tify objects sharing common characteristics, leading to improved measures of larity To illustrate, let us look at a small three-level hierarchy on the music CD domain, as shown in Figure 1. Say customer a buys Beatles CDs b1 and b2, B buys Beatles CDs b3 and b4, and c buys Stones CDs si and s2. If we were to use a similarity measure based on set intersections, we would find that the similarity between any two of A, B, and C is zero. The Vector-Space Model would represent A, B, and C as three mutually perpendicular vectors and, therefore, the cosine similarity between any two of them is again zero However, looking at the hierarchy of Figure l, we see that a and b are rather similar since both of them like the beatles whereas a and C are somewhat less similar since, although both listen to rock music, they prefer different bands The similarity between two CDs is reflected in how far apart they are in the hierarchy. In this article, we develop measures that take this hierarchy into account, leading to similarity scores that are closer to human intuition than previous measures. There are several interesting challenges that arise in using a hierarchy for milarity computations In our CD example, for instance customers may pur- chase CDs from different portions of the hierarchy: for example, customer D in Figure 1 purchases both Beatles as well as Mozart CDs. In such a case it is not as obvious how similar d is to a or b or to other customers with mixed purchases. As we show, there are multiple ways in which the hierarchy can be used for similarity computations, and in this article we contrast different approaches Another challenge is handling multiple occurrences(multisets)at different levels of the hierarchy. For example, say we had another user E who buys a lot of Beatles CDs as well as a mozart cd m,(see Figure 1). The question is: Which of D or E is more similar to A? Customer D bought Beatles CD b1, just ike A On the other hand, customer e did not buy that CD, but did buy a lot of other Beatles CDs. The traditional cosine-similarity measure favors multiple occurrences of an element. That is, if a query word occurs a hundred times ACM Transactions on Information Systems, Vol 21, No. 1, January 2003
Exploiting Hierarchical Domain Structure • 65 Fig. 1. Music CD hierarchy. purchase in common, the more similar they are considered. In other cases, the objects are treated as vectors in an n-dimensional space, where n is the cardinality of the element domain. The cosine of the angle between two objects is then used as a measure of their similarity [McGill 1983]. We propose enhancing these object models by adding a hierarchy describing the relationships among domain elements. The “semantic knowledge” in the hierarchy helps us identify objects sharing common characteristics, leading to improved measures of similarity. To illustrate, let us look at a small three-level hierarchy on the music CD domain, as shown in Figure 1. Say customer A buys Beatles CDs b1 and b2, B buys Beatles CDs b3 and b4, and C buys Stones CDs s1 and s2. If we were to use a similarity measure based on set intersections, we would find that the similarity between any two of A, B, and C is zero. The Vector-Space Model would represent A, B, and C as three mutually perpendicular vectors and, therefore, the cosine similarity between any two of them is again zero. However, looking at the hierarchy of Figure 1, we see that A and B are rather similar since both of them like the Beatles, whereas A and C are somewhat less similar since, although both listen to rock music, they prefer different bands. The similarity between two CDs is reflected in how far apart they are in the hierarchy. In this article, we develop measures that take this hierarchy into account, leading to similarity scores that are closer to human intuition than previous measures. There are several interesting challenges that arise in using a hierarchy for similarity computations. In our CD example, for instance, customers may purchase CDs from different portions of the hierarchy: for example, customer D in Figure 1 purchases both Beatles as well as Mozart CDs. In such a case it is not as obvious how similar D is to A or B or to other customers with mixed purchases. As we show, there are multiple ways in which the hierarchy can be used for similarity computations, and in this article we contrast different approaches. Another challenge is handling multiple occurrences (multisets) at different levels of the hierarchy. For example, say we had another user E who buys a lot of Beatles CDs as well as a Mozart CD m1 (see Figure 1). The question is: Which of D or E is more similar to A? Customer D bought Beatles CD b1, just like A. On the other hand, customer E did not buy that CD, but did buy a lot of other Beatles CDs. The traditional cosine-similarity measure favors multiple occurrences of an element. That is, if a query word occurs a hundred times ACM Transactions on Information Systems, Vol. 21, No. 1, January 2003
P Ganesan et al in a document, the document is more similar to the query than one in which the query word appears only once. If we use this approach in our example, we would say that e is more similar to A than D is, because e buys 14 Beatles CDs, whereas D buys just one Unfortunately, it is not clear that this conclusion is the correct one. E probably a serious Beatles fan, whereas A and D appear more balanced and similar to each other. so it would also be reasonable to conclude that d is more similar to a than E is. Thus measures like cosine-similarity, although suit- able for query-document similarity, do not provide the right semantics for in- terobject similarity in many other situations. This problem has, in fact, been observed earlier even in the context of inter-document similarity [Shivakumar and Garcia-Molina 1995]. In this article, we study various semantics for mul tiple occurrences, and provide measures that map to these semantics There has been a lot of prior work related to similarity in various domains and, naturally, we rely on some of it for our own work In Sections 2 and 6 we discuss prior work in detail, but here we make some brief observations In our example we have seen that with traditional measures customers A B, and C have zero similarity to each other because their purchases do not in- tersect. When objects or collections are sparse(i.e, have few elements relative to the domain), intersections tend to be empty and traditional measures have difficulty identifying similar objects. There have been many attempts to over- come this sparsity problem through techniques such as dimension reduction [Sarwar et al. 2000], filtering agents [Sarwar et al. 1998, item-based filtering [ Sarwar et al. 2001, and the use of personal agents [good et al. 1999]. We be- lieve that using a richer data model (i. e, our hierarchy addresses this problem in a simple and effective way Hierarchies are often used to encode knowledge and have been used in a variety of ways for text classification, for mining association rules, for interac tive information retrieval, and various other tasks where similarity plays a role Feldman and Dagan 1995; Han and Fu 1995 Scott and Matwin 1998; Srikant and Agrawal 1995]. In this article, we focus on the case where attributes are confined to being leaves of the hierarchy. We believe that our work may be extended to deal with polyhierarchies, where the hierarchy is not required to be a strict tree to permit attributes at multiple levels in the hierarchy, and to deal with generalizations of hierarchies, where similarity relationships be- tween attributes are used to induce similarity relationships between objects consisting of sets of these attributes. We do not cover these extensions in this article. Our goal here is to rigorously study how a domain hierarchy can be used to compute similarity between sets of leaves and to explore, compare, and evaluate the various options available different applications require different notions of similarity and we expect that an analysis of the specific applica- tion would determine which of our proposed similarity measures suits it the re are many domains in which hierarchies exist and can be exploited as we suggest here For example there is an inherent hierarchical structure to the URls of pages within a single site. This structure can be exploited in clus- tering user sessions identified from a Web log Joshi and Krishnapuram 2000 ACM Transactions on Information Systems, Vol 21, No 1, January 2003
66 • P. Ganesan et al. in a document, the document is more similar to the query than one in which the query word appears only once. If we use this approach in our example, we would say that E is more similar to A than D is, because E buys 14 Beatles CDs, whereas D buys just one. Unfortunately, it is not clear that this conclusion is the correct one. E is probably a serious Beatles fan, whereas A and D appear more balanced and similar to each other, so it would also be reasonable to conclude that D is more similar to A than E is. Thus measures like cosine-similarity, although suitable for query-document similarity, do not provide the right semantics for interobject similarity in many other situations. This problem has, in fact, been observed earlier even in the context of inter-document similarity [Shivakumar and Garcia-Molina 1995]. In this article, we study various semantics for multiple occurrences, and provide measures that map to these semantics. There has been a lot of prior work related to similarity in various domains and, naturally, we rely on some of it for our own work. In Sections 2 and 6 we discuss prior work in detail, but here we make some brief observations. In our example we have seen that with traditional measures customers A, B, and C have zero similarity to each other because their purchases do not intersect. When objects or collections are sparse (i.e., have few elements relative to the domain), intersections tend to be empty and traditional measures have difficulty identifying similar objects. There have been many attempts to overcome this sparsity problem through techniques such as dimension reduction [Sarwar et al. 2000], filtering agents [Sarwar et al. 1998], item-based filtering [Sarwar et al. 2001], and the use of personal agents [Good et al. 1999]. We believe that using a richer data model (i.e., our hierarchy) addresses this problem in a simple and effective way. Hierarchies are often used to encode knowledge, and have been used in a variety of ways for text classification, for mining association rules, for interactive information retrieval, and various other tasks where similarity plays a role [Feldman and Dagan 1995; Han and Fu 1995; Scott and Matwin 1998; Srikant and Agrawal 1995]. In this article, we focus on the case where attributes are confined to being leaves of the hierarchy. We believe that our work may be extended to deal with polyhierarchies, where the hierarchy is not required to be a strict tree, to permit attributes at multiple levels in the hierarchy, and to deal with generalizations of hierarchies, where similarity relationships between attributes are used to induce similarity relationships between objects consisting of sets of these attributes. We do not cover these extensions in this article. Our goal here is to rigorously study how a domain hierarchy can be used to compute similarity between sets of leaves and to explore, compare, and evaluate the various options available. Different applications require different notions of similarity and we expect that an analysis of the specific application would determine which of our proposed similarity measures suits it the best. There are many domains in which hierarchies exist and can be exploited as we suggest here. For example, there is an inherent hierarchical structure to the URLs of pages within a single site. This structure can be exploited in clustering user sessions identified from a Web log [Joshi and Krishnapuram 2000; ACM Transactions on Information Systems, Vol. 21, No. 1, January 2003
Exploiting Hierarchical Domain Structure laccard's Coeff Dice,s Coeff. Pearson Correlation Coeff Earth-Movers Distance MAC Cosine Similarity Others First Generation Fig. 2. Evolution of similarity measures. Nasraoui et al. 1999]. Although Nasraoui et al. [1999] do attempt to exploit this hierarchical structure in computing similarity, the proposed similarity mea sure turns out to be highly unintuitive and can provide arbitrarily low values of similarity even between identical sessions. We would expect the use of our similarity measures to provide a dramatic improvement in the quality of the To name a few other examples, the Open Directory [opd] is a hierarchy on a subset of pages on the Web Thus, we can compute the similarity of Web users, for instance based on a trace of the Web pages they visit. In the music domain, songs can be organized into a hierarchy by genre, band, album, and so en be used, say, to find users with simila recommend new songs to them. In the document domain, we can use existing hierarchies such as WordNet [Miller et al. 1990] to compute document similar- ty. In all of these cases, our general-purpose extended similarity measures can be used to improve functionality. In summary, the main contributions of this article are the following - We introduce similarity measures that can exploit hierarchical domain struc ture, leading to similarity scores that are more intuitive than the ones gen- erated by traditional similarity measures. Ve extend these measures to deal with multiple occurrences of elements(and of ancestors in the hierarchy), such as those exhibited in a and e in Figure 1 in a semantically meaningful fashion We analyze the differences between our various measures, compare them empirically, and show that all of them are very different from measures that dont exploit the domain hierarchy. b.e report the findings of a user study" ate the quality of the various Figure 2 shows the evolution of the measures that we discuss, and erves as a roadma ap for the rest of the article. Section 2 describes traditional measures of similarity. Section 3 introduces our First-Generation measures, which exploit a hierarchical domain structure and are obtained as natural generalizations of the traditional measures. Section 4 introduces the multiple-occurrence prob- lem, and evolves the measures into our second- Generation measures. section 5 ACM Transactions on Information Systems, Vol 21, No 1, January 2003
Exploiting Hierarchical Domain Structure • 67 Fig. 2. Evolution of similarity measures. Nasraoui et al. 1999]. Although Nasraoui et al. [1999] do attempt to exploit this hierarchical structure in computing similarity, the proposed similarity measure turns out to be highly unintuitive and can provide arbitrarily low values of similarity even between identical sessions. We would expect the use of our similarity measures to provide a dramatic improvement in the quality of the results. To name a few other examples, the Open Directory [OPD ] is a hierarchy on a subset of pages on the Web. Thus, we can compute the similarity of Web users, for instance, based on a trace of the Web pages they visit. In the music domain, songs can be organized into a hierarchy by genre, band, album, and so on. This hierarchy can then be used, say, to find users with similar tastes, and recommend new songs to them. In the document domain, we can use existing hierarchies such as WordNet [Miller et al. 1990] to compute document similarity. In all of these cases, our general-purpose extended similarity measures can be used to improve functionality. In summary, the main contributions of this article are the following. —We introduce similarity measures that can exploit hierarchical domain structure, leading to similarity scores that are more intuitive than the ones generated by traditional similarity measures. —We extend these measures to deal with multiple occurrences of elements (and of ancestors in the hierarchy), such as those exhibited in A and E in Figure 1, in a semantically meaningful fashion. —We analyze the differences between our various measures, compare them empirically, and show that all of them are very different from measures that don’t exploit the domain hierarchy. —We report the findings of a user study to evaluate the quality of the various measures. Figure 2 shows the evolution of the measures that we discuss, and serves as a roadmap for the rest of the article. Section 2 describes traditional measures of similarity. Section 3 introduces our First-Generation measures, which exploit a hierarchical domain structure and are obtained as natural generalizations of the traditional measures. Section 4 introduces the multiple-occurrence problem, and evolves the measures into our Second-Generation measures. Section 5 ACM Transactions on Information Systems, Vol. 21, No. 1, January 2003
P Ganesan et al is devoted to a comparison of these measures and their evaluation. Section 6 describes related work 2. TRADITIONAL SIMILARITY MEASURES Given two objects, or collections of elements C1 and C2, our goal is to compute their similarity sim(C1, C2), a real number in [0, 1. The similarity should tend to 1 as C and c2 have more and more common"characteristics "There is no universal notion of which"characteristics"count and hence the notion of sim- ilarity is necessarily subjective. Here we define several notions of similarity, and discuss how intuitive they are 2.1 The Set/Bag Model In many applications, the simplest approach to modeling an object is to treat it as a set, or a bag, ofelements, which we term a collection. The similarity between wo collections is then computed on the basis of their set or bag intersection. There are many different measures in use, which differ primarily in the way hey normalize this intersection value Ivan Rijsbergen 1979. We describe two of them here Let X and y be two collections. Jaccard's Coefficient, sim Jace(X, Y), is defined to be X∩Y Thus, in Figure 1, simJace (A, D)=1/(2+2-1)=3. Dice's Coefficient, which we denote sim pice(X, y), is defined to be 2*x∩Y simplice(X, Y) Once again referring to Figure 1, simpice(A, D)=(2*1)/(2+2)=].Other such measures include the Inclusion Measure, the Overlap Coefficient, and the Extended Jaccard Coefficient [Strehl et al. 2000; van Rijsbergen 1979 2.2 The Vector-Space Model The Vector-Space Model is a popular model in the information retrieval do- main [ MeGill 1983]. In this model, each element in the domain is taken to be a dimension in a vector space. a collection is represented by a vector, with com- ponents along exactly those dimensions corresponding to the elements in the ollection. One advantage of this model is that we can now weight the compo- nents of the vectors, by using schemes such as TF-IDF [Salton and buckley 1988. The Cosine-Similarity Measure(CSM) defines the similarity between two vectors to be the cosine of the angle between them. This measure has proven to be very popular for query-document and document-document simi- larity in text retrieval (Salton and Buckley 1988]. again referring to Figure 1 ACM Transactions on Information Systems, Vol 21, No 1, January 2003
68 • P. Ganesan et al. is devoted to a comparison of these measures and their evaluation. Section 6 describes related work. 2. TRADITIONAL SIMILARITY MEASURES Given two objects, or collections of elements C1 and C2, our goal is to compute their similarity sim(C1, C2), a real number in [0, 1]. The similarity should tend to 1 as C1 and C2 have more and more common “characteristics.” There is no universal notion of which “characteristics” count, and hence the notion of similarity is necessarily subjective. Here we define several notions of similarity, and discuss how intuitive they are. 2.1 The Set/Bag Model In many applications, the simplest approach to modeling an object is to treat it as a set, or a bag, of elements, which we term a collection. The similarity between two collections is then computed on the basis of their set or bag intersection. There are many different measures in use, which differ primarily in the way they normalize this intersection value [van Rijsbergen 1979]. We describe two of them here. Let X and Y be two collections. Jaccard’s Coefficient, simJacc(X , Y ), is defined to be: simJacc(X , Y ) = |X ∩ Y | |X ∪ Y | . Thus, in Figure 1, simJacc(A, D) = 1/(2 + 2 − 1) = 1 3 . Dice’s Coefficient, which we denote simDice(X , Y ), is defined to be: simDice(X , Y ) = 2 ∗ |X ∩ Y | |X |+|Y | . Once again referring to Figure 1, simDice(A, D) = (2 ∗ 1)/(2 + 2) = 1 2 . Other such measures include the Inclusion Measure, the Overlap Coefficient, and the Extended Jaccard Coefficient [Strehl et al. 2000; van Rijsbergen 1979]. 2.2 The Vector-Space Model The Vector-Space Model is a popular model in the information retrieval domain [McGill 1983]. In this model, each element in the domain is taken to be a dimension in a vector space. A collection is represented by a vector, with components along exactly those dimensions corresponding to the elements in the collection. One advantage of this model is that we can now weight the components of the vectors, by using schemes such as TF-IDF [Salton and Buckley 1988]. The Cosine-Similarity Measure (CSM) defines the similarity between two vectors to be the cosine of the angle between them. This measure has proven to be very popular for query-document and document-document similarity in text retrieval [Salton and Buckley 1988]. Again referring to Figure 1, ACM Transactions on Information Systems, Vol. 21, No. 1, January 2003