and using uniform weights of 1:

$$\mathrm{sim}_{\cos}(A, D) = \frac{\vec{A} \cdot \vec{D}}{|\vec{A}|\,|\vec{D}|} = \frac{\vec{b_1} \cdot \vec{b_1} + \vec{b_1} \cdot \vec{m_1} + \vec{b_2} \cdot \vec{b_1} + \vec{b_2} \cdot \vec{m_1}}{\sqrt{\vec{b_1} \cdot \vec{b_1} + \vec{b_2} \cdot \vec{b_2}}\;\sqrt{\vec{b_1} \cdot \vec{b_1} + \vec{m_1} \cdot \vec{m_1}}} = \frac{1 + 0 + 0 + 0}{\sqrt{1 + 1}\,\sqrt{1 + 1}} = \frac{1}{2}.$$

Collaborative-filtering systems such as GroupLens [Resnick et al. 1994] use a similar vector model, with each dimension being a "vote" of the user for a particular item. However, they use the Pearson Correlation Coefficient as a similarity measure, which first subtracts the average of the elements from each of the vectors before computing their cosine similarity. Formally, this similarity is given by the formula:

$$c(X, Y) = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2 \sum_j (y_j - \bar{y})^2}},$$

where $x_j$ is the value of vector $X$ in dimension $j$, $\bar{x}$ is the average value of $X$ along a dimension, and the summation is over all dimensions in which both $X$ and $Y$ are nonzero [Resnick et al. 1994]. Inverse User Frequency may be used to weight the different components of the vectors. There have also been other enhancements, such as default voting and case amplification [Breese et al. 1998], which modify the values of the vectors along the various dimensions.

2.3 Measures Exploiting a Hierarchy

There are quite a few distance measures proposed in the literature that may be adapted to compute similarity while making use of a hierarchy. One such measure, popular in a variety of domains, is Earth-mover's distance [Rubner et al. 1998; Chakrabarti et al. 2000]. This distance computes the dissimilarity between two collections of points in space by calculating the work to be done in moving "mounds of earth," located at points in the first collection, to fill "holes," located at the points in the second collection. It is possible to map bags in our domain into collections of points in space by using the hierarchy to compute the distance between any two elements of the domain. We could thus apply Earth-mover's distance to our problem, but it proves unsatisfactory for several reasons, some of which are demonstrated by our user study, described in Section 5.2. We also provide a detailed analysis of the reasons underlying the inapplicability of such measures in the extended version of our article [Ganesan et al. 2002].

As noted in the introduction, there have been attempts to exploit hierarchical URL structure in clustering user sessions from a Web log [Joshi and Krishnapuram 2000; Nasraoui et al. 1999]. Unfortunately, the proposed measure turns out to be unintuitive, and can produce arbitrarily low similarity values between identical sessions. There are also other measures, such as tree edit distance and MAC [Ioannidis and Poosala 1999], which could conceivably be applied to our problem. We provide more details on these measures in Section 6, but, here, we simply note that neither of these measures proves to be a good fit for the problem at hand.
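Before turning to hierarchy-aware measures, the following is a minimal sketch of the Pearson-style similarity from the GroupLens discussion above, not the GroupLens implementation itself. The dict-based sparse representation, and the choice to average each user's votes over all of his or her rated items, are assumptions.

```python
import math

def pearson_sim(x, y):
    """Pearson correlation between two sparse vote vectors, given as
    dicts mapping item -> vote, summed over the dimensions in which
    both vectors are nonzero, in the style of Resnick et al. [1994]."""
    common = set(x) & set(y)          # dimensions where both are nonzero
    if not common:
        return 0.0
    # Assumption: the average is taken over all of each user's votes.
    x_bar = sum(x.values()) / len(x)
    y_bar = sum(y.values()) / len(y)
    num = sum((x[j] - x_bar) * (y[j] - y_bar) for j in common)
    den = math.sqrt(sum((x[j] - x_bar) ** 2 for j in common) *
                    sum((y[j] - y_bar) ** 2 for j in common))
    return num / den if den else 0.0
```

An Inverse User Frequency weight would multiply each term by a per-dimension factor, and default voting or case amplification would transform the vote values before the sums are taken.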
Fig. 3. Induced trees for collections A and B.

3. THE FIRST GENERATION

We now describe two new measures we developed, based fairly directly on the traditional measures, that exploit a hierarchical domain structure in computing similarity. We first describe our model formally, define some associated concepts, and then proceed to develop the measures.

3.1 The Model

Let $U$ be a rooted tree, with all nodes carrying a distinct label. We do not impose any restrictions on the shape of $U$: each node can have arbitrary fanout, and the leaves of $U$ can be at different levels. Let $L_U$ be the set of all labels in $U$, and let $LL_U$ be the set of all labels on the leaves of $U$. $LL_U$ is the element domain, on which there is a superimposed hierarchy described by $U$. In our music example, $LL_U = \{b_1, b_2, \ldots, s_1, s_2, \ldots, m_1, m_2, \ldots, c_1, c_2, \ldots\}$. A collection $C$ is a bag whose elements are drawn from $LL_U$.

Let $W$ be a function from $LL_U$ to the set of real numbers. $W$ is an a priori weight function on the leaves of $U$, which captures the relative importance of different elements. There are many ways of deriving this weight function. It could be an Inverse User Frequency such as the one defined in Breese et al. [1998]. It could also be corpus-independent, determined by attributes of the elements such as their cost (in monetary terms). Of course, the weight function can also be uniform.

Since there is a hierarchical structure imposed on $LL_U$, a collection $C$ induces a tree: the subgraph of $U$ that consists of the ancestral paths of each leaf in $C$. We refer to trees that are induced in this manner as induced trees. Notice that, since $C$ is a bag, the induced tree might have more than one leaf with the same label. Figure 3 shows the induced trees for the collections A and B from Figure 1.

As is conventional, the depth of a node in the hierarchy is the number of edges on the path from the root of $U$ to that node. Given any two leaves $l_1$ and $l_2$ in $U$, define the Lowest Common Ancestor $LCA(l_1, l_2)$ to be the node of greatest depth that is an ancestor of both $l_1$ and $l_2$. This LCA is always well defined, since the two leaves have at least one common ancestor (the root node), and no two common ancestors can have the same depth. In Figure 1, $LCA(b_1, b_2) = b$, and $LCA(b_1, m_1) = r$.
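To ground these definitions, here is a minimal sketch in Python. The parent map is a hypothetical reconstruction of the Figure 1 hierarchy (a root r over sections, artists b, s, m, c, and CDs as leaves); the figure itself is not reproduced here, so treat the exact shape as an assumption.

```python
# Hypothetical child -> parent map for the music hierarchy of Figure 1.
PARENT = {
    "rock": "r", "classical": "r",               # sections, depth 1
    "b": "rock", "s": "rock",                     # artists, depth 2
    "m": "classical", "c": "classical",
    "b1": "b", "b2": "b", "b3": "b", "b4": "b",   # CDs (leaves), depth 3
    "s1": "s", "s2": "s", "m1": "m", "c1": "c",
}

def ancestral_path(node):
    """Nodes from `node` up to the root r, inclusive, deepest first."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    """Number of edges on the path from the root of U to the node."""
    return len(ancestral_path(node)) - 1

def lca(l1, l2):
    """Lowest Common Ancestor: the deepest node ancestral to both leaves."""
    anc1 = set(ancestral_path(l1))
    return next(n for n in ancestral_path(l2) if n in anc1)

def induced_tree(collection):
    """Node set of the tree induced by a collection of leaves."""
    return {n for leaf in collection for n in ancestral_path(leaf)}

assert lca("b1", "b2") == "b" and lca("b1", "m1") == "r"
```

Representing the induced tree as a node set discards leaf multiplicities, which suffices here because the measures below consult the tree only for ancestor membership; the bag itself is kept separately.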
3.2 The Generalized Vector-Space Model

To illustrate how the Vector-Space Model can be generalized to take the hierarchy into account, consider Figure 1 again. Let us say that the unit vector corresponding to a leaf $l$ is represented by $\vec{l}$. Now, according to the traditional cosine-similarity measure, all leaf unit vectors are perpendicular to each other, which means that the dot product of any two of them is zero. The dot product of a unit vector with itself is equal to 1.

We have already observed that $b_1$ is, intuitively, somewhat similar to $b_3$, since they are both Beatles CDs. Thus, if A buys $b_1$ and B buys $b_3$, we need to make this fact contribute something to the similarity of A and B; that is, we want $\vec{b_1} \cdot \vec{b_3}$ to be nonzero. In the vector space, we want to assert that $\vec{b_1}$ and $\vec{b_3}$ are not really perpendicular to each other, since they are somewhat similar.

We use the hierarchy to decide exactly what value to assign to this dot product. For example, let us decide that $\vec{b_1} \cdot \vec{b_3} = \frac{2}{3}$, since they have a common ancestor that is two-thirds of the way down from the root. By a similar reasoning process, we let $\vec{b_1} \cdot \vec{s_1}$ be $\frac{1}{3}$. We let $\vec{b_1} \cdot \vec{m_1}$ continue to be 0, since they are in different sections of the hierarchy and don't seem to have anything to do with each other, except for the fact that they are both music CDs.

Formally, let the set of leaf labels $LL_U$ be $\{l_1, l_2, l_3, \ldots, l_n\}$, and let $Count_A(l_i)$ be the number of times $l_i$ occurs in collection $A$. Then, collection $A$ is represented by the vector $\vec{A} = \sum_{i=1}^{n} a_i \vec{l_i}$, where $a_i = W(l_i) * Count_A(l_i)$ for $i = 1..n$. This usage of weights is identical to the standard Vector-Space Model's.

For any two elements $l_1$ and $l_2$, we define

$$\vec{l_1} \cdot \vec{l_2} = \frac{2 * depth(LCA_U(l_1, l_2))}{depth(l_1) + depth(l_2)}.$$

This definition is consistent, since the right side of this equation always lies between 0 and 1. Note that the dot product is equal to 1 if and only if $l_1 = l_2$.

We continue to measure similarity by the cosine-similarity measure, except that we have now dropped the assumption that the different "components" of the vector are perpendicular to each other. If collection $A$ is represented by the vector $\vec{A} = \sum_i a_i \vec{l_i}$ and $B$ by the vector $\vec{B} = \sum_i b_i \vec{l_i}$, then

$$\vec{A} \cdot \vec{B} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \, \vec{l_i} \cdot \vec{l_j}.$$

Again, this equation is identical to the one in the standard Vector-Space Model, except that $\vec{l_i} \cdot \vec{l_j}$ is not equal to 0 whenever $i \neq j$. Finally, the cosine similarity between $A$ and $B$ is given by the traditional formula:

$$sim(A, B) = \frac{\vec{A} \cdot \vec{B}}{\sqrt{\vec{A} \cdot \vec{A}}\,\sqrt{\vec{B} \cdot \vec{B}}}.$$

We call this measure the Generalized Cosine-Similarity Measure (GCSM).
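A sketch of GCSM under the assumed hierarchy above, reusing `depth` and `lca` from the sketch in Section 3.1; the uniform default weight is an assumption.

```python
import math
from collections import Counter

def dot(li, lj):
    """Generalized dot product of two leaf unit vectors: deeper LCAs
    yield larger values, and dot(l, l) == 1."""
    return 2 * depth(lca(li, lj)) / (depth(li) + depth(lj))

def gcsm(a, b, w=lambda l: 1.0):
    """Generalized Cosine-Similarity Measure between two bags of leaves,
    with vector coefficients a_i = W(l_i) * Count_A(l_i)."""
    ca, cb = Counter(a), Counter(b)
    def inner(x, y):
        # Double sum over all label pairs; cross terms are nonzero here.
        return sum(w(li) * x[li] * w(lj) * y[lj] * dot(li, lj)
                   for li in x for lj in y)
    return inner(ca, cb) / math.sqrt(inner(ca, ca) * inner(cb, cb))

# A = {b1, b2} and B = {b3, b4}: every cross pair contributes 2/3.
print(gcsm(["b1", "b2"], ["b3", "b4"]))   # ~0.8
```

Note that the denominator also uses the generalized dot product, so $\vec{A} \cdot \vec{A}$ includes cross terms between distinct leaves of the same collection.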
3.3 The Optimistic Genealogy Measure

The Generalized Cosine-Similarity Measure from Section 3.2 is not the only, or even the most intuitive, way to exploit a hierarchy for similarity. Next we present a second, more natural and intuitive measure, and contrast it with GCSM. Intuitively, the Optimistic Genealogy Measure (the reason for the name becomes clear in the next section) computes a "similarity contribution" for each element in one collection, and then takes the weighted average of these contributions to be the similarity between the two collections. The contribution of an element is determined by how good a "match" it has in the other collection.

Let $C_1$ and $C_2$ be the collections to be compared, and let $T_1$ and $T_2$ be their induced trees as defined in Section 3.1. For any leaf $l_1$ in $T_1$, define $LCA_{T_1,T_2}(l_1)$ to be the ancestor of $l_1$ of greatest depth that is present in $T_2$, that is, the lowest of the LCAs that $l_1$ shares with the leaves of $T_2$. This LCA provides an indication of how good the "best match" for $l_1$ can be. For example, for the trees in Figure 3, $LCA_{A,B}(b_1)$ is Beatles, since it is present in tree B, and is the lowest ancestor of $b_1$ that is present in B. (We abuse notation and let A and B refer both to the two collections and to their corresponding induced trees.) Now define:

$$match_{T_1,T_2}(l_1) = \{l_2 \in C_2 \mid LCA(l_1, l_2) = LCA_{T_1,T_2}(l_1)\}.$$

That is, $match_{T_1,T_2}(l_1)$ is the set of all leaves in $T_2$ that can be the "best match" for $l_1$. In Figure 3, $match_{A,B}(b_1)$ is the set $\{b_3, b_4\}$, since both elements match $b_1$ at its parent Beatles. Next, we define:

$$leafsim_{T_1,T_2}(l_1) = \frac{depth(LCA_{T_1,T_2}(l_1))}{depth(l_1)}.$$

The value $leafsim_{T_1,T_2}(l_1)$ measures how similar $l_1$ is to its best match in $T_2$. If $l_1$ itself is present in $T_2$, then $LCA_{T_1,T_2}(l_1) = l_1$, and therefore $leafsim_{T_1,T_2}(l_1) = 1$. On the other hand, if no ancestor of $l_1$ except for the root is present in $T_2$, we have $depth(LCA_{T_1,T_2}(l_1)) = 0$ and, therefore, $leafsim_{T_1,T_2}(l_1) = 0$. In Figure 3, $leafsim_{A,B}(b_1)$ is $\frac{2}{3}$ and $leafsim_{A,B}(b_2)$ is also $\frac{2}{3}$.

Finally, for any two collections $C_1$ and $C_2$ with associated induced trees $T_1$ and $T_2$, respectively, we define the Optimistic Genealogy Measure (OGM) as

$$sim(C_1, C_2) = \frac{\sum_{l_1 \in C_1} leafsim_{T_1,T_2}(l_1) * W(l_1)}{\sum_{l_1 \in C_1} W(l_1)}. \qquad (1)$$

This is just the weighted average of the individual leafsim values of the leaves in $T_1$. Note that, since $C_1$ is a bag, the summation is over all members of the bag, not over the set of distinct labels. In our example, $sim(A, B)$ is also $\frac{2}{3}$, since the contributions from $b_1$ and $b_2$ are identical.

Note that OGM is, in general, asymmetric; that is, $sim(A, B) \neq sim(B, A)$. The symmetric similarity score between A and B is defined to be the average of the two asymmetric scores.
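A sketch of OGM in the same setting, building on `ancestral_path`, `depth`, and `induced_tree` from Section 3.1. Since the root belongs to every nonempty induced tree, $LCA_{T_1,T_2}$ can be found by walking up from each leaf; the bag is passed as a list, so multiplicities are honored even though the induced tree is a set.

```python
def ogm(c1, c2, w=lambda l: 1.0):
    """One-directional OGM score sim(C1, C2) of Equation (1)."""
    t2 = induced_tree(c2)
    def leafsim(l1):
        # Deepest ancestor of l1 (possibly l1 itself) present in T2;
        # the walk always terminates because the root r is in T2.
        best = next(n for n in ancestral_path(l1) if n in t2)
        return depth(best) / depth(l1)
    return (sum(leafsim(l) * w(l) for l in c1) /
            sum(w(l) for l in c1))

def ogm_sym(c1, c2, w=lambda l: 1.0):
    """Symmetric score: average of the two asymmetric OGM scores."""
    return (ogm(c1, c2, w) + ogm(c2, c1, w)) / 2

print(ogm(["b1", "b2"], ["b3", "b4"]))   # ~2/3: both leaves match at b
```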
3.4 Discussion

Table I shows the similarity values computed by the various traditional measures discussed in Section 2, as well as by GCSM and OGM, for the collections in Figure 1. JC stands for Jaccard's Coefficient and DC for Dice's Coefficient. We compute similarity between every pair of customers, except for customer E, whom we use in a subsequent example. The values shown are symmetric similarity values, with the average of the two asymmetric values being used for OGM. (A code sketch at the end of this section recomputes the GCSM and OGM columns.)

Table I. Comparison of the Various Measures

    sim    JC     DC     CSM    GCSM   OGM
    A, B   0      0      0      0.8    0.67
    A, C   0      0      0      0.4    0.33
    A, D   0.33   0.5    0.5    0.65   0.67
    B, C   0      0      0      0.4    0.33
    B, D   0      0      0      0.52   0.5
    C, D   0      0      0      0.26   0.25

As motivated in Section 1, we would expect to find that customers A and B are more similar to each other than A and C. C and D should be even less similar. From Table I, we see that both of our First-Generation measures produce this result, whereas the traditional measures do not.

Intuitively, it is not clear whether sim(A, D) should be higher than sim(A, B). There is a case for saying that sim(A, B) is higher, since both A and B are "pure" Beatles persons. One could also contend that A and D have a CD in common, whereas A and B have none, and, therefore, that sim(A, D) ought to be higher. OGM gives them the same similarity values, and GCSM makes sim(A, B) higher. The traditional measures claim that sim(A, D) is higher, since they do not detect any similarity between A and B. GCSM and OGM can be tuned to adjust the conclusion in cases such as these. We discuss how to achieve this tuning in the extended version of this article [Ganesan et al. 2002].

3.4.1 Contrasting GCSM with OGM. Having seen how the First-Generation measures fare on our simple example when compared with the traditional measures, we now examine the basic differences between GCSM and OGM.

— GCSM uses many-to-many matches, whereas OGM uses many-to-one matches. In GCSM, the similarity contribution of an element in one collection is gathered from all elements in the other collection that have a nonzero similarity to that element. On the other hand, OGM simply uses the best similarity score it can find for each element.

— GCSM is a symmetric measure, which means that we will not get high similarity scores if one collection is a subset of the other [Shivakumar and Garcia-Molina 1995]. OGM can be used as an asymmetric measure, and conveys more information that may help us identify different semantic notions of similarity. For example, if we wanted to find an "expert" for a particular user A (i.e., someone who is knowledgeable about the things that A buys), we would look for a user B whose purchases are close to a superset of A's purchases. Thus sim(A, B) would be very high, but sim(B, A) might be fairly low.

— GCSM has worst-case complexity quadratic in the number of elements in the two collections, whereas OGM has complexity linear in the number of nodes in the induced trees of the two collections.
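Tying the sketches together, the driver below recomputes the GCSM and OGM columns of Table I using the functions defined above. The collections are reconstructed from the worked examples in the text (A and B are "pure" Beatles buyers, and A and D share the CD b1); since Figure 1 is not reproduced here, treat the exact memberships of C and D as assumptions.

```python
# Hypothetical reconstruction of the collections of Figure 1.
A, B, C, D = ["b1", "b2"], ["b3", "b4"], ["s1", "s2"], ["b1", "m1"]

for name, (x, y) in {"A,B": (A, B), "A,C": (A, C), "A,D": (A, D),
                     "B,C": (B, C), "B,D": (B, D), "C,D": (C, D)}.items():
    print(name, round(gcsm(x, y), 2), round(ogm_sym(x, y), 2))
# Under the assumed hierarchy this reproduces Table I:
# e.g. A,B -> 0.8 and 0.67; B,D -> 0.52 and 0.5; C,D -> 0.26 and 0.25.
```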