Finding Experts By Semantic Matching of User Profiles Rajesh Thiagarajan, Geetha Manjunath", and Markus Stumptner Advanced Computing Research Centre, University of South Australia Icisrkt, mst ecs unisa.edu.au 2 Hewlett-Packard Laboratories. India geetha.manjunathghp.com Abstract. Extracting interest profiles of users based on their personal documents ne of the key topics of IR research. However, when these extracted profiles are used in expert finding applications, only naive text-matching techniques an sed to rank experts for a given requirement. In this paper, we address this gap and escribe multiple techniques to match user profiles for better ranking of experts. le propose new metrics for computing semantic similarity of user profiles using spreading activation networks derived from ontologies. Our pilot evaluation shows that matching algorithms based on bipartite graphs over semantic user profiles provide the best results. We show that using these techniques, we can find an expert more accurately than other approaches, in particular within the top ranked results. In applications where a group of candidate users need to be short-listed (say, for a job interview ), we get very good precision and recall as well. 1 Introduction The problem of finding experts on a given set of topics is important for many lines of business e. g, consulting, recruitment, e-business. In these applications, one common way to model a user is with a user profile which is a set of topics with weights determining his level of interest. When used for personalization, these user profiles matched with a retrieved documentmay be a search result) for checking its relevance to him. a similar matching technique can be used for expert finding as well- wherein we first formulate the requis then carried out by matching the query profile with the available/extracted ment(query) as an expected profile of the expert who is sought after. Expert finding expert user profiles In the above context, automatic extraction of topics of expertise (interest) of a person ased on the documents authored(accessed) by the person through information extraction techniques is well known. However, when these extracted profiles are used for expert finding, the profile matching is often carried out by applying traditional content matching techniques which miss most potential candidates if the query is only an approximate escription of the expert (as is usually e. In thi multiple approaches for semantic matching of user profiles to enable better expert-finding in such Let us briefly look at the challenges in comparing user profiles. User profiles are generally represented in the bag-of-words(Bow) format-a set of weighted terms that describe the interest or the expertise of a user. The most commonly used content match ng technique is cosine similarity -cosine between the bOw vector representing the user profile and that of the document to match. Although this simple matching technique suf fices in a number of content matching applications, it is well known that considering just the words leads to problems due to lack of semantics in the representation Problems due to polysemy(terms such as apple, jaguar having two different meanings)and synonymy (two words meaning almost the same thing such as glad and happy)can be solved if profiles are described using semantic concepts instead of words. Once again simple
Finding Experts By Semantic Matching of User Profiles Rajesh Thiagarajan1 , Geetha Manjunath2 , and Markus Stumptner1 1 Advanced Computing Research Centre, University of South Australia {cisrkt,mst}@cs.unisa.edu.au 2 Hewlett-Packard Laboratories, India geetha.manjunath@hp.com Abstract. Extracting interest profiles of users based on their personal documents is one of the key topics of IR research. However, when these extracted profiles are used in expert finding applications, only naive text-matching techniques are used to rank experts for a given requirement. In this paper, we address this gap and describe multiple techniques to match user profiles for better ranking of experts. We propose new metrics for computing semantic similarity of user profiles using spreading activation networks derived from ontologies. Our pilot evaluation shows that matching algorithms based on bipartite graphs over semantic user profiles provide the best results. We show that using these techniques, we can find an expert more accurately than other approaches, in particular within the top ranked results. In applications where a group of candidate users need to be short-listed (say, for a job interview), we get very good precision and recall as well. 1 Introduction The problem of finding experts on a given set of topics is important for many lines of business e.g., consulting, recruitment, e-business. In these applications, one common way to model a user is with a user profile which is a set of topics with weights determining his level of interest. When used for personalization, these user profiles matched with a retrieved document (may be a search result) for checking its relevance to him. A similar matching technique can be used for expert finding as well - wherein we first formulate the requirement (query) as an expected profile of the expert who is sought after. Expert finding is then carried out by matching the query profile with the available/extracted expert user profiles. In the above context, automatic extraction of topics of expertise (interest) of a person based on the documents authored (accessed) by the person through information extraction techniques is well known. However, when these extracted profiles are used for expert finding, the profile matching is often carried out by applying traditional content matching techniques which miss most potential candidates if the query is only an approximate description of the expert (as is usually the case). In this paper, we propose and evaluate multiple approaches for semantic matching of user profiles to enable better expert-finding in such cases. Let us briefly look at the challenges in comparing user profiles. User profiles are generally represented in the bag-of-words (BOW) format - a set of weighted terms that describe the interest or the expertise of a user. The most commonly used content matching technique is cosine similarity - cosine between the BOW vector representing the user profile and that of the document to match. Although this simple matching technique suf- fices in a number of content matching applications, it is well known that considering just the words leads to problems due to lack of semantics in the representation. Problems due to polysemy (terms such as apple, jaguar having two different meanings) and synonymy (two words meaning almost the same thing such as glad and happy) can be solved if profiles are described using semantic concepts instead of words. Once again simple
matching techniques can be used on these bags-of-concepts(BOC). However, these ap. proaches fail to determine the right matches when there is no direct overlap/intersection in the concepts. For example, do two users with Yahoo and Google in their respective profiles have nothing in common? There does seem to be an intersection in these user interests for Web-based IT companies or web search tools! Such overlaps are missed as current approaches work under the assumption that the profile representations (B contain all the information about the user. As a result, relationships that are not explicit in the representations are usually ignored. Furthermore, these mechanisms cannot handle user profiles that are at different levels of granularity or abstractions(e. g, jazz and musi as the implicit relationship between the concepts is ignored In this paper, we solve the above issues in user profile matching through effective use of ontologies. We define the notion of semantic similarity between two user profiles to consider inherent relationships between concepts/words appearing in their respective BOW representation. We use the process of spreading to include additional related terms to a user profile by referring to an ontology (wordnet or wikipedia)and experiment with multiple techniques to enable better profile matching. We propose simple metrics for computing similarity between two user profiles with ontology-based Spreading Activation NetworkS(SAN). We evaluate multiple mechanisms for extending user profiles(set and graph based spreading) and semantic matching(set intersection and bipartite graphs) of profiles. We show the effectiveness of our user profile matching techniques for accuracy in expert-ranking as well as candidate selection. From a given set of user profiles, our bipartite-graph based algorithms can accurately spot an expert just within its top three ranks. In applications where a group of candidate users need to be found (for a job interview), we get very good precision and recall as well The organization of the rest of this document is as follows. We describe different related research efforts for profile building and ontology-based semantic matching tech- section 2 followed by a brief section giving some background and definitions needed to understand our solution. An overview of the our spreading process is presented in Section 4. We present our new similarity measures in Section 5. We describe our evaluation procedure for expert finding in Section 6 and share our improved results. We summarize our contributions and state possible future work in Section 7 2 Related work Determining interest profiles of users based on their personal documents is an important research topic in information extraction and a number of techniques to achieve this have been proposed. Expert finding techniques that combine multiple sources of expertise evidence such as academic papers and social citation network have also bee proposed [1] User profiles have been extracted using multiple types of corpora-utilizing knowledge about the expert in Wikipedia [2], analysing the expert's documents [3-5], and analysing openly accessible research contributions of the expert [6]. Use of Wikipedia corpus to generate semantic user profiles [7]have been seen. Pre-processing the profile terms by mapping terms to such ontology concepts prior to computing cosine similarity has been shown to yield better matching [ 3]. A number of traditional similarity measurement techniques such as the cosine similarity measure or term vector similarity [8, 9), Dice's efficient [10] and Jaccards index [ll] are used in profile matching. For example, Jaccards index is used in [2] to match expert profiles constructed using Wikipedia knowledge. This approach will not determine a semantic inexact match when there is no direct overlap in the concepts in the two user profiles. Use of knowledge obtained from
matching techniques can be used on these bags-of-concepts (BOC). However, these approaches fail to determine the right matches when there is no direct overlap/intersection in the concepts. For example, do two users with Yahoo and Google in their respective profiles have nothing in common? There does seem to be an intersection in these users’ interests for Web-based IT companies or web search tools! Such overlaps are missed as current approaches work under the assumption that the profile representations (BOW) contain all the information about the user. As a result, relationships that are not explicit in the representations are usually ignored. Furthermore, these mechanisms cannot handle user profiles that are at different levels of granularity or abstractions (e.g., jazz and music) as the implicit relationship between the concepts is ignored. In this paper, we solve the above issues in user profile matching through effective use of ontologies. We define the notion of semantic similarity between two user profiles to consider inherent relationships between concepts/words appearing in their respective BOW representation. We use the process of spreading to include additional related terms to a user profile by referring to an ontology (Wordnet or Wikipedia) and experiment with multiple techniques to enable better profile matching. We propose simple metrics for computing similarity between two user profiles with ontology-based Spreading Activation Networks (SAN). We evaluate multiple mechanisms for extending user profiles (set and graph based spreading) and semantic matching (set intersection and bipartite graphs) of profiles. We show the effectiveness of our user profile matching techniques for accuracy in expert-ranking as well as candidate selection. From a given set of user profiles, our bipartite-graph based algorithms can accurately spot an expert just within its top three ranks. In applications where a group of candidate users need to be found (for a job interview), we get very good precision and recall as well. The organization of the rest of this document is as follows. We describe different related research efforts for profile building and ontology-based semantic matching techniques in section 2 followed by a brief section giving some background and definitions needed to understand our solution. An overview of the our spreading process is presented in Section 4. We present our new similarity measures in Section 5. We describe our evaluation procedure for expert finding in Section 6 and share our improved results. We summarize our contributions and state possible future work in Section 7. 2 Related Work Determining interest profiles of users based on their personal documents is an important research topic in information extraction and a number of techniques to achieve this have been proposed. Expert finding techniques that combine multiple sources of expertise evidence such as academic papers and social citation network have also bee proposed [1]. User profiles have been extracted using multiple types of corpora - utilizing knowledge about the expert in Wikipedia [2], analysing the expert’s documents [3–5], and analysing openly accessible research contributions of the expert [6]. Use of Wikipedia corpus to generate semantic user profiles [7] have been seen. Pre-processing the profile terms by mapping terms to such ontology concepts prior to computing cosine similarity has been shown to yield better matching [3]. A number of traditional similarity measurement techniques such as the cosine similarity measure or term vector similarity [8, 9], Dice’s coefficient [10] and Jaccard’s index [11] are used in profile matching. For example, Jaccard’s index is used in [2] to match expert profiles constructed using Wikipedia knowledge.This approach will not determine a semantic inexact match when there is no direct overlap in the concepts in the two user profiles. Use of knowledge obtained from
an ontology, in our solution, enables similarity checks when there are no direct overlaps between user profiles and, therefore, result in more accurate similarity measurements. blem of automated routing of conference papers to their rev somewhat related problem to that of expert finding. Most of the current approaches to that problem use a group of papers authored by reviewers to determine their user profile and perform routine content matching(similar to personalization)to determine whether a paper is fit to be reviewed by that user [12]. The expert finding task introduced by TREC 2005 [13] requires one to provide a ranked list of the candidate experts based on the web data provided. Our attem handle the problem of choosing the best expert description of a hypothetical expert(set of topics with weights) profiles of candidate experts Use of ontologies to derive new concepts that are likely to be of interest to the user through semantic spreading activation networks has been studied as well [14-17, 5]. Pre- ious studies have shown that the spreading process improves accuracy and overco the challenges caused by inherent relationships and Polysemy in word sense disam- biguation process [15, 16] and ontology mapping [17]. We use this spreading process to facilitate the semantic similarity computation. We build on the spreading process used in [5] to learn user preferences in order to drive a personalized multimedia search. The learning process utilizes ontologies as a means to comprehend user interests (in BOw format)and establishes the need to consider related concepts to improve search quality While the results in [5] suggest that personalized search is of better quality in comparison to normal search, they do not show whether the consideration of related terms contributes to these improvements. On the other hand, we show that our spreading process indeed improves the accuracy of our new similarity measures and in the particular context of user profile matching A number of approaches have already been proposed to determine the similarity between two ontology concepts(or words). These determine similarity by: measurin the path distance between them [18], evaluating shared information between them [19] recursively matching sub-graphs[20), combining information from various sources [21]. are only able to determine closeness between two concepts(or words), we compute similarity between two weighted sets of concepts(or words). One of our algorithms us the simple path measure described in [18]over a bipartite graph to determine such a set intersection Ve now compare with other works that use San based IR techniques. One of our imilarity measures is similar to the one discussed in[25] but differs in the treatment of the results of the activation process. While the previous work utilizes the results of the activation to rank documents with respect to a query, our work maps an aggregate of the activation results to a similarity value. Knowledge from an ontology is used to extend the bow with terms that share important relationships with original terms to improve document retrieval is presented in [4]. Our work on set spreading is somewhat similar to this but we further explore the notion of computing similarity by optimal concept matching in bipartite graphs and using SAN 3 Background In this section, we formally define and explain some terms used in the rest of the document
an ontology,in our solution, enables similarity checks when there are no direct overlaps between user profiles and, therefore, result in more accurate similarity measurements. The problem of automated routing of conference papers to their reviewers is a somewhat related problem to that of expert finding. Most of the current approaches to that problem use a group of papers authored by reviewers to determine their user profile and perform routine content matching (similar to personalization) to determine whether a paper is fit to be reviewed by that user [12]. The expert finding task introduced by TREC 2005 [13] requires one to provide a ranked list of the candidate experts based on the web data provided. Our attempt is to handle the problem of choosing the best expert given a description of a hypothetical expert (set of topics with weights) and a set of user profiles of candidate experts. Use of ontologies to derive new concepts that are likely to be of interest to the user through semantic spreading activation networks has been studied as well [14–17, 5]. Previous studies have shown that the spreading process improves accuracy and overcomes the challenges caused by inherent relationships and Polysemy in word sense disambiguation process [15, 16] and ontology mapping [17]. We use this spreading process to facilitate the semantic similarity computation. We build on the spreading process used in [5] to learn user preferences in order to drive a personalized multimedia search. The learning process utilizes ontologies as a means to comprehend user interests (in BOW format) and establishes the need to consider related concepts to improve search quality. While the results in [5] suggest that personalized search is of better quality in comparison to normal search, they do not show whether the consideration of related terms contributes to these improvements. On the other hand, we show that our spreading process indeed improves the accuracy of our new similarity measures and in the particular context of user profile matching. A number of approaches have already been proposed to determine the similarity between two ontology concepts (or words). These determine similarity by: measuring the path distance between them [18], evaluating shared information between them [19], recursively matching sub-graphs [20], combining information from various sources [21], analysing structure of the ontology [22], and combining content analysis and web search [23]. A few other measures are evaluated in [24]. While all these approaches are only able to determine closeness between two concepts (or words), we compute similarity between two weighted sets of concepts (or words). One of our algorithms use the simple path measure described in [18] over a bipartite graph to determine such a set intersection. We now compare with other works that use SAN based IR techniques. One of our similarity measures is similar to the one discussed in [25] but differs in the treatment of the results of the activation process. While the previous work utilizes the results of the activation to rank documents with respect to a query, our work maps an aggregate of the activation results to a similarity value. Knowledge from an ontology is used to extend the BOW with terms that share important relationships with original terms to improve document retrieval is presented in [4]. Our work on set spreading is somewhat similar to this but we further explore the notion of computing similarity by optimal concept matching in bipartite graphs and using SAN. 3 Background In this section, we formally define and explain some terms used in the rest of the document
Definition 1(User Profile). An user profile, u is a set of binary tuples ((t1, w1),..., (tn, wn)) where t; are the terms that describes the user and w; denotes the importance of ti in describing the user: We use terms(u) to denote the set of terms ti in the profile Cosine Similarity: The BOw representation is typically used for computing cosine similarity between the user profiles. If the vector representation of a user profile u is V(u)and the Euclidean length(V(ui )D of an entity u, is vei wi, the similarity simcos(uj, uk)= cos(V(u,),V(uk)) V(uy)·V(uk) v(au)川V(uk) Spreading: Spreading is the process of including the terms that are related to the original terms in an user profile by referring to an ontology. Let us study the earlier mentioned simple example of two users having google and yahoo in their profile in detail to understand the spreading process bette Example 1. Consider computing the similarity of the following users mple intersection check between the profiles result in an empty set (i.e. u1 nu2=0) indicating their un-relatedness (cosine similarity is O). However, if we were to manually judge the similarity of these two users we would give it a value greater than 0. This is because we judge the similarity not just by considering the two terms from the profiles but also by considering the relationships that might exist between them due to our prior knowledge. We are able to establish the fact that both google and yahoo are search engine providers Now let us see the effectiveness of spreading in the similarity computation process in the same example. Spreading the profiles u1 and u2, by referring to Wikipedia parent ategory relationship, extends the profiles to ui=igoogle, 1.0), internet search engines, 0.5)), and The sim, ((yahoo, 2.0),(internet search engines, 1.0) The simple intersection check results in a non-empty set (i.e.uinu?fO)indicating their relatedness(cosine similarity is 0.2). The result of the spreading(i.e. the inclusion of the related term internet search engines) process makes sure that any relationship that exists between the profiles are taken into consideration 4 Spreading to Create Extended User Profiles In this section, we describe two techniques to compute and represent the extended user profiles(see example of section 3)using an ontology. An ontology O represents human knowledge about a certain domain as concepts, attributes and relationships between concepts in a well-defined hierarchy. It is usually represented as a graph where node are the concepts and edges are the relationship labelled with the type of relationship For the purpose of profile spreading we assume that all the terms ti describing an entity are mappable to concepts in a reference ontology. For example, all the terms ti in a BOW representation of a user profile maps to a concept in the Wordnet ontology. Given a term ti, the spreading process utilizes O to determine the terms that are related to t (denoted as related(ti). Although spreading the profiles with related terms allows for
Definition 1 (User Profile). An user profile, u is a set of binary tuples {ht1, w1i, . . . , htn, wni} where ti are the terms that describes the user and wi denotes the importance of ti in describing the user. We use terms(u) to denote the set of terms ti in the profile u. Cosine Similarity: The BOW representation is typically used for computing cosine similarity between the user profiles. If the vector representation of a user profile uj is −→V (uj ) and the Euclidean length (| −→V (uj )|) of an entity uj is pPn i=1 w2 i , the similarity of the entities uj and uk is (1) simcos(uj , uk) = cos( −→V (uj ), −→V (uk)) = −→V (uj ) · −→V (uk) | −→V (uj )||−→V (uk)| Spreading: Spreading is the process of including the terms that are related to the original terms in an user profile by referring to an ontology. Let us study the earlier mentioned simple example of two users having google and yahoo in their profile in detail to understand the spreading process better. Example 1. Consider computing the similarity of the following users – u1 = {hgoogle, 1.0i}, and – u2 = {hyahoo, 2.0i}. A simple intersection check between the profiles result in an empty set (i.e. u1 ∩ u2 = ∅) indicating their un-relatedness (cosine similarity is 0). However, if we were to manually judge the similarity of these two users we would give it a value greater than 0. This is because we judge the similarity not just by considering the two terms from the profiles but also by considering the relationships that might exist between them due to our prior knowledge. We are able to establish the fact that both google and yahoo are search engine providers. Now let us see the effectiveness of spreading in the similarity computation process in the same example. Spreading the profiles u1 and u2, by referring to Wikipedia parent category relationship, extends the profiles to – u 0 1 = {hgoogle, 1.0i,hinternet search engines, 0.5i}, and – u 0 2 = {hyahoo, 2.0i,hinternet search engines, 1.0i}. The simple intersection check results in a non-empty set (i.e. u 0 1 ∩ u 0 2 6= ∅) indicating their relatedness (cosine similarity is 0.2). The result of the spreading (i.e. the inclusion of the related term internet search engines) process makes sure that any relationship that exists between the profiles are taken into consideration. 4 Spreading to Create Extended User Profiles In this section, we describe two techniques to compute and represent the extended user profiles (see example of section 3) using an ontology. An ontology O represents human knowledge about a certain domain as concepts, attributes and relationships between concepts in a well-defined hierarchy. It is usually represented as a graph where nodes are the concepts and edges are the relationship labelled with the type of relationship. For the purpose of profile spreading we assume that all the terms ti describing an entity are mappable to concepts in a reference ontology. For example, all the terms ti in a BOW representation of a user profile maps to a concept in the Wordnet ontology. Given a term ti , the spreading process utilizes O to determine the terms that are related to ti (denoted as relatedO(ti)). Although spreading the profiles with related terms allows for
a comprehensive computation, uncontrolled addition of all the related terms leads to the dilution of the profiles with noise or unrelated terms. This dilution may have negative implications on the computation process where the similarity in the noise may contribute to the similarity values between entities. It is therefore desirable to have control over the types of relationships to be considered during this spreading process A SET SPREADER (a)Set Spreading (b) Graph Spreading Fig 1: Two Schemes for Profile Spreading he weights of the new related terms are proportional to the weights of the original term as the weight wi of a term ti indicates the importance of the term within a user profile. However, during spreading the weights of the related terms should differ accord- ing to the semantics of the relationships on the edge. For example, spreading based on wikipedia may be limited to only spreading along the parent categories. We therefore use a set of linear influence functions, one per relationship-type(role/property of an ontology), to control the spreading process. For example, a spreading process based on Wordnet limited to types synonym and antonym can have functions tij=wi X 0.9 and ti;=Wix-09 respectively. We propose two schemes for representing the related terms post-spreading: extended set and semantic network. 4.1 Set Spreading Depicted in Figure la, set spreading is a process of extending an user profile such that the related terms, which are determined with respect to an ontology, are just appended to the original set of terms. Set spreading is an iterative process. After each iteration, the related terms from the previous iterations are appended to the profile. The spreading process is terminated if there are no related terms to spread the profile with or after a fixed number of iterations 4.2 Graph spreading Shown in Figure 1b, graph spreading is the process where terms from two profiles and the related terms are build into a semantic network(SAN). Unlike set spreading, graph spreading preserves the relationship between a term in a profile and its related term in the form of a graph edge. This allows consideration of relationships based on their semantics on the same network. Graph spreading terminates like set spreading, or if there exists a path between every pair of the term nodes from the two profiles. This condition best uits the ontologies that have a top root element which subsumes the rest of the elements
a comprehensive computation, uncontrolled addition of all the related terms leads to the dilution of the profiles with noise or unrelated terms. This dilution may have negative implications on the computation process where the similarity in the noise may contribute to the similarity values between entities. It is therefore desirable to have control over the types of relationships to be considered during this spreading process. t1 w1 t2 w2 t3 w3 Ontology SET SPREADER # Types of relationships to spread upon (such as parentOf) # Weight functions for the spreaded terms # Termination condition (such as no. of iterations) Parameters t1r1 fr(w1) t2r1 fr(w2) t2s2 fs(w2) t1 w1 t2 w2 t3 w3 t3r1 fr(w3) t3r2 fr(w3) t3s1 fs(w3) # Iteration = 1 # Relationships = {r,s} # Weight functions = {fr,fs} (a) Set Spreading t11 w11 t12 w12 Ontology GRAPH SPREADER # Types of relationships to spread upon (such as parentOf) # Weight functions for the spreaded terms # Termination condition (such as no. of iterations) Parameters # Relationships = {r,s} # Weight functions = {fr,fs} t21 w21 t22 w22 Profile 1 Profile 2 t11 t12 t21 t22 t11r1 t12s1 t12s1 t21r1 t21s1 t22s1 r s fr(t11) fs(t12) (b) Graph Spreading Fig. 1: Two Schemes for Profile Spreading The weights of the new related terms are proportional to the weights of the original term as the weight wi of a term ti indicates the importance of the term within a user profile. However, during spreading the weights of the related terms should differ according to the semantics of the relationships on the edge. For example, spreading based on Wikipedia may be limited to only spreading along the parent categories. We therefore use a set of linear influence functions, one per relationship-type (role/property of an ontology), to control the spreading process. For example, a spreading process based on Wordnet limited to types synonym and antonym can have functions tij = wi × 0.9 and tij = wi × −0.9 respectively. We propose two schemes for representing the related terms post-spreading: extended set and semantic network. 4.1 Set Spreading Depicted in Figure 1a, set spreading is a process of extending an user profile such that the related terms, which are determined with respect to an ontology, are just appended to the original set of terms. Set spreading is an iterative process. After each iteration, the related terms from the previous iterations are appended to the profile. The spreading process is terminated if there are no related terms to spread the profile with or after a fixed number of iterations. 4.2 Graph spreading Shown in Figure 1b, graph spreading is the process where terms from two profiles and the related terms are build into a semantic network (SAN). Unlike set spreading, graph spreading preserves the relationship between a term in a profile and its related term in the form of a graph edge. This allows consideration of relationships based on their semantics on the same network. Graph spreading terminates like set spreading, or if there exists a path between every pair of the term nodes from the two profiles. This condition best suits the ontologies that have a top root element which subsumes the rest of the elements