Finding Experts on the Semantic Desktop

Gianluca Demartini and Claudia Niederée

L3S Research Center, Leibniz Universität Hannover
Appelstrasse 9a, 30167 Hannover, Germany
{demartini,niederee}@L3S.de

Abstract. Expert retrieval has attracted considerable attention because of the large economic impact it can have on enterprises. The classical dataset on which to perform this task is the company intranet (i.e., personal pages, e-mails, documents). We propose a new system for finding experts in the user’s desktop content. Looking at private documents and e-mails of the user, the system builds expert profiles for all the people named in the desktop. This allows the search system to focus on the user’s topics of interest, thus generating satisfactory results on topics well represented on the desktop. We show, with an artificial test collection, how the desktop content is appropriate for finding experts on the topics the user is interested in.

1 Introduction

Finding people who are experts on certain topics is a search task which has been mainly investigated in the enterprise context. Especially in big enterprises, topic areas can vary widely, partly because of diverse and distributed data sources. This peculiarity of enterprise datasets can strongly affect the quality of the results of the expert finding task [15, 16].

It is important to provide enterprise managers with high-quality expert recommendations. Managers need to build new project teams and to find people who can solve problems. Therefore, a high-precision tool for finding experts is needed. Moreover, not only managers need to find experts. In a highly collaborative environment, where team members are willing to share and to help each other, all employees should be able to find out which colleague to ask for help in solving issues.

If we want to achieve high-quality results while searching for experts, considering the user’s desktop content makes the search much more focused on the user’s interests, also because the desktop dataset will contain much more expertise evidence (on such topics) than the rest of the public enterprise intranet.

Classic expert search systems [9, 30, 21, 25, 26, 17] work on the entire enterprise knowledge available. This means that they use shared repositories, e-mail histories, forums, wikis, databases, personal home pages, and all the data that an enterprise creates and stores. This forces the system to consider a huge variety of topics, for example, from accountability to IT-specific issues. Our solution
focuses on using the user’s desktop content as expertise evidence, allowing the system to focus on the user’s topics of interest and thus to provide high-quality results for queries about such topics.

The system we propose first indexes the desktop content, also using metadata annotations that are produced by the Social Semantic Desktop system Nepomuk [19]. Our expert search system creates a vector space that includes the documents and the people that are present in the desktop content. After this step, when the desktop user issues a query of the type “Find experts on the topic...” + keywords, the system shows a ranked list of people that the user can contact for help. Preliminary experiments show the high precision of the expert search results on topics which are covered by the desktop content. A limitation of our system is that it can return only people that are present on the user’s desktop. Therefore, performance is poor when the desktop content (i.e., number of items and people) is limited, as for example for new employees, or when the queries differ from the main topics represented in the desktop.

The main contributions of the paper are:
– the description of how the Beagle++ system creates metadata regarding documents and people (Section 2.1).
– a new system for finding experts on a semantic desktop (Section 2.2).
– the description of possible test datasets: one composed of fictitious data and one containing real desktop content (Section 3).
– preliminary experimental results showing how a focused dataset leads to high-quality expert search results (Section 4).
– a review of the previous systems and formal models presented in the field of expert search and Personal Information Management (PIM) (Section 5).

2 System Architecture

2.1 Generating Metadata about People

In order to identify possible expert candidates and link them to desktop items, we used extractors from the Beagle++ Desktop Search Engine¹,² [13, 8].
These extractors identify the authors of documents and e-mails by analysing the structure and the content of each file. For storing the produced metadata (see Figure 1) we employ the RDF repository developed in the Nepomuk project [19]. It is based on Sesame³ for storing, querying, and reasoning about RDF and RDF Schema, as well as on Lucene⁴, which is integrated with the Sesame framework via the LuceneSail [27], for full-text search.

1 http://beagle2.kbs.uni-hannover.de
2 http://www.youtube.com/watch?v=Ui4GDkcR7-U
3 http://www.openrdf.org
4 http://lucene.apache.org

An additional step is the entity linkage applied to the identified candidates. For example, a person is described by an e-mail address in e-mails, but by the author’s name in a publication. Other causes for the appearance of different
references to the same entity are misspellings, the use of abbreviations, initials, or the actual change of the entity over time (e.g., the e-mail address of a person might change). Again, we exploit a component of the Beagle++ search system for producing information about the linkage.

Fig. 1. An overview of how the desktop content is extracted and given as input to the expert search component for indexing. A client application provides a user interface to the expert search service.

At this point, we have obtained a repository describing the content and metadata of desktop items. In the next section we explain how we can exploit this data and metadata for finding experts in the semantic desktop content.

2.2 Leveraging Metadata for People Search

In the Nepomuk system, the Expert Recommendation service⁵ aims at providing the user with a list of experts (i.e., people) on a given topic. The experts are selected among a list of persons referred to in the desktop. In order to do so, the component needs to extract, out of the RDF repository, some information about the content of documents and e-mails and also a list of expert candidates (see Figure 1).

5 http://dev.nepomuk.semanticdesktop.org/wiki/ExpertRecommender

Thanks to the Beagle++ system, relations between people and documents are identified and stored in the repository. Entity Linkage identifies references pointing to the same entity by gathering clues; for example, a person is described by an e-mail address in e-mails, but by the author’s name in a publication. In Beagle++, searching using a person’s surname retrieves publications
in which her surname appears as part of an author field as well as e-mails in which her e-mail address appears as part of the sender or receiver fields. This is obtained by linking together the objects that refer to the same real-world entities [20].

The expert search system we propose can leverage the extracted relations between documents and people as well as the linkage information between different representations (e.g., surname and e-mail address). The first step is to create an inverted index for documents: a vector representation of each publication, e-mail, and text-based resource on the desktop is created. Then, for each expert candidate referred to in the desktop, her position in the vector space is computed as a linear combination of the resources related to her, using the relation strength as weight. At this point, each candidate expert is placed into the space, and a query vector, together with a similarity measure (e.g., cosine similarity), can be used to retrieve a ranked list of experts. The fact that documents are indexed before candidates implies that the dimensions of the vector space are defined by the set of terms present in the desktop collection. This means that the topics of expertise that represent the candidates are those inferred from the documents.

Fig. 2. A client application for searching experts on the semantic desktop
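
The ranking procedure of Section 2.2 can be illustrated with a minimal sketch. It is not the Nepomuk implementation: the function names, the toy documents, the raw term-frequency weighting, and the whitespace tokenisation are all assumptions made for illustration, since the text does not specify them. The sketch only shows the three steps described above: documents become term vectors, each candidate becomes a weighted sum of the vectors of related documents, and candidates are ranked by cosine similarity to the query vector.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Term-frequency vector of a text (assumption: lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def candidate_vectors(doc_vectors, relations):
    """Place each candidate in the term space.

    relations is an iterable of (candidate, doc_id, strength) triples; a
    candidate's vector is the strength-weighted sum of related document vectors.
    """
    profiles = defaultdict(Counter)
    for candidate, doc_id, strength in relations:
        for term, weight in doc_vectors[doc_id].items():
            profiles[candidate][term] += strength * weight
    return dict(profiles)


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_experts(query, profiles):
    """Return (candidate, score) pairs sorted by decreasing similarity to the query."""
    q = tf_vector(query)
    scored = [(name, cosine(q, vec)) for name, vec in profiles.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy desktop: three documents and two candidates.
docs = {
    "d1": "ontology engineering and ontology alignment",
    "d2": "flight booking and hotel reservation",
    "d3": "information retrieval evaluation",
}
doc_vectors = {doc_id: tf_vector(text) for doc_id, text in docs.items()}
relations = [("alice", "d1", 1.0), ("alice", "d3", 0.5), ("bob", "d2", 1.0)]
profiles = candidate_vectors(doc_vectors, relations)

for name, score in rank_experts("ontology alignment", profiles):
    print(f"{name}\t{score:.3f}")
```

Because the term space is built from the documents alone, a query term that never occurs on the desktop contributes nothing to any candidate's score, which matches the limitation noted in the introduction.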
A client application can then use the Nepomuk Expert Recommendation service (which implements the system described in this paper) by providing a keyword query taken from the user. A screenshot of a possible client application is shown in Figure 2. In the top-left corner the user can provide a keyword query and choose to look for experts. In the central panel a ranked list of people is presented as the result of the query. In the right pane, resources related to the selected expert are shown.

3 Desktop Search Evaluation Datasets

Evaluating the effectiveness of desktop search algorithms is a difficult task because of the lack of standard test collections. The main problem in building such a test collection is the privacy concerns that data providers might have about sharing personal data. The privacy issue is a major one, as it impedes the diffusion of personal desktop data among researchers. Some solutions for overcoming these problems have been presented in previous work [11, 12].

In this section we describe two possible datasets for evaluating the effectiveness of finding experts using desktop content as evidence of expertise. One is a fictitious desktop dataset representing two hypothetical personas. This dataset has been manually created in the context of the Nepomuk project with the goal of providing a publicly available desktop dataset with no privacy concerns. At present, access to the actual data is still restricted. The second one is a set of real desktop data provided by 14 employees of a research center.

3.1 Fictitious Data

In order to obtain reproducible and comparable experimental results there is a need for a common test collection, that is, a set of resources, queries, and relevance assessments that are publicly available. In the case of PIM, the privacy issue of sharing personal data has to be faced.
To solve this issue, the team working on the Nepomuk project has created a collection of desktop items (i.e., documents, e-mails, contacts, calendar items, ...) for some imaginary personas representing hypothetical desktop users. In this paper we describe two desktop collections built in this context.

The first persona is called Claudia Stern⁶. She is a project manager and her interests are mainly about ontologies, knowledge management, and information retrieval. Her desktop contains 56 publications about her interests, 36 e-mails, 19 Word documents about project meetings and deliverables, 12 slide presentations, 17 calendar items, 2 contacts, and an activity log containing 122 actions, collected while a trip was being arranged (i.e., flight booking, hotel reservation, search for shopping places). These resources have been indexed using the Beagle++ system, obtaining a total of 22588 RDF triples which have been stored in the RDF repository.

6 http://dev.nepomuk.semanticdesktop.org/wiki/Claudia
7 http://dev.nepomuk.semanticdesktop.org/wiki/Dirk

The second persona is called Dirk Hagemann⁷. He works for the project that Claudia manages and his interests are similar to those of Claudia. His desktop