Enhancing Digital Libraries with TechLens+

Roberto Torres1,2, Sean M. McNee2, Mara Abel1, Joseph A. Konstan2, John Riedl2

Grupo de Banco de Dados Inteligentes1, Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS - Brazil, {rtorres, marabel}@inf.ufrgs.br

GroupLens Research2, Dept. of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455 - USA, {mcnee, konstan, riedl}@cs.umn.edu

ABSTRACT

The number of research papers available is growing at a staggering rate. Researchers need tools to help them find the papers they should read among all the papers published each year. In this paper, we present and experiment with hybrid recommender algorithms that combine Collaborative Filtering and Content-based Filtering to recommend research papers to users. Our hybrid algorithms combine the strengths of each filtering approach to address their individual weaknesses. We evaluated our algorithms through offline experiments on a database of 102,000 research papers, and through an online experiment with 110 users. For both experiments we used a dataset created from the CiteSeer repository of computer science research papers. We developed separate English and Portuguese versions of the interface and specifically recruited American and Brazilian users to test for cross-cultural effects. Our results show that users value paper recommendations, that the hybrid algorithms can be successfully combined, that different algorithms are more suitable for recommending different kinds of papers, and that users with different levels of experience perceive recommendations differently. These results can be applied to develop recommender systems for other types of digital libraries.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information filtering, retrieval models, search process.

H.3.7 [Information Storage and Retrieval]: Digital Libraries – systems issues, user issues.

General Terms

Algorithms, Experimentation, Human Factors

Keywords

Collaborative Filtering, Content-Based Filtering, Digital Libraries, Hybrid Recommender Systems

1. INTRODUCTION

According to the United States' National Science Foundation, more than half a million research papers were published in more than 1,900 journals worldwide in 1999. Since 1986, the number of papers published each year has increased at a rate of 1% per year [18]. If this trend continues, more than 10 million papers will be published in the next 20 years. Researchers find it difficult to select the papers they should read from among this growing abundance. Even within fairly narrow areas of interest, such as data mining and digital libraries, it is still impossible to cope with all of the papers published each year. The problem is even more challenging for interdisciplinary research, since the resulting papers are published in a wide variety of venues.

One solution is search engines like Google, which make it easy to find papers by author, title, or keyword. However, in order to find a paper with a search engine the researcher has to know or guess appropriate search keywords. How can users discover the existence of previously unknown papers that they would find valuable?

Our previous research showed that collaborative filtering-based recommender systems successfully recommended research papers to users [16]. Here we introduce a new experimental system, called TechLens+, to explore how two well-known techniques of recommender systems, namely collaborative filtering and content-based filtering, can help users sort through the abundance of available research and find papers of interest. The TechLens+ algorithms explore both the content of the paper and the social context of the paper, through the analysis of its citations to other papers. We believe that the algorithms we developed can be a significant aid in both current and emergent digital libraries.

We carry out this research in the context of CiteSeer, a well-known public repository of computer science research papers [3]. CiteSeer works by crawling the Web looking for research papers in a wide variety of formats, and parsing the research papers to determine the title, authors, and other information. CiteSeer uses a process called "automatic citation indexing" to automatically build a citation graph among the papers it discovers.

CiteSeer is a digital library in which the content is not created, selected, or edited by professional staff, but by automatic algorithms. We call a digital library of this type an emergent digital library. Emergent digital libraries are similar to other digital libraries in many respects: they have a well-defined corpus of material; that corpus is digital in nature, and hence may be distributed by digital means; and they have powerful cataloging and search technologies appropriate to their content. However, emergent digital libraries have a key difference: the quality of their content varies more widely than the quality of content in most digital libraries, because professional staff is not involved in collection and appraisal. Therefore, exploration and search techniques are needed that can seek quality and relevance of results beyond what keyword similarity can provide. Our research seeks to explore such techniques.
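The citation graph that automatic citation indexing produces can be pictured as a directed graph from citing papers to cited papers. The following minimal Python sketch, with made-up paper identifiers, only illustrates the data structure; it is not CiteSeer's implementation.

```python
# Hypothetical citation graph: each paper ID maps to the list of papers it cites.
# Paper IDs are made up for illustration; this is not CiteSeer's data model.
citation_graph = {
    "paper_A": ["paper_B", "paper_C"],  # paper_A cites paper_B and paper_C
    "paper_B": ["paper_C"],
    "paper_C": [],
}

def cited_by(graph, target):
    """Return the papers that cite `target`, i.e. follow the graph edges in reverse."""
    return [paper for paper, refs in graph.items() if target in refs]

print(cited_by(citation_graph, "paper_C"))  # ['paper_A', 'paper_B']
```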
1.1 Contributions

This research presents unique approaches to recommending research papers. In addition to our results, we provide two key contributions to the fields of recommender systems and digital libraries. First, we provide a set of algorithms that can be used for recommending research papers in many scientific domains. The algorithms were tested through both offline and online experiments and can be easily incorporated into existing digital libraries. Second, we provide an online experimental evaluation to assess users' perceptions about paper recommendations. The experiment surveyed users across many dimensions and our results can be used by other research projects.

The outline of this paper is as follows. We first discuss the related work both in digital libraries and recommender systems. We then talk about how to combine collaborative and content-based filtering in the domain of research papers. Next, we explain our algorithms and how they were tested in our experiments. Finally, we discuss the results and draw some conclusions.

2. RELATED WORK

2.1 Collaborative Filtering

Collaborative Filtering (CF) is one of the most successful techniques used in recommender systems. It has been used to recommend Usenet news [21], audio CDs [23], and research papers [16], among others. CF works by recommending items to people based on what other similar people have previously liked. CF creates neighborhoods of "similar" users (neighbors) for each user in the system and recommends an item to one user if her neighbors have rated it highly. CF is "domain independent" in that it performs no content analysis of the items in the domain. Rather, it relies on user opinions about the items to generate recommendations.

Despite being a successful technique in many domains, CF has its share of shortcomings [2]:

(i) First-rater problem: items need to be rated by at least one neighbor to be recommended, so an item cannot be recommended until someone rates it first.

(ii) Sparsity problem: in many domains, a user is likely to rate only a very small percentage of the available items. This can make it difficult to find agreement among individuals, since they may have little overlap in the set of items they have rated.

2.2 Content-Based Filtering

Content-Based Filtering (CBF) is also commonly used in recommender systems. Applied mostly in textual domains, such as news [11], CBF recommends items to a user if these items are similar in content to items the user has liked in the past. Many algorithms, most notably TF-IDF [22], have been used in CBF systems. CBF has many strengths, including the ability to generate recommendations over all items in the domain. CBF also has its shortcomings [2]:

(i) Content limitation in domains: in non-textual domains like movies and audio, current algorithms cannot successfully and reliably analyze item contents.

(ii) Analysis of quality and taste: subjective aspects of the item, such as style and quality of writing or authoritativeness of the author, are hard to analyze.

(iii) Narrow content analysis: CBF recommends items similar in content to the items rated in the past, and cannot produce recommendations for items that may have different but related content.

3. COMBINING CF and CBF

Taking a closer look at the characteristics of both CF and CBF, we can see that they are complementary. For example, CBF does not suffer from the first-rater problem as long as the content of a new item can be compared against all existing user profiles. In addition, CBF also does not suffer from sparsity, since every item in the system can be related to every user profile. On the other hand, CF does not suffer from content-dependency, since it can be applied to every domain in which humans can evaluate items. Also, CF uses quality and taste when recommending items. Finally, the serendipitous nature of CF guarantees that there is no over-specialization problem.

In our previous work, we explored CF recommenders in the domain of research papers [16]. We were able to generate recommendations by mapping the web of citations between papers into the CF user-item ratings matrix. In our mapping, a 'user' in this matrix is a paper and an 'item' is a citation. Thus, every paper 'votes' for the citations it references. This mapping does not suffer from the new-user problem, common to most techniques, because each 'user' always provides ratings in the form of citations.

Further, in previous work we found that different CF algorithms generated qualitatively different recommendations: for example, some recommendations were more novel than others. CBF was only used as a baseline comparison against the CF, and no hybrid recommender approaches were considered. This paper builds upon our previous work by exploring how to combine CF and CBF to generate recommendations for research papers. We propose a set of new hybrid algorithms that combine a TF-IDF CBF algorithm [22] with a k-nearest neighbor (User-User) CF algorithm from our previous work.

We hypothesize that CF and CBF can be successfully combined to produce recommendations of research papers. In line with results from our previous work, we also hypothesize that different hybrid algorithms might be more suitable for recommending different kinds of papers. Finally, we hypothesize that users with different levels of experience perceive recommendations differently due to their own background, needs, and expectations.

Many hybrid recommender systems have been successfully built in the past. P-Tango recommended news items by combining recommendations from CBF and CF recommenders together using a weighted-average function [6]. Fab recommended Web pages by choosing neighbors for CF-based recommendations using CBF-based user profiles [2]. A 'boosted' combination of CF and CBF was proposed in [17], where CBF calculations were used to augment the CF ratings matrix. PTV recommended TV guides using CF and CBF in parallel [7]. Finally, Woodruff developed six hybrid recommender algorithms that combined textual and citation information in order to recommend the next paper a user should read from within a single digital book [24]. In our approach, we developed a set of hybrid recommender algorithms and compared their performance to non-hybrid baseline recommenders over a corpus of computer science research papers in both online and offline experiments.
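As a concrete illustration of the mapping described above, in which every paper 'votes' for the citations it references, the following minimal Python sketch builds the binary ratings structure from a handful of hypothetical citation pairs. It only shows the idea; it is not the Suggest library's user-user CF implementation.

```python
from collections import defaultdict

# Hypothetical (citing paper, cited paper) pairs; identifiers are made up.
citations = [
    ("paper_A", "paper_X"),
    ("paper_A", "paper_Y"),
    ("paper_B", "paper_X"),
]

# Each citing paper acts as a CF "user" and each cited paper as an "item";
# a citation is treated as an implicit positive rating (a vote of 1).
ratings = defaultdict(dict)
for citing, cited in citations:
    ratings[citing][cited] = 1

def shared_votes(paper_a, paper_b):
    """Papers cited by both, the raw overlap a user-user CF similarity builds on."""
    return set(ratings[paper_a]) & set(ratings[paper_b])

print(shared_votes("paper_A", "paper_B"))  # {'paper_X'}
```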
Our algorithms follow the taxonomy of hybrid recommender algorithms first proposed by Burke [5]. In particular, we implemented algorithms that follow the two most straightforward and appealing of Burke's categories: "feature augmentation" and "mixed" recommenders. In both categories, the hybrid recommender is composed of two standalone non-hybrid recommender algorithms. In 'feature augmentation' hybrid recommenders, the results generated from the first algorithm are fed as inputs to the second algorithm. In 'mixed' hybrid recommenders, the two algorithms are run independently with the same input and their results are merged together.

3.1 Recommender Algorithms

Before describing our recommender algorithms, we have to be precise with our language. We will draw a subtle but important difference between a paper and a citation. A citation is a paper for which the text may not be available; a citation therefore is a pointer to a paper. On the other hand, a paper is a citation for which we also have the text. This distinction is important because many citations may be references to papers that we do not have in digital format.

We developed ten recommender algorithms; each algorithm receives a set of citations as input and generates an ordered list of citations as recommendations. Unless stated otherwise, the input set of citations is generated from the reference list of one input paper (the 'active' paper).

All algorithms make use of standalone CF and CBF recommendation engines. For our CF engine, we used the 'user-based' CF algorithm from the Suggest library, an implementation of standard CF algorithms [12]. For our CBF engine, we used TF-IDF from the Bow Toolkit, a library that performs document retrieval [15].

3.2 Non-hybrid Algorithms

These algorithms run either CF or CBF. They are used as baseline comparisons for the hybrid algorithms.

3.2.1 Only Collaborative Filtering

We developed two CF-only algorithms: Pure-CF and Denser-CF. Pure-CF is the standard k-nearest neighbor CF algorithm [9]. It takes the citations of the active paper as input and gives a list of recommended citations as output. Denser-CF augments the input list of citations by adding the citations in the papers cited by the active paper to the input list.

3.2.2 Only Content-Based Filtering

Three content-based algorithms were built: Pure-CBF, CBF-Separated, and CBF-Combined. They are all based on TF-IDF and use Porter's stemming and stopword elimination [15]. All content analysis was performed on paper titles and abstracts.

Pure-CBF searches for similar papers based on TF-IDF similarity and recommends the most similar papers. CBF-Separated is an extension of Pure-CBF that explores not only the text of the paper but also the text of the papers it cites. For instance, for the paper P the algorithm generates a list of similar papers LP, and for every citation (C1, C2, …, Cn) of the paper P, it generates a list of similar papers (LC1, LC2, …, LCn). All lists are merged into one single list, sorted based on the returned similarity coefficient. Papers are discarded if they have already been added to the resulting list. Papers with the highest similarity scores are recommended. Recommending papers similar to the citations of a given paper should broaden the search space, leading to more diverse recommendations.

Finally, CBF-Combined is an extension of CBF-Separated. Instead of generating one list of similar papers for every citation, this algorithm merges the text of the paper and the text of all of the papers it cites together into one large chunk of text. This larger text is submitted as the query to search for similar papers. The most similar papers are then recommended. In this algorithm, the presence of more words in a single query to the CBF engine should more effectively return similar papers based on content.

3.3 Hybrid Algorithms

Each hybrid algorithm is composed of two independent modules: a CF algorithm and a CBF algorithm. Each module is responsible for taking an input and generating recommendations. The CBF module uses the text of the active paper as input, and the CF module uses the citations from the active paper as input. Hybrid algorithms following Burke's feature augmentation model run these modules in sequence, with either CBF (CBF-Separated or CBF-Combined) or CF (Pure-CF) first. The output of the first module (up to 20 papers) is used as input to the second. In contrast, hybrid algorithms following Burke's mixed model run the two modules in parallel and merge the recommendations together into a final recommended list.

3.3.1 CF – CBF Separated

In this algorithm, recommendations from Pure-CF are used as input to CBF-Separated. For every recommendation from CF, the CBF module recommends a set of similar papers (up to 80). Because the recommendations generated by the CF module are ordered, the recommendations generated by the CBF module have to be scaled by this ordering. Thus, these CBF recommendations are weighted, with the first set (generated from the top CF recommendation) receiving weight 1 and the following sets' weights decreased accordingly. The similarity scores of the CBF recommendations are multiplied by these weights and sorted.

3.3.2 CF – CBF Combined

This algorithm is similar to CF-CBF Separated. However, instead of recommending a set of similar papers for every recommendation received from the CF module, the CBF module aggregates the text of all of the recommendations given by CF and uses this large chunk of text as its input to CBF. The results are sorted by similarity.

3.3.3 CBF Separated – CF

Here, CBF-Separated generates recommendations for the active paper. These recommendations are used to augment the active paper's set of citations. The active paper with its modified set of citations is used as input to Pure-CF to generate recommendations.

3.3.4 CBF Combined – CF

This algorithm is identical to CBF Separated – CF, except that CBF-Combined is used in place of CBF-Separated.

3.3.5 Fusion

Fusion, our 'mixed' hybrid algorithm, runs the two recommender modules in parallel and generates a final recommendation list by merging the results together. The generation of the final recommendation list is as follows: every recommendation that is present in both modules' result lists is added to the final list with a rank score. This score is the summation of the recommendation's ranks in the two original lists. The final recommendation list is sorted based on these scores. Therefore, a paper that was ranked 3rd by the CF module and 2nd by the CBF module would receive a score of 5. The lower the score, the closer to the top an item goes. Recommendations that do not appear on both lists are appended, alternately, to the end of the final list. The Fusion recommendation process is shown in Figure 1. A similar algorithm has been developed by Cotter [7].

Figure 1. Fusion Algorithm
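The merging step of Fusion can be sketched in Python as follows. The lists and paper identifiers are hypothetical, and details such as tie-breaking among equal scores are our assumptions rather than the paper's exact implementation.

```python
def fuse(cf_list, cbf_list):
    """Sketch of Fusion's merge: papers on both ranked lists are scored by the
    sum of their 1-based ranks (lower is better); the remaining papers are
    appended alternately from each list."""
    cf_rank = {p: i + 1 for i, p in enumerate(cf_list)}
    cbf_rank = {p: i + 1 for i, p in enumerate(cbf_list)}

    both = set(cf_rank) & set(cbf_rank)
    fused = sorted(both, key=lambda p: cf_rank[p] + cbf_rank[p])

    # Papers recommended by only one module are appended, alternating lists.
    cf_only = [p for p in cf_list if p not in both]
    cbf_only = [p for p in cbf_list if p not in both]
    for i in range(max(len(cf_only), len(cbf_only))):
        if i < len(cf_only):
            fused.append(cf_only[i])
        if i < len(cbf_only):
            fused.append(cbf_only[i])
    return fused

# Hypothetical example: paper_C is ranked 3rd by CF and 2nd by CBF (score 5).
print(fuse(["paper_A", "paper_B", "paper_C"], ["paper_D", "paper_C"]))
# ['paper_C', 'paper_A', 'paper_D', 'paper_B']
```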
4. EXPERIMENTS

To test the utility of our algorithms, we ran two experiments: an offline experiment to evaluate the ability of the algorithms to recommend papers known to be related to a selected paper, and an online experiment to evaluate users' perceptions of the quality of the recommendations.

4.1 Dataset

To test our algorithms we created a dataset with papers extracted from CiteSeer [3]. This dataset initially had over 500,000 papers and 2 million citations. We limited this dataset in two ways. First, we removed papers that cited fewer than 3 other papers, as we believe these loosely connected papers introduced noise to the dataset. Second, we removed citations for which we did not have the full text of the paper in our dataset. We performed this trimming so that both CF and CBF would be able to analyze every item in our dataset. The pruned dataset has 102,295 papers with an average of 14 connections per paper, where the number of connections is the number of citations a paper makes plus the number of papers that cite it.

4.2 User Profiling

In order to recommend papers to users we need to have a model of the user's interests; we need a user profile. This profile represents the user's tastes and opinions about the papers that she has read. Such a profile could contain both long-term and short-term user interests, and the profile could gather data either explicitly or implicitly. See Table 1 for a listing of the advantages and disadvantages of creating profiles in these different ways.

An example of gathering information implicitly for long-term interests is to build the profile from all of the papers that the user has read in the past. This can be accomplished by building a system to monitor what papers the user downloads and reads, and silently incorporating all papers into the user's profile. This approach has the benefit of knowing everything a user reads and when the user read it, but it cannot know how closely the user read it. Thus, over time the user's profile could become bloated with papers that the user may have quickly skimmed and discarded. These false positives can erode the user's profile and reduce recommendation quality.

A way to combat this problem is to have the user explicitly state which papers are to be added to the user profile. Moreover, the user could also provide extra information such as a rating, commentary, or classification of the paper (e.g., by research area, field, etc.). The user could also review her profile to make sure that it accurately represented her. This explicitly gathered information would help the system generate personalized recommendations. It is unlikely, however, that a user would be willing to invest the time and energy needed to maintain such a user profile.

An example of gathering information implicitly for short-term interests is to build a paper-monitoring system similar to the ones previously described, with one large difference: this system would only remember the last paper a user downloaded and read. This one paper would be used as the current user profile and would be the basis for generating recommendations.

Finally, we could gather information explicitly for short-term interests. In such a system, the user would choose one paper to be his short-term profile; the recommender would use this paper to generate recommendations. We will use this approach in this research, as it is not only the most straightforward and simple to understand, but it is also the only approach that does not require creating and installing a system on users' computers to monitor the papers they are downloading and reading.

Table 1: User Profile Alternatives

Approach | Gathering / Users' Interests | Information Used | Advantages | Disadvantages
All Papers | Implicit, long-term interests | All papers read in the past | Keeps track of user's reading habits over time | Privacy issues, bloated profiles, requires monitoring system
All Papers by Field | Explicit, long-term interests | All papers read in the past | Filters recommendations based on user's field interests | Requires monitoring system, user must manually adjust profile
One Paper | Explicit, short-term interests | Only one paper of interest | Does not require a monitoring system | Does not keep track of reading habits over time
One Paper | Implicit, short-term interests | Only one paper of interest | Requires limited system | Privacy issues, does not track habits over time
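To make the two pruning steps of Section 4.1 concrete, here is a minimal Python sketch. The data structures (a dictionary from paper IDs to cited IDs and a set of papers with available full text) are our own assumptions for illustration, not the format of the actual CiteSeer dataset.

```python
def prune_dataset(cites, full_text_ids):
    """Sketch of the dataset trimming from Section 4.1:
    1) drop papers that cite fewer than 3 other papers;
    2) drop citations whose full text is not in the dataset."""
    kept = {p: refs for p, refs in cites.items() if len(refs) >= 3}
    kept = {p: [c for c in refs if c in full_text_ids] for p, refs in kept.items()}
    return kept

def connections(cites, paper):
    """Connections = citations the paper makes plus papers that cite it."""
    outgoing = len(cites.get(paper, []))
    incoming = sum(paper in refs for refs in cites.values())
    return outgoing + incoming
```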
5. OFFLINE EXPERIMENT

In this experiment, we randomly removed one citation from the active paper and then checked whether our algorithms could recommend that removed citation. This "leave one out" methodology has been frequently used in other recommender systems' offline experiments [4, 16].

We divided the dataset into training and test datasets at a 90% to 10% ratio. Ten different training and testing datasets were created for 10-fold cross validation. For each trial, every paper in the test dataset had one randomly removed citation.

Although it has been successfully used in other research, this method of experimentation has some limitations. The recommender algorithms could recommend a paper that did not exist at the time the active paper was published. To handle that, we filtered out recommendations with a publication year later than that of the active paper. The algorithms could also recommend papers that are very similar to or even better than the removed citation, possibly "diminishing" the algorithms' performance. Although this is a possibility, we still expect the removed citation to be recommended.

5.1 Metrics

There are two metrics used in the offline experiment: hit-percentage and rank. We define "hit-percentage" as the percentage of the time the recommender algorithm correctly recommended the removed citation anywhere in the recommendation list. Similarly, we define "rank" as the location where the removed citation was found in the recommendation list. We chose these metrics because of how they reflect real-world use of recommenders: users implicitly trust recommenders to generate meaningful, correct recommendations, and the order of recommended items matters to them. So, we validated our algorithms by first examining whether they can generate the 'correct' recommendation (hit-percentage metric) and then how well they rank that recommendation (rank metric).

Since both measures are important, we combined them. We segmented our hit-percentage analysis into bins based on rank, where lower is better. Thus, a recommendation in the top-10 bin is better than a recommendation in the top-30 bin. For example, if an algorithm had a top-10 hit-percentage of 25%, then 25% of the time that algorithm recommended the removed citation at a rank of 10 or better (i.e., lower). Recommendations beyond the 40th position are considered "all" because users are not likely to see items recommended beyond this position. The "all" bin is equivalent to the overall hit-percentage of the algorithm.

To select the best algorithms, we focused on two particular criteria. As we think that users prefer to see the best recommendations first, our first criterion is the algorithm's performance in the top-10 hit-percentage. The second criterion is the algorithm's ability to recommend the removed citation independently of its rank; it is measured by the "all" hit-percentage. As a possible tie-breaker, the algorithm's top-1 performance is also examined.

5.2 Offline Results

Based on the above criteria, the best hybrid algorithms were Fusion, CBF Combined–CF, and CF-CBF Separated. The best non-hybrid algorithms were CBF-Separated and Pure-CF. CBF-Separated was expected to perform better than Pure-CBF because it widens the search space by using the text of the active paper's citations. However, Pure-CF was not expected to perform better than Denser-CF.

Fusion performed significantly better than every other algorithm at both the top-1 (28%) and "all" (78%) hit-percentages. Pure-CF also did well, usually having the best or second best performance among all bins. The results of the five best algorithms are shown in Figure 2.

The other hybrid algorithms did not perform well. CBF Separated–CF was slightly inferior to CBF Combined–CF. On the other hand, CF–CBF Combined performed very poorly compared to CF–CBF Separated, with an "all" hit-percentage of only 0.1%. More research is needed to explore what happened here.

5.3 Offline Discussion

The poor performance of Denser-CF was a surprising result, because we thought that the denser input used in this algorithm should improve the quality of recommendations, since sparsity is a known CF problem. But if, for example, we assume each paper has 7 citations, then Denser-CF would add 49 more citations to the input set; these additions might not be closely enough related to the active paper, and thus generate only noise.

To help us understand the behavior of the algorithms, we also looked at recommendation coverage. Coverage is the percentage of items for which the system could generate a recommendation [10]. All of the algorithms had 100% coverage, except Pure-CF, which had coverage of 93%. This high coverage in the hybrid algorithms is due to the presence of CBF as an algorithm step, and it is a significant advantage of building a hybrid recommender.

The results from the offline experiment were used to select algorithms for the online experiment. We chose to use the following algorithms online: CBF-Separated and Pure-CF as non-hybrids, and CF–CBF Separated, CBF Combined–CF, and Fusion as the hybrid algorithms.

Figure 2: Results for the 5 best algorithms (offline hit-percentage at Top-1, Top-10, Top-20, Top-30, Top-40, and All for Pure-CF, CBF-Separated, CF-CBF Separated, CBF Combined-CF, and Fusion)

6. ONLINE EXPERIMENT

The online experiment aimed to assess users' perceptions of the recommendations they received.

6.1 Experimental Design

We developed an online experimental system, called TechLens+, consisting of a six-page Web-based experiment in which users evaluated recommendations of research papers. Users were invited to participate anonymously through links at the Penn State version of CiteSeer [20] and EBizSearch [8]. Users were also invited through messages posted in e-mail lists, such as internal lists of the Computer Science departments at: