Context-aware Citation Recommendation

Qi He (Penn State University, qhe@ist.psu.edu), Jian Pei (Simon Fraser University, jpei@cs.sfu.ca), Daniel Kifer (Penn State University, dan+www10@cse.psu.edu), Prasenjit Mitra (Penn State University, pmitra@ist.psu.edu), C. Lee Giles (Penn State University, giles@ist.psu.edu)

ABSTRACT

When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? Do you wish to have a recommendation system which can recommend a small number of good candidates for every place where you want to make some citations? In this paper, we present our initiative of building a context-aware citation recommendation system. High quality citation recommendation is challenging: not only should the recommended citations be relevant to the paper under composition, but they should also match the local contexts of the places where citations are made. Moreover, it is far from trivial to model how the topic of the whole paper and the contexts of the citation places should affect the selection and ranking of citations. To tackle the problem, we develop a context-aware approach. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively. Moreover, it can recommend a set of citations for a paper with high quality. We implement a prototype system in CiteSeerX. An extensive empirical evaluation in the CiteSeerX digital library against many baselines demonstrates the effectiveness and the scalability of our approach.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Design, Experimentation

Keywords

Bibliometrics, Context, Gleason's Theorem, Recommender Systems

1. INTRODUCTION

When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? For example, the left part of Figure 1 shows a segment of a query manuscript containing some citation placeholders (placeholders for short) marked as "[?]", where citations should be added.

Figure 1: A manuscript with citation placeholders and recommended citations. The text is from Section 2.2.

In order to fill in those citation placeholders, one needs to search the relevant literature and find a small number of proper citations. Searching for proper citations is often a labor-intensive task in research paper composition, particularly for junior researchers who are not familiar with the very extensive literature. Moreover, the volume of research undertaken and the amount of information available make citation search hard even for senior researchers. Do you wish to have a recommendation system which can recommend a small number of good candidates for every place where you want to make some citations? High quality citation recommendation is a challenging problem for many reasons.

For each citation placeholder, we can collect the surrounding words as the context of the placeholder. One may think we can use some keywords in the context of a placeholder to search a literature search engine like Google Scholar or CiteSeerX to obtain a list of documents as the candidates for citations.
However, such a method, based on keyword matching, is often far from satisfactory. For example, using the query "frequent itemset mining", one may want to search for the first paper proposing the concept of frequent itemset mining, e.g., [1]. However, Google Scholar returns a paper about frequent closed itemset mining published in 2000 as the first result, and a paper on privacy preserving frequent itemset mining as the second choice. [1] does not appear on the first page of the results. CiteSeerX also lists a paper on privacy preserving frequent itemset mining as the first result. CiteSeerX fails to return [1] on the first page, either.

One may wonder, since we can model citations as links from citing documents to cited ones, can we use graph-based link prediction techniques to recommend citations? Graph-based link prediction techniques often require a user to provide sample citations for each placeholder, and thus shift much of the burden to the user. Moreover, graph-based link prediction methods may encounter difficulties in making proper citations across multiple communities, because a community may not be aware of the related work in some other community.
A detailed natural language processing analysis of the full text of each document may help to make good citation recommendations, but unfortunately it incurs serious scalability issues. There may be hundreds of thousands of papers that need to be compared with the given manuscript. Thus, natural language processing methods cannot be straightforwardly scaled up to large digital libraries and electronic document archives.

The recommended citations for placeholders should satisfy two requirements. First, a citation recommendation needs to be explainable. Our problem is different from generating a bibliography list for a paper, where a recommendation should discuss some ideas related to the query manuscript. In our problem, a recommended citation for a placeholder in a query manuscript should be relevant and authoritative with respect to the particular idea that is being discussed at that point in the query manuscript. Different placeholders in the same query manuscript may need different citations.

Second, the recommendations for a manuscript need to take into account the various ideas discussed in the manuscript. For example, suppose we are asked to recommend citations for a query manuscript in which one section discusses mixture models and another section discusses nonparametric distributions. Citations to nonparametric mixture models may be ranked high since they are related to both sections in the same manuscript.

In summary, citation recommendation for placeholders (and for the overall bibliography) is a challenging task. The recommendations need to consider many factors: the query manuscript as a whole, the contexts of the citation placeholders individually and collectively, and the literature articles individually. We need to construct an elegant mathematical framework for relevance and develop an efficient and scalable implementation.

In this paper, we present an effective solution to the problem of citation recommendations for placeholders in query manuscripts. Our approach is context-aware, where a context is a snippet of the text around a citation or a placeholder. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively, which is the innovative part of the paper. Moreover, it can recommend a set of citations for a paper with high quality.

In addition to the theoretical contribution, we also implement a prototype system in CiteSeerX. Figure 2 shows a real example in our demo system, where a user submits a citation context with a placeholder and the title/abstract. Among the top 6 results, the 1st, 4th, 5th and 6th ones are proper recommendations relevant and new to the manuscript; the 2nd one is the original citation; the 3rd one is a similar but irrelevant recommendation.

Figure 2: A demo of our context-aware citation recommendation system.

We also present an extensive empirical evaluation in the CiteSeerX digital library against many baselines. Our empirical study demonstrates the effectiveness and the scalability of our approach.

The rest of this paper is organized as follows. We discuss related work in Section 2. We formalize the problem in Section 3. We discuss candidate retrieval methods in Section 4. We present our context-aware relevance model for ranking in Section 5. We report our empirical evaluation in Section 6. Section 7 concludes the paper.
2. RELATED WORK

In this section, we discuss the related work on document recommendation, topic-based link prediction, and analysis of citation contexts.

2.1 Document Recommendation Systems

There are some previous efforts on recommending a bibliography list for a manuscript, or recommending papers to reviewers. The existing techniques generally rely on a user profile or a partial list of citations.

Basu et al. [4] focused on recommending conference paper submissions to reviewers based on paper abstracts and reviewer profiles. Reviewer profiles are extracted from the Web. This is a specific step in a more general problem known as the Reviewer Assignment Problem, surveyed by Wang et al. [26]. Chandrasekaran et al. [6] presented a technique to recommend technical papers to readers whose profile information is stored in CiteSeer. A user's publication records are used to model her profile. User profiles and documents are represented as hierarchical concept trees with predefined concepts from the ACM Computing Classification System. The similarity between a user profile and a document is measured by the weighted tree edit distance. Our work can also be seen as a profile-based system, where a query manuscript is a profile. However, our system uses richer information than just predefined concepts or paper abstracts.

Shaparenko and Joachims [22] proposed a technique based on language modeling and convex optimization to recommend documents. For a large corpus, the k most similar documents based on cosine similarity are retrieved. However, similarity based on full text is too slow for large digital libraries. Furthermore, according to our experiments, similarity based on document abstracts results in poor recall (cf. Table 1).

Some previous studies recommend citations for a manuscript already containing a partial list of citations. Specifically, given a document d and its partial citation list r, those studies try to recover the complete citation list, denoted by r' ⊃ r.
Collaborative filtering techniques have been widely applied. For example, McNee et al. [16] built various rating matrices, including an author-citation matrix, a paper-citation matrix, and a co-citation matrix. Papers which are often co-cited with citations in r are potential candidates. Zhou et al. [27] propagated the positive labels (i.e., the existing citations in r) in multiple graphs, such as the paper-paper citation graph, the author-paper bipartite graph, and the paper-venue bipartite graph, and learned the labels of the remaining documents for a given testing document in a semi-supervised manner. Torres et al. [25] used a combination of context-based and collaborative filtering algorithms to build a recommendation system, and reported that the hybrid algorithms performed better than the individual ones. Strohman et al. [23] experimented with a citation recommendation system where the relevance between two documents is measured by a linear combination of text features and citation graph features. They concluded that similarity between bibliographies and the Katz distance [15] are the most important features. Tang and Zhang [24] explored recommending citations for placeholders to a very limited extent. In particular, a user must provide a bibliography with papers relevant to each citation context, as this information is used to compute features in the hidden layer of a Restricted Boltzmann Machine before predictions can be made. We feel this requirement negates the need for requesting recommendations.

In our study, we do not require a partial list of citations, since creating such a list for each placeholder shifts most of the burden to the user. Thus, we can recommend citations both globally (i.e., for the bibliography) and also locally (i.e., for each placeholder, the innovative part of this paper).

2.2 Topic-based Citation Link Prediction

Topic models are unsupervised techniques that analyze the text of a large document corpus and provide a low-dimensional representation of documents in terms of automatically discovered and comprehensible "topics". Topic models have been extended to handle citations as well as text, and thus can be used to predict citations for bibliographies. The aforementioned work of Tang and Zhang [24] fits in this framework. Nallapati et al. [18] introduced a model called Pairwise-Link-LDA which models the presence or absence of a link between every pair of documents and thus does not scale to large digital libraries. Nallapati et al. [18] also introduced a simpler model that is similar to the work of Cohn and Hofmann [7] and Erosheva et al. [9]. Here citations are modeled as a sample from a probability distribution associated with a topic. Thus, a paper can be associated with topics when it is viewed as a citation. It can also be associated with topics from the analysis of its text. However, there is no mechanism to enforce consistency between the topics assigned in those two ways. In general, topic models require a long training process because they are typically trained using iterative techniques such as Gibbs sampling or variational inference. In addition, they must be retrained as new documents are added.

2.3 Citation Context Analysis

The prior work on analyzing citation contexts mainly belongs to two groups. The first group tries to understand the motivation functions of an existing citation.
For example, Aya et al. [2] built a machine learning algorithm to automatically learn the motivation functions of citations (e.g., compare, contrast, use of the citation, etc.) by using citation-context-based features. The second group tries to enhance topical similarity between citations. For example, Huang et al. [11] observed that citation contexts can effectively help to avoid "topic drifting" when clustering citations into topics. Ritchie [21] extensively examined the impact of various citation context extraction methods on the performance of information retrieval.

The previous studies clearly indicate that citation contexts are a good summary of the contributions of a cited paper and clearly reflect the information needs (i.e., motivations) of citations. Our work moves one step ahead to recommend citations according to contexts.

3. PROBLEM DEFINITION AND ARCHITECTURE

In this section, we formulate the problem of context-based citation recommendation, and discuss the preprocessing step.

3.1 Problem Definition

Let d be a document, and D be a document corpus.

Definition 3.1. In a document d, a context c is a bag of words. The global context is the title and abstract of d. The local context is the text surrounding a citation or placeholder. If document d1 cites document d2, the local context of this citation is called an out-link context with respect to d1 and an in-link context with respect to d2.

A user can submit either a manuscript (i.e., a global context and a set of out-link local contexts) or a few sentences (i.e., an out-link local context) as the query to our system. There are two types of citation recommendation tasks, which arise in different application scenarios.

Definition 3.2 (Global Recommendation). Given a query manuscript d without a bibliography, a global recommendation is a ranked list of citations in a corpus D that are recommended as candidates for the bibliography of d.

Note that different citation contexts in d may express different information needs. The bibliography candidates provided by a global recommendation should collectively satisfy the citation information needs of all out-link local contexts in the query manuscript d.

Definition 3.3 (Local Recommendation). Given an out-link local context c∗ with respect to d, a local recommendation is a ranked list of citations in a corpus D that are recommended as candidates for the placeholder associated with c∗.

For local recommendations, the query manuscript d is an optional input and it is not required to already contain a representative bibliography.

To the best of our knowledge, global recommendations have only been tackled by document-citation graph methods (e.g., [23]) and topic models (e.g., [24, 18]). Context-aware approaches have not been considered for global or local recommendations (except in a limited case where a bibliography with papers relevant to each citation context is required as the input).
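To make the definitions concrete, the following is a minimal sketch of how a query and its contexts could be represented, with the two tasks as function signatures. The class and function names are illustrative assumptions, not part of the paper; a context is simply a bag of words, as in Definition 3.1.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A context is a bag of words (Definition 3.1); here a word -> count map.
Context = Dict[str, int]

@dataclass
class QueryManuscript:
    """A query: its global context (title + abstract) and its out-link
    local contexts, one per citation placeholder."""
    global_context: Context
    out_link_contexts: List[Context] = field(default_factory=list)

def global_recommendation(d: QueryManuscript, corpus_ids: List[str]) -> List[str]:
    """Definition 3.2: rank corpus documents as candidates for the
    bibliography of d (ranking logic omitted in this sketch)."""
    raise NotImplementedError

def local_recommendation(c_star: Context, corpus_ids: List[str]) -> List[str]:
    """Definition 3.3: rank corpus documents as candidates for the
    placeholder associated with the out-link context c_star."""
    raise NotImplementedError
```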
Figure 3: Context-oblivious methods for global recommendation.

3.2 Preprocessing

Our proposed context-based citation recommendation system can take two types of inputs: a query manuscript d1, or just a single out-link local context c∗. We preprocess the query manuscript d1 by extracting its global context (i.e., the title and abstract) and all of its out-link local contexts. Extracting the local contexts from d1 is not a trivial task. Ritchie [21] conducted extensive experiments to study the impact of different lengths of local citation contexts on information retrieval performance, and concluded that fixed window contexts (e.g., a size of 100 words) are simple and reasonably effective. Thus, before removing all stop words, for each placeholder we extract the citation context by taking 50 words before and 50 words after the placeholder (a minimal sketch of this extraction is given at the end of Section 4.1 below). This preprocessing is efficient. For a PDF document of 10 pages and 20 citations, preprocessing takes on average less than 0.1 seconds.

The critical steps of the system are: (1) quickly retrieving a large candidate set which has good coverage over the possible citations, and (2) for each placeholder associated with an out-link local context and for the bibliography, ranking the citations by relevance and returning the top K. The next two sections focus on these critical steps.

4. THE CANDIDATE SET

Citation recommendation systems, including our system and especially those that compute intricate graph-based features [23, 27], first quickly retrieve a large candidate set of papers. Then, features are computed for the papers in this set and ranking is performed. This is done solely for scalability. Techniques for retrieving this candidate set are illustrated in Figures 3 and 4. Here, the query manuscript is represented by a partially shaded circle. Retrieved documents are represented as circles with numbers corresponding to the numbered items in the lists of retrieval methods discussed below. In this section, we discuss two kinds of methods for retrieving a candidate set. Those methods will be evaluated in Section 6.2.

4.1 Context-oblivious Methods

The context-oblivious methods do not consider local contexts in a query manuscript that has no bibliography. For a query manuscript d1, we can retrieve:

1. The top N documents with abstract and title most similar to d1. We call this method GN (e.g., G100, G1000).

2. The documents that share authors with d1. We call this method Author.

3. The papers cited by documents already in a candidate set generated by some other method (e.g., GN or Author). We call this method CitHop.

4. The documents written by the authors whose papers are already in a candidate set generated by some other method. We call this method AuthHop.

Figure 4: Context-aware methods. The single-context-based method is for local recommendation in the absence of the query manuscript d1. The hybrid method is for global recommendation and for local recommendation in the presence of d1, where the final candidate set is the union of the candidate sets derived from each local out-link context in d1.

Note that retrieval based on the similarity between the entire text content of documents would be too slow. Thus, these methods appear to provide a reasonable alternative. However, the body of an academic paper covers many ideas that would not fit in its abstract, and so many relevant documents will not be retrieved. Retrieval by author similarity will add many irrelevant papers, especially if the authors have a broad range of research interests and if CitHop or AuthHop is used to further expand the candidate set. Context-aware methods can avoid such problems.
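As promised in Section 3.2, here is a minimal sketch of the fixed-window context extraction. The placeholder marker, the whitespace tokenization, and the helper names are assumptions for illustration; the paper only specifies a window of 50 words on each side of a placeholder, taken before stop-word removal.

```python
import re

# Illustrative assumption: placeholders are marked "[?]" and existing
# citations look like "[12]"; the paper does not specify exact markers.
PLACEHOLDER_PATTERN = re.compile(r"\[(\?|\d+)\]")

def extract_local_contexts(text, window=50):
    """Return one bag of words (here, a list of tokens) per citation
    placeholder, taking `window` words before and after the marker."""
    tokens = text.split()  # simple whitespace tokenization (assumption)
    contexts = []
    for i, tok in enumerate(tokens):
        if PLACEHOLDER_PATTERN.fullmatch(tok):
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            contexts.append(before + after)
    return contexts

# Example usage on a toy manuscript fragment.
fragment = "Frequent itemset mining was first proposed in [?] and later extended ..."
for ctx in extract_local_contexts(fragment):
    print(len(ctx), ctx[:5])
```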
4.2 Context-aware Methods

The context-aware methods help improve the coverage of recommendations by considering local contexts in the query manuscript. In a query manuscript d1, for each context c∗, we can retrieve:

5. The top N papers whose in-link contexts are most similar to c∗. We call this method LN (e.g., L100).

6. The papers containing the top N out-link contexts most similar to c∗ (these papers cite papers retrieved by LN). When used in conjunction with LN, we call this method LCN (e.g., LC100).

We found that method LCN is needed because, frequently, a document in a digital library may have an out-link context that describes how it differs from related work (e.g., "prior work required full bibliographies but we do not"). Thus, while an out-link context usually describes a cited paper, sometimes it may also describe the citing paper, and this description may be what best matches an out-link context c∗ from a query manuscript.

The above six methods can be combined in some obvious ways. For example, (L100+CitHop)+G1000 is the candidate set formed by the following process: for each context c∗ in a query manuscript d1, add the top 100 documents with in-link local context most similar to c∗ (i.e., L100) and all of their citations (i.e., CitHop), and then add the top 1000 documents with abstract and title most similar to d1 (i.e., G1000).
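To illustrate how these candidate sets compose, the following sketch builds the (L100+CitHop)+G1000 set described above. The index interfaces (similar_by_inlink_context, similar_by_global_context, cited_by) are hypothetical names standing in for whatever retrieval backend is available; the paper does not prescribe a particular API.

```python
from typing import Callable, Iterable, List, Set

def hybrid_candidate_set(
    out_link_contexts: Iterable[List[str]],   # one bag of words per placeholder in d1
    global_context: List[str],                # title + abstract of d1
    similar_by_inlink_context: Callable[[List[str], int], List[str]],  # hypothetical LN lookup
    similar_by_global_context: Callable[[List[str], int], List[str]],  # hypothetical GN lookup
    cited_by: Callable[[str], List[str]],     # hypothetical: citations of a corpus document
) -> Set[str]:
    """Sketch of (L100+CitHop)+G1000: union over all out-link contexts of the
    top-100 in-link matches plus their citations, then the top-1000 globally
    similar documents."""
    candidates: Set[str] = set()
    for ctx in out_link_contexts:
        top_l100 = similar_by_inlink_context(ctx, 100)   # L100
        candidates.update(top_l100)
        for doc_id in top_l100:                          # CitHop: one hop along citations
            candidates.update(cited_by(doc_id))
    candidates.update(similar_by_global_context(global_context, 1000))  # G1000
    return candidates
```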
5. MODELING CONTEXT-BASED CITATION RELEVANCE

In this section, we propose a non-parametric probabilistic model to measure the context-based (and overall) relevance between a manuscript and a candidate citation, for ranking retrieved candidates. To improve scalability, we use an approximate inference technique which yields a closed form solution. Our model is general and simple, so that it can be used to efficiently and effectively measure the similarity between any two documents with respect to certain contexts or concepts in information retrieval.

5.1 Context-Based Relevance Model

Recall that a query manuscript d1 can have a global context $c_1$ (e.g., a title and abstract, which describes the problem to be addressed) and out-link local contexts $c_2, \ldots, c_{k_1}$ (e.g., text surrounding citations) which compactly express ideas that may be present in related work. A document d2 in an existing corpus D has a global context $b_1$ (e.g., title and abstract) and local in-link contexts $b_2, \ldots, b_{k_2}$ (e.g., text that is used by papers citing d2) which compactly express ideas covered in d2. In this section, we describe a principled and consistent way of measuring sim(d1, d2), defined as the overall relevance of d2 to d1, and sim(d1, d2; c∗), defined as the relevance of d2 to d1 with respect to a specific context c∗ (in particular, c∗ could be any of the out-link contexts $c_i$). Our techniques are based on Gleason's Theorem specialized to finite dimensional real vector spaces.

Theorem 5.1 (Gleason [10]). For an n-dimensional real vector space V (with n ≥ 3), let p be a function that assigns a number in [0, 1] to each subspace of V such that $p(v_1 \oplus v_2) = p(v_1) + p(v_2)$ whenever $v_1$ and $v_2$ are orthogonal subspaces (here $v_1 \oplus v_2$ denotes the linear span of $v_1$ and $v_2$) and $p(V) = 1$. Then $p(v) = \mathrm{Trace}(T P_v)$, where $P_v$ is the projection matrix for subspace v (e.g., $P_v w$ is the projection of vector w onto the subspace v), and T is a density matrix, i.e., a symmetric positive semidefinite matrix whose trace is 1.

Note that van Rijsbergen [20] proposed using Gleason's Theorem as a model for information retrieval, and this was also extensively studied by Melucci [17]. However, our framework is substantially different from their proposals since it relies on comparisons between density matrices.

Let W be the set of all words in our corpus and let |W| be the number of words. The vector space V is |W|-dimensional, with one dimension for each word. In this framework, atomic concepts are represented as one-dimensional vector spaces, and we treat each context (global, in-link, out-link) as an atomic concept. Each atomic concept c is associated with a unit column vector, which we shall also denote by c (one such representation can be derived by normalizing tf-idf scores into a unit vector). The projection matrix for c is then $cc^T$. Our goal is to measure the probability that c is relevant to a document d, and so, by Gleason's Theorem, each document d is associated with a density matrix $T_d$. The probability that c is relevant to d is then $p_d(c) = \mathrm{Trace}(T_d\,cc^T) = c^T T_d c$. Note that similar atomic concepts (as measured by the dot product) will have similar relevance scores.

Now, the probability distribution characterized by Gleason's theorem is not a generative distribution: it cannot be used to sample concepts. Instead, it is a distribution over yes/no answers (e.g., is concept c relevant or not?).
Thus, to estimate $T_d$ for a document d, we will need some (unknown) generative distribution $p_{gen}$ over unit vectors (concepts). Our evidence about the properties of $T_d$ comes from the following process: $p_{gen}$ independently generates k concepts $c_1, \ldots, c_k$, and these concepts are then independently judged to be relevant to d. This happens with probability $\prod_{i=1}^{k} p_{gen}(c_i)\, p_d(c_i)$. We seek to find a density matrix $T_d$ that maximizes the likelihood. The log likelihood is:

$\sum_{i=1}^{k} \log p_{gen}(c_i) + \sum_{i=1}^{k} \log\left(c_i^T T_d c_i\right).$

Now, if there is only one concept (i.e., k = 1), then the maximum likelihood estimator is easy to compute: it is $T_d = c_1 c_1^T$. In the general case, however, numerical methods are needed [3]. The computational cost of estimating $T_d$ can be seen from the following proposition:

Proposition 5.2. The density matrix $T_d$ can be represented as $\sum_{i=1}^{r} t_i t_i^T$, where the $t_i$ are a set of at most r orthogonal column vectors with $\sum_i (t_i \cdot t_i) = 1$, and r is the dimension of the space spanned by the $c_i$ (the number of linearly independent contexts). There are $O(r^2)$ parameters to determine, and numerical (iterative) techniques will scale as a polynomial in $r^2$.

The detailed proof is in Appendix A. Furthermore, the addition of new documents to the corpus will cause the addition of new in-link contexts, requiring a recomputation of all the $T_d$. Thus the likelihood approach will not scale for our system.

5.2 Scalable Closed Form Solutions

Since hundreds of thousands of density matrices need to be estimated (one for each document) and recomputed as new documents are added to the corpus, we opt to replace the exact but slow maximum likelihood computation with an approximate but fast closed form solution, which we derive in this section.

We begin with the following observation. For each concept $c_i$, the maximum likelihood estimate of $T_d$ given that concept $c_i$ is relevant is $c_i c_i^T$. It stands to reason that our overall estimate of $T_d$ should be similar to each of the $c_i c_i^T$. We measure similarity by the Frobenius norm (the square root of the sum of the squared matrix entries) and thus set up the following optimization problem:

minimize $\mathcal{L}(T_d) = \sum_{i=1}^{k} \|T_d - c_i c_i^T\|_F^2$,

subject to the constraint that $T_d$ is a density matrix. Taking derivatives, we get

$\frac{\partial \mathcal{L}}{\partial T_d} = 2 \sum_{i=1}^{k} \left(T_d - c_i c_i^T\right) = 0,$

leading to the solution $T_d = \frac{1}{k} \sum_{i=1}^{k} c_i c_i^T$. Now that we have a closed form estimate for $T_d$, we can discuss how to measure the relevance between documents with respect to a concept.
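The closed-form estimate and the resulting relevance score are simple enough to sketch directly. The following is a minimal illustration, assuming contexts are already given as tf-idf bags of words; the vocabulary handling and the function names are assumptions made for the example, not the paper's implementation.

```python
import numpy as np

def unit_vector(tfidf: dict, vocab: dict) -> np.ndarray:
    """Map a context (word -> tf-idf weight) to a unit column vector, as in Section 5.1."""
    v = np.zeros(len(vocab))
    for word, weight in tfidf.items():
        if word in vocab:
            v[vocab[word]] = weight
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def density_matrix(context_vectors: list) -> np.ndarray:
    """Closed-form estimate T_d = (1/k) * sum_i c_i c_i^T from Section 5.2.
    Note: in a real system one would keep the c_i themselves and use
    c^T T_d c = (1/k) * sum_i (c . c_i)^2 instead of materializing a
    |W| x |W| matrix; the dense form here is only for illustration."""
    k = len(context_vectors)
    dim = context_vectors[0].shape[0]
    T = np.zeros((dim, dim))
    for c in context_vectors:
        T += np.outer(c, c)
    return T / k

def relevance(T_d: np.ndarray, c: np.ndarray) -> float:
    """p_d(c) = c^T T_d c, the probability that concept c is relevant to d."""
    return float(c @ T_d @ c)

# Toy example: a document represented by two in-link contexts.
vocab = {"itemset": 0, "mining": 1, "privacy": 2}
contexts = [
    unit_vector({"itemset": 1.0, "mining": 0.8}, vocab),
    unit_vector({"mining": 1.0, "privacy": 0.5}, vocab),
]
T_d = density_matrix(contexts)
query = unit_vector({"itemset": 1.0, "mining": 1.0}, vocab)
print(round(relevance(T_d, query), 3))
```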