A Discriminative Approach to Topic-based Citation Recommendation ? Jie Tang and Jing Zhang Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China jietang@tsinghua.edu.cn, zhangjing@keg.cs.tsinghua.edu.cn

Abstract. In this paper, we present a study of a novel problem, i.e., topic-based citation recommendation, which involves recommending papers to be referred to. Traditionally, this problem is treated as an engineering issue and dealt with using heuristics. This paper gives a formalization of topic-based citation recommendation and proposes a discriminative approach to this problem. Specifically, it proposes a two-layer Restricted Boltzmann Machine model, called RBM-CS, which can discover topic distributions of paper content and citation relationships simultaneously. Experimental results demonstrate that RBM-CS can significantly outperform baseline methods for citation recommendation.

1 Introduction

Citation recommendation is concerned with recommending papers that should be referred to. When starting work on a new research topic or brainstorming for novel ideas, a researcher usually wants a quick understanding of the existing literature in the field, including which papers are the most relevant and what sub-topics are presented in these papers. Two common ways to find reference papers are: (1) search for documents on search engines such as Google, and (2) trace the cited references starting from a small number of initial papers (seed papers). Unfortunately, for (1) it is difficult to compile a keyword list comprehensive enough to cover all relevant papers, especially for beginning researchers; it is very easy to miss important developments outside one's own specialty. For (2), an average paper may cite more than twenty papers, and analyzing each cited reference to judge whether it is useful is quite time consuming, especially as the tracing depth increases.
Additionally, even a well-organized paper may miss some important "related work" due to space limitations or other reasons. Paper recommendation has been studied previously, for example, by exploiting collaborative filtering [7]. Our problem is related to, but different from, this line of work. Firstly, in citation recommendation the user is interested not only in a list of recommended papers, but also in the sub-topics presented in these papers. Secondly, conventional methods can only recommend papers; they cannot suggest the citation position (i.e., which sentences should refer to the citation).

? The work is supported by the National Natural Science Foundation of China (60703059), the Chinese National Key Foundation Research and Development Plan (2007CB310803), and Chinese Young Faculty Research Funding (20070003093).
[Fig. 1: Example of citation recommendation. The figure shows a citation context and a paper collection on the left; on the right, the discovered topics (Topic 1: "text summarization"; Topic 2: "information retrieval"), the suggested references [1]-[5] (Lafferty and Zhai, SIGIR'01; Luhn 1958; McKeown and Radev, SIGIR'95; Robertson 1977; Zhai and Lafferty, SIGIR'01), and the matching of references with citation sentences.]

In this paper, we formalize citation recommendation as topic discovery, topic-based recommendation, and matching citation sentences with the recommended papers. We propose a unified, discriminative approach to citation recommendation. This approach can automatically discover topical aspects of each paper and recommend papers based on the discovered topic distribution. Experimental results show that the proposed approach significantly outperforms the baseline methods.

2 Problem Formulation

We define the notation used throughout this paper.
Assume that a paper d contains a vector w_d of N_d words, in which each word w_di is chosen from a vocabulary of size V, and a list l_d of L_d references. A collection of D papers can then be represented as D = {(w_1, l_1), · · · , (w_D, l_D)}. We only consider references that appear in the paper collection D; thus the size L of the vocabulary of references is D. Further, we consider each paper to be associated with a distribution over T topics, and likewise each citation.

Definition 1. (Citation Context and Citation Sentence) A citation context consists of the context words occurring in, for instance, a user-written proposal. For example, the words "... We use Cosine computation [x] to evaluate the similarity ..." would be a citation context, where a reference paper is expected to be cited at the position "[x]". We use c to denote a citation context. Each sentence in the citation context is called a citation sentence, and the position "[x]" at which the reference paper is cited is called the citation position.

Figure 1 shows an example of citation recommendation. The left part of Figure 1 includes a citation context provided by the user and a paper collection. The right part shows the recommended result that we expect a citation recommendation algorithm to output. For instance, two topics, i.e., "text summarization" and "information retrieval", have been extracted from the citation context. For the first topic, "text summarization", two papers have been recommended, and for the second topic, "information retrieval", three papers have been recommended. Further, the recommended papers are matched with the citation sentences, and the corresponding citation positions have been identified.
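As a concrete illustration of this notation, the toy arrays below encode a collection D = {(w_d, l_d)} of D = 3 papers; all values are invented for illustration and are not from the paper's data sets.

```python
import numpy as np

V, D = 5, 3  # toy vocabulary size and number of papers

# w[d, i] = count of word i in paper d (this count is the feature f(w_i)
# used later in the RBM-CS model)
w = np.array([[2, 0, 1, 0, 0],
              [0, 3, 0, 1, 0],
              [1, 0, 0, 0, 2]])

# l[d, j] = 1 iff paper d cites paper j; the citation "vocabulary" is the
# collection itself, so L = D and l is a D x D binary matrix (f(l_j) is binary)
l = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])

corpus = list(zip(w, l))  # D = {(w_1, l_1), ..., (w_D, l_D)}
```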
We see that the recommended papers are topic dependent. By nature, the problem of citation recommendation can be formalized as topic discovery, reference paper recommendation, and matching of the recommended papers with the citation sentences.

3 Our Approach

At a high level, our approach consists of three steps:

1. We propose a two-layer Restricted Boltzmann Machine (RBM) model, referred to as RBM-CS. Given a collection of papers with citation relationships, the model learns a mixture of topic distributions over paper contents and citation relationships.
2. We present a method to rank papers for a given citation context, based on the learned topic model. We take the top-ranked papers as the recommended papers.
3. We describe a method to find the correspondence between the recommended papers and the citation sentences.

3.1 The RBM-CS Model

Restricted Boltzmann Machines (RBMs) [8] are undirected graphical models that use a layer of hidden variables to model a (topic) distribution over visible variables. In this work, we propose a two-layer RBM model, called RBM-CS, to jointly model papers and citations. A graphical representation of the RBM-CS model is shown in Figure 2. In RBM-CS, the hidden layer h is associated with two visible layers, words w and citation relationships l, coupled respectively through the interaction matrices M and U. The basic idea in RBM-CS is to capture the topic distribution of papers with a hidden topic layer, which is conditioned on both words and citation relationships. Words and citation relationships are considered to be generated from the hidden topics independently.

[Fig. 2: Graphical representation of the RBM-CS model.]

To train such a graphical model, we could maximize the generative log-likelihood log p(w, l).
However, since we are dealing with a predictive problem, our interest ultimately lies only in correct prediction of p(l|w), not necessarily in having a good p(w). Therefore, we define a discriminative objective function as a conditional log-likelihood:

\mathcal{L} = \sum_{d=1}^{D} \log p(l_d|\mathbf{w}_d) = \sum_{d=1}^{D} \log \left( \prod_{j=1}^{L} p(l_j|\mathbf{w}_d) \right) \quad (1)
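Because the citations are modeled as conditionally independent given the words, the log of the product in Equation (1) becomes a sum of per-citation log-probabilities. A minimal numeric sketch, using made-up values for p(l_j | w_d) rather than learned ones:

```python
import numpy as np

def sigmoid(x):
    """The logistic sigmoid used throughout the model."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical probabilities p(l_j | w_d) for D = 2 documents and
# L = 3 candidate references (illustrative values only, not learned).
p = np.array([[0.9, 0.5, 0.8],
              [0.6, 0.7, 0.4]])

# Equation (1): L = sum_d log prod_j p(l_j | w_d) = sum_d sum_j log p(l_j | w_d)
objective = np.sum(np.log(p))
```

Working in log space avoids underflow when the product runs over many references.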
The probability p(l_j|w_d) can be defined as:

p(l_j|\mathbf{w}) = \sigma\left( \sum_{k=1}^{T} U_{jk} f(h_k) + e_j \right), \quad f(h_k) = \sigma\left( \sum_{i=1}^{V} M_{ik} f(w_i) + \sum_{j=1}^{L} U_{jk} f(l_j) + a_k \right) \quad (2)

where σ(·) is the sigmoid function, defined as σ(x) = 1/(1 + exp(−x)); e are bias terms for citation relationships; f(h_k) is the feature function for hidden variable h_k; f(l_j) and f(w_i) are feature functions for citation relationship l_j and word w_i, respectively; and a are bias terms for hidden variables. For simplicity, we define f(w_i) as the count of word w_i in document d. We define a binary-valued feature function for citation relationships l: for document d, f(l_j) = 1 denotes that document d has a citation relationship with paper d_j.

Now, the task is to learn the model parameters Θ = (M, U, a, b, e) given a training set D, where b are bias terms for words. Maximum-likelihood (ML) learning of the parameters can be done by gradient ascent with respect to the model parameters. The exact gradient, for any parameter θ ∈ Θ, can be written as follows:

\frac{\partial \log p(\mathbf{l}|\mathbf{w})}{\partial \theta} = E_{P_0}[\mathbf{l}|\mathbf{w}] - E_{P_M}[\mathbf{l}|\mathbf{w}] \quad (3)

where E_{P_0}[·] denotes an expectation with respect to the data distribution and E_{P_M}[·] is an expectation with respect to the distribution defined by the model. Computation of the expectation E_{P_M} is intractable. In practice, we use a stochastic approximation of this gradient, called the contrastive divergence gradient [4]. The algorithm cycles through the training data and updates the model parameters according to Algorithm 1, where the probabilities p(h_k|w, l), p(w_i|h), and p(l_j|h) are defined as:

p(h_k|\mathbf{w}, \mathbf{l}) = \sigma\left( \sum_{i=1}^{V} M_{ik} f(w_i) + \sum_{j=1}^{L} U_{jk} f(l_j) + a_k \right) \quad (4)

p(w_i|\mathbf{h}) = \sigma\left( \sum_{k=1}^{T} M_{ik} f(h_k) + b_i \right) \quad (5)

p(l_j|\mathbf{h}) = \sigma\left( \sum_{k=1}^{T} U_{jk} f(h_k) + e_j \right) \quad (6)

Algorithm 1. Parameter learning via contrastive divergence
Input: training data D = {(w_d, l_d)}, topic number T, and learning rate λ
1. repeat
   (a) for each document d:
       i. sample each topic h_k according to (4);
       ii. sample each word w_i according to (5);
       iii. sample each citation relationship l_j according to (6);
   (b) end for
   (c) update each model parameter θ ∈ Θ by θ = θ + λ (∂ log p(l|w) / ∂θ)
2. until all model parameters Θ converge
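The training loop of Algorithm 1 can be sketched as a one-step contrastive divergence (CD-1) update on a single toy document. All sizes, initial weights, and the learning rate below are invented for illustration; a mean-field reconstruction is used for the model-side term, which is one common CD variant rather than necessarily the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes: V words, L candidate references, T hidden topics.
V, L, T = 6, 4, 3
M = rng.normal(scale=0.01, size=(V, T))   # word-topic weights
U = rng.normal(scale=0.01, size=(L, T))   # citation-topic weights
a, b, e = np.zeros(T), np.zeros(V), np.zeros(L)
lam = 0.1  # learning rate

fw = np.array([1., 0., 2., 0., 1., 0.])   # word counts f(w_i) for one toy doc
fl = np.array([0., 1., 1., 0.])           # binary citations f(l_j)

for _ in range(50):                       # Algorithm 1, CD-1 on the toy doc
    ph0 = sigmoid(fw @ M + fl @ U + a)    # eq. (4): p(h_k | w, l)
    h0 = (rng.random(T) < ph0).astype(float)  # step (a)i: sample topics
    pw = sigmoid(h0 @ M.T + b)            # eq. (5): reconstruct words
    pl = sigmoid(h0 @ U.T + e)            # eq. (6): reconstruct citations
    ph1 = sigmoid(pw @ M + pl @ U + a)    # model-side topic probabilities
    # step (c): data-side minus model-side term, approximating eq. (3)
    M += lam * (np.outer(fw, ph0) - np.outer(pw, ph1))
    U += lam * (np.outer(fl, ph0) - np.outer(pl, ph1))
    a += lam * (ph0 - ph1)
    b += lam * (fw - pw)
    e += lam * (fl - pl)

# After training, eq. (6) scores each candidate reference given the topics;
# the papers this document cites should score higher than those it does not.
topics = sigmoid(fw @ M + fl @ U + a)
scores = sigmoid(topics @ U.T + e)
```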
3.2 Ranking and Recommendation

The objective of citation recommendation is to rank the recommended papers for a given citation context. Specifically, we apply the same modeling procedure to the citation context. Hence, we obtain a topic representation {h_c} of the citation context c. Based on the topic representation and the modeling results, we calculate the probability of each paper being a reference paper for the citation context according to Equation (6). Finally, the papers are ranked by these probabilities and the top K ranked papers are returned as the recommended papers. It is hard to specify an accurate value of K for each citation context; a simple choice is the average number of citations in a paper (i.e., 11 in our data set).

3.3 Matching Recommended Papers with Citation Sentences

The purpose of matching the recommended papers with citation sentences is to align the recommended papers with sentences in the citation context. This can be done by using each recommended paper as a keyword query to retrieve relevant citation sentences. In general, we may use any retrieval method. In this paper, we used the KL-divergence to measure the relevance between a recommended paper and a citation sentence:

KL(d, s_{ci}) = \sum_{k=1}^{T} p(h_k|d) \log \frac{p(h_k|d)}{p(h_k|s_{ci})} \quad (7)

where d is a recommended paper and s_{ci} is the ith sentence in the citation context c; the probabilities p(h_k|d) and p(h_k|s_{ci}) can be obtained by (4).

4 Experiments

4.1 Experimental Setting

Data Set. We conducted experiments on two data sets, NIPS 1 and Citeseer 2. The NIPS data set consists of 12 volumes of NIPS papers (1,605 papers and 10,472 citation relationships). Each paper contains full text and its citations. We removed citations with incomplete information, e.g., consisting of only authors and a publication venue but no title. We also removed citations that do not appear in the data set.
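The matching step of Section 3.3 can be sketched as follows, assuming the topic distributions p(h_k|d) and p(h_k|s_ci) have already been inferred; the toy distributions below are invented for illustration. Each recommended paper is aligned to the citation sentence with the smallest divergence under Equation (7).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(d, s_ci) = sum_k p(h_k|d) log( p(h_k|d) / p(h_k|s_ci) ), eq. (7).

    A small eps guards against log(0) for topics with zero probability.
    """
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

paper = [0.7, 0.2, 0.1]        # p(h_k | d) for one recommended paper (toy)
sentences = [[0.6, 0.3, 0.1],  # p(h_k | s_c1)
             [0.1, 0.1, 0.8]]  # p(h_k | s_c2)

# Align the paper with the citation sentence of smallest KL-divergence.
best = min(range(len(sentences)), key=lambda i: kl(paper, sentences[i]))
```

Note that KL-divergence is asymmetric: the paper's topic distribution is the reference distribution p, following Equation (7).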
The Citeseer data set consists of 3,335 papers (with 32,558 citation relationships) downloaded from the Citeseer web site. Likewise, we removed citations that do not appear in the data set. Each paper was preprocessed by (a) removing stopwords and numbers; (b) removing words appearing less than three times in the corpus; and (c) downcasing the obtained words. Finally, we obtained V = 26,723 unique words and a total of 350,361 words in NIPS, and V = 44,548 unique words and 634,875 words in Citeseer.

1 http://www.cs.toronto.edu/~roweis/data.html
2 http://citeseer.ist.psu.edu/oai.html
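The preprocessing pipeline (a)-(c) can be sketched as below. The stopword list and tokenization are illustrative stand-ins; the paper does not specify which stopword list it used.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "and", "to"}  # hypothetical stand-in list
MIN_COUNT = 3  # step (b): words appearing less than three times are dropped

def preprocess(docs):
    # Step (c) downcasing is applied first here so the stopword check in
    # step (a) is case-insensitive; numbers are dropped per step (a).
    tokenized = [[t.lower() for t in doc.split()
                  if t.lower() not in STOPWORDS and not t.isdigit()]
                 for doc in docs]
    # Step (b): count words over the whole corpus, then filter rare ones.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= MIN_COUNT] for doc in tokenized]
```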