Relational Collaborative Topic Regression for Recommender Systems

Hao Wang and Wu-Jun Li, Member, IEEE

Abstract—Due to its successful application in recommender systems, collaborative filtering (CF) has become a hot research topic in data mining and information retrieval. Traditional CF methods use only the feedback matrix, which contains either explicit feedback (also called ratings) or implicit feedback on the items given by users, for training and prediction. Typically, the feedback matrix is sparse, which means that most users interact with only a few items. Due to this sparsity, traditional CF with only feedback information suffers from unsatisfactory performance. Recently, many researchers have proposed to utilize auxiliary information, such as item content (attributes), to alleviate the data sparsity problem in CF. Collaborative topic regression (CTR) is one such method, which has achieved promising performance by successfully integrating both feedback information and item content information. In many real applications, besides the feedback and item content information, there may exist relations (also known as networks) among the items which can be helpful for recommendation. In this paper, we develop a novel hierarchical Bayesian model called Relational Collaborative Topic Regression (RCTR), which extends CTR by seamlessly integrating the user-item feedback information, the item content information, and the network structure among items into the same model. Experiments on real-world datasets show that our model can achieve better prediction accuracy than the state-of-the-art methods with lower empirical training time. Moreover, RCTR can learn interpretable latent structures which are useful for recommendation.

Index Terms—Collaborative filtering, topic models, recommender system, social network, relational learning

1 INTRODUCTION

RECOMMENDER systems (RS) play an important role in enabling us to make effective use of information. For example, Amazon [28] adopts RS for product recommendations, and Netflix [8] uses RS for movie recommendations. Existing RS methods can be roughly categorized into three classes [1], [4], [10], [41]: content-based methods, collaborative filtering (CF) based methods, and hybrid methods.

Content-based methods [5], [26] adopt the profiles of the users or products for recommendation. CF based methods [7], [11], [13], [18], [27], [30], [33], [35], [37], [39], [42] use past activities or preferences, such as the ratings on items given by users, for prediction, without using any user or product profiles. Hybrid methods [2], [3], [6], [12], [15], [16], [17], [20], [32], [34], [40], [43], [47], [48] combine content-based methods and CF based methods by ensemble techniques. Due to privacy issues, it is in general harder to collect user profiles than past activities. Hence, CF based methods have become more popular than content-based methods in recent years.

In most traditional CF methods, only the feedback matrix, which contains either explicit feedback (also called ratings) or implicit feedback [21] on the items given by users, is used for training and prediction. Typically, the feedback matrix is sparse: most items receive feedback from only a few users, and most users give feedback on only a few items. Due to this sparsity problem, traditional CF with only feedback information suffers from unsatisfactory performance.
More specifically, it is difficult for CF methods to achieve good performance in both the item-oriented setting and the user-oriented setting when the feedback matrix is sparse. In an item-oriented setting, where we need to recommend users to items, it is generally difficult to know which users could like an item if it has been given feedback by only one or two users. This adds to the difficulty companies face when promoting new products (items). Moreover, users' ignorance of new items results in less feedback on those items, which further harms the accuracy of their recommendations. In the user-oriented setting, where we recommend items to users, it is likewise difficult to predict what a user likes if the user has given feedback on only one or two items. In the real world, however, most users provide only a little feedback. Providing good recommendations for new users with little feedback is actually more important than for frequent users: new users will come back to the site (service) only if the recommendations are good, whereas frequent users are most likely already satisfied with the site (service). If we manage to boost the recommendation accuracy for new or infrequent users, more of them will become frequent users, and then better recommendations can be expected with more training data. Therefore, improving the recommendation accuracy in an extremely sparse setting is key to getting a recommender system working in a positive cycle.

H. Wang is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. E-mail: hwangaz@cse.ust.hk.
W.-J. Li is with the National Key Laboratory of Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China. E-mail: liwujun@nju.edu.cn.
Manuscript received 24 July 2013; revised 29 Aug. 2014; accepted 6 Oct. 2014. Date of publication 29 Oct. 2014; date of current version 27 Mar. 2015. Recommended for acceptance by Y. Koren. Digital Object Identifier no. 10.1109/TKDE.2014.2365789.

To overcome the sparsity problem of CF based models, many researchers have proposed to integrate auxiliary
information into the model training and prediction procedures. Some methods [45], [51] utilize the item content (attributes) to facilitate CF training. One representative of these methods is collaborative topic regression (CTR) [45], which jointly models the user-item feedback matrix and the item content information (the texts of articles). CTR seamlessly incorporates topic modeling [9] into CF to improve performance and interpretability. For new items, CTR is able to perform out-of-matrix prediction (cold-start prediction) [38], [51] using only the content information. Some other methods [22], [29], [31] try to use social networks among users to improve performance. Among these methods, CTR-SMF [31] extends CTR by integrating the social network among users into CTR with social matrix factorization (SMF) [29] techniques, and has achieved better performance than CTR.

In many real applications, besides the feedback and item content information, there may exist relations (or networks) among the items [44], [46] which can also be helpful for recommendation. For example, if we want to recommend papers (references) to users in CiteULike (http://www.citeulike.org/), the citation relations between papers are informative for recommending papers with similar topics. Other examples of item networks include hyperlinks among webpages, movies directed by the same director, and so on.

In this paper, we develop a novel hierarchical Bayesian model, called Relational Collaborative Topic Regression (RCTR), to incorporate item relations for recommendation. The main contributions of RCTR are outlined as follows:

- By extending CTR, RCTR seamlessly integrates the user-item feedback information, item content information, and relational (network) structure among items into a principled hierarchical Bayesian model.
- Even if a new item has been given feedback by only one or two users, RCTR can make effective use of the information from the item network to alleviate the data sparsity problem in CF, which consequently improves the recommendation accuracy.
- In cases where a new user has given feedback on only one or two items, RCTR can also make effective use of the information from the item network to improve the recommendation accuracy.
- In RCTR, a family of link (relation) probability functions is proposed to model the relations between items. This extension from discrete link probability functions like those in [14] to a family of link probability functions increases the modeling capacity of RCTR and yields better performance.
- Compared with CTR, fewer learning iterations are needed for RCTR to achieve satisfactory accuracy. As a consequence, the total measured training time of RCTR is lower than that of CTR even though the time complexity of each RCTR iteration is slightly higher than that of CTR.
- RCTR can learn interpretable latent structures which are useful for recommendation.
- Experiments on real-world datasets show that RCTR can achieve higher prediction accuracy than the state-of-the-art methods.

Note that CTR-SMF [31] integrates user relations into the CTR model, which is different from the application scenario of our RCTR model.

The rest of this paper is organized as follows. In Section 2, we briefly introduce the background of CF and CTR. Section 3 presents the details of our proposed RCTR model. Experimental results are described in Section 4, and finally Section 5 concludes the paper.
2 BACKGROUND

In this section, we give a brief introduction to the background of RCTR, including CF based recommendation, matrix factorization (MF) (also called latent factor model) based CF methods [25], [35], [36], [49], [50], and CTR [45].

2.1 CF Based Recommendation

The task of CF is to recommend items to users based on the users' past feedback. For example, we can deploy a recommender system to recommend papers (references) to researchers in CiteULike. Here, users are researchers and items are papers. As in [45], we assume there are I users and J items in this paper. The feedback on item j given by user i is denoted by r_ij. Although our model is general enough to be used for other settings with explicit feedback, such as the case with integer ratings ranging from 1 to 5, we assume r_ij ∈ {0, 1} in this paper, which is the same setting as that in CTR [45]. Note that this is a setting with implicit feedback as introduced in [21]. This means that our model tries to predict whether a user likes an item or not. In the training data, r_ij = 1 means that user i likes item j, while r_ij = 0 means that the element is unobserved (missing), i.e., we do not know whether user i likes item j or not. As stated in Section 1, CF methods use only the feedback matrix {r_ij : i = 1, 2, ..., I; j = 1, 2, ..., J} for training and prediction.

There are two different cases of prediction [45]: in-matrix prediction and out-of-matrix prediction. For the item-oriented setting, in-matrix prediction tries to make recommendations for items with at least one feedback from the users in the training data. On the contrary, out-of-matrix prediction tries to make recommendations for items without any feedback in the training data. The in-matrix and out-of-matrix prediction for user-oriented settings are similar, except that we make recommendations for users rather than items. Out-of-matrix prediction is actually the so-called cold-start recommendation in some of the literature [38], [51].
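To make this setup concrete, the following is a minimal illustrative sketch (in Python with NumPy; the toy data and all variable names are our own, not from the paper) of a binary implicit-feedback matrix and the in-matrix versus out-of-matrix split for the item-oriented setting.

    # Illustrative only: the implicit-feedback setup of Section 2.1.
    import numpy as np

    I, J = 4, 5                                        # numbers of users and items
    likes = [(0, 1), (0, 3), (1, 1), (2, 4), (3, 0)]   # observed (user, item) pairs

    R = np.zeros((I, J), dtype=int)   # r_ij = 1: user i likes item j
    for i, j in likes:                # r_ij = 0: unobserved, not "dislike"
        R[i, j] = 1

    # An item is an in-matrix case if it received at least one feedback in
    # training; otherwise predicting for it is out-of-matrix (cold-start).
    feedback_per_item = R.sum(axis=0)
    in_matrix_items = np.where(feedback_per_item > 0)[0]
    out_of_matrix_items = np.where(feedback_per_item == 0)[0]
    print(in_matrix_items, out_of_matrix_items)        # [0 1 3 4] [2]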
2.2 Matrix Factorization for CF

The existing CF methods can be divided into two main categories [24]: memory-based methods [18], [28], [37] and model-based methods [19], [25], [35]. Memory-based methods adopt the (weighted) average of the feedback of similar (neighborhood) users or items for prediction, while model-based methods try to learn a statistical model from the training data. Many works have verified that model-based methods can outperform memory-based methods in general. Hence, model-based methods have become more popular in recent years.

Matrix factorization [25], [35] and its extensions, such as probabilistic matrix factorization (PMF) [35], are the most representative model-based methods and have achieved promising performance in practice. The basic idea of MF is to use latent vectors in a low-dimensional space to represent the users and items. More specifically, user i is represented by a latent vector u_i ∈ R^K of dimensionality K, and item j is represented by a latent vector v_j ∈ R^K. The predicted feedback on item j given by user i is computed as

    \hat{r}_{ij} = u_i^T v_j.

If we use two latent matrices U = (u_i)_{i=1}^{I} and V = (v_j)_{j=1}^{J} to denote the latent vectors of all users and items respectively, MF learns the optimal U and V by optimizing the following objective function:

    \min_{U,V} \sum_{i=1}^{I} \sum_{j=1}^{J} \left( r_{ij} - u_i^T v_j \right)^2 + \lambda_u \sum_{i=1}^{I} \|u_i\|^2 + \lambda_v \sum_{j=1}^{J} \|v_j\|^2,    (1)

where ‖·‖ denotes the ℓ2 (Frobenius) norm of a vector, and λ_u and λ_v are regularization hyper-parameters controlling model complexity. The objective function in (1) corresponds to the maximum a posteriori (MAP) estimate of the PMF model in [35].

In [45], a generalization of the PMF model is proposed:

    u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K),
    v_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K),    (2)
    r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1}),

where N(·) denotes the normal distribution, I_K is the K × K identity matrix, and the confidence c_ij is defined as

    c_{ij} = a if r_{ij} = 1, and c_{ij} = b if r_{ij} = 0,

with a and b being tuning parameters satisfying a > b > 0. Please note that all the (i, j) pairs with feedback 0 in the training set are used for training in this paper. We can also sample only part of them for training in cases where the number of zeros in the feedback matrix is too large.

MF methods have achieved promising performance in practice. However, they also suffer from the sparsity problem. Furthermore, as stated in [45], it is not easy for MF methods to perform out-of-matrix prediction.

2.3 Collaborative Topic Regression

Collaborative topic regression [45] was proposed to recommend documents (papers) to users by seamlessly integrating both the feedback matrix and the item (document) content information into the same model, which addresses the problems faced by MF based CF. By combining MF and latent Dirichlet allocation (LDA) [9], CTR achieves better prediction performance than MF based CF, with more interpretable results. Moreover, with the item content information, CTR can predict feedback for out-of-matrix items. The graphical model of CTR is shown in Fig. 1.

[Fig. 1. The graphical model of collaborative topic regression.]

CTR introduces an item latent offset ε_j between the topic proportions θ_j in LDA and the item latent vector v_j in CF. The offset can be explained as the gap between what the article is about (represented by θ_j) and what the users think of it (represented by v_j), as discussed in [45]. If we use β = β_{1:K} to denote the K topics, the generative process of CTR is as follows:

1) Draw a user latent vector for each user i: u_i ~ N(0, λ_u^{-1} I_K).
2) For each item j,
   a) Draw topic proportions θ_j ~ Dirichlet(α).
   b) Draw item latent offset ε_j ~ N(0, λ_v^{-1} I_K), then set the item latent vector to be v_j = ε_j + θ_j.
   c) For each word w_jn in the document (item) w_j,
      i) Draw topic assignment z_jn ~ Mult(θ_j).
      ii) Draw word w_jn ~ Mult(β_{z_jn}).
3) Draw the feedback r_ij for each user-item pair (i, j): r_ij ~ N(u_i^T v_j, c_ij^{-1}).

As mentioned in [45], the key to CTR lies in the item latent offset ε_j, which keeps the item latent vector v_j close to the topic proportions θ_j while allowing it to diverge when necessary. The parameter λ_v controls how close v_j is to θ_j.
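As a concrete illustration of the generative story above, the following sketch samples user and item latent vectors in CTR's style. It is our own toy example (word generation is omitted, and all hyper-parameter values are arbitrary assumptions), not the authors' implementation.

    # A minimal sketch of sampling from the CTR generative process.
    import numpy as np

    rng = np.random.default_rng(0)
    I, J, K = 3, 4, 2                    # users, items, latent dimensionality
    lambda_u, lambda_v = 0.01, 1.0       # assumed hyper-parameter values

    U = rng.normal(0.0, lambda_u ** -0.5, size=(I, K))    # u_i ~ N(0, lambda_u^{-1} I_K)
    theta = rng.dirichlet(np.ones(K), size=J)             # theta_j ~ Dirichlet(1)
    eps = rng.normal(0.0, lambda_v ** -0.5, size=(J, K))  # eps_j ~ N(0, lambda_v^{-1} I_K)
    V = eps + theta                                       # v_j = eps_j + theta_j

    # Feedback model: r_ij ~ N(u_i^T v_j, c_ij^{-1}); during training,
    # c_ij = a for observed r_ij = 1 and c_ij = b (low confidence) for zeros.
    print(U @ V.T)                                        # matrix of means u_i^T v_j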
Experiments on scientific article recommendation from CiteULike show that CTR can outperform MF based CF methods.

3 RELATIONAL COLLABORATIVE TOPIC REGRESSION

In this section, we describe the details of our proposed model, Relational Collaborative Topic Regression. Besides the feedback and item content information modeled by CTR, RCTR also models the relations among the items, which are informative for recommendation.

3.1 Model Formulation

To better illustrate the graphical model of RCTR, we depict it in a way different from Fig. 1, which is the style adopted by the authors of CTR [45]. The graphical model of RCTR is shown in Fig. 2, in which the component in the dashed rectangle is what differentiates RCTR from CTR.

The generative process of RCTR is as follows:

1) Draw a user latent vector for each user i: u_i ~ N(0, λ_u^{-1} I_K).
2) For each item j,
   a) Draw topic proportions θ_j ~ Dirichlet(α).
   b) Draw item latent offset ε_j ~ N(0, λ_v^{-1} I_K), then set the item latent vector to be v_j = ε_j + θ_j.
   c) Draw item relational offset τ_j ~ N(0, λ_r^{-1} I_K), then set the item relational vector to be s_j = τ_j + v_j.
   d) For each word w_jn in the document (item) w_j,
      i) Draw topic assignment z_jn ~ Mult(θ_j).
      ii) Draw word w_jn ~ Mult(β_{z_jn}).
3) Draw the parameter η⁺ ~ N(0, λ_e^{-1} I_{K+1}).
4) For each pair of items (j, j'), draw a binary link indicator l_{j,j'} | s_j, s_{j'} ~ ψ(· | s_j, s_{j'}, η⁺).
5) Draw the feedback r_ij for each user-item pair (i, j): r_ij ~ N(u_i^T v_j, c_ij^{-1}).

In the above generative process, the link probability function is defined as follows:

    \psi(l_{j,j'} = 1 \mid s_j, s_{j'}, \eta^+) = \left[ \sigma\!\left( \eta^T (s_j \circ s_{j'}) + \nu \right) \right]^{\rho},    (3)

where l_{j,j'} is a binary value: l_{j,j'} = 1 means there exists a relation (link) between items j and j', and otherwise l_{j,j'} = 0. Here ν is a scalar value representing the offset, η⁺ = ⟨η, ν⟩ with ⟨·,·⟩ being vector-scalar concatenation, the operator ∘ denotes element-wise vector multiplication, and σ(·) is the sigmoid function:

    \sigma(x) = \frac{\exp(x)}{1 + \exp(x)}.

Steps 2(c), 3, and 4 in the above generative process are what differentiates RCTR from CTR. The item relational offset τ_j in Step 2(c) is one of the key properties of RCTR. Similar to the item latent offset, τ_j makes it possible for s_j to diverge from the item latent vector v_j if necessary. While the item latent vector v_j represents what the users think item j is about, the item relational vector s_j represents the effect other items have on item j. The larger λ_r is, the more likely it is that v_j and s_j are close to each other. When λ_r goes to infinity, the model degenerates to the case with v_j = s_j, which is shown in Fig. 3. In the experiments presented in Section 4 (Fig. 21), we show that the RCTR model in Fig. 2 outperforms the degenerated model in Fig. 3, which verifies the effectiveness of the item relational offset τ_j. Note that, for simplicity and fair comparison, we use the same Gaussian feedback model in Step 5 as in [45]. Since the feedback is binary, it is possible to further improve the performance by using a logistic feedback model.

3.2 Learning

Based on the RCTR model in Fig. 2, we may treat all the parameters as random variables and resort to fully Bayesian methods, such as Markov chain Monte Carlo (MCMC) methods and variational methods [23], for learning and inference. However, we do not adopt fully Bayesian methods here for RCTR because these methods typically incur high computational cost. Furthermore, because the focus of this paper is to illustrate the importance of the relations among items, which are not utilized in CTR, it is more reasonable to adopt the same learning strategy as in CTR for learning and inference.

Hence, as in CTR, a maximum a posteriori (MAP) estimate is adopted for parameter learning in RCTR. The MAP estimate tries to maximize the log-posterior of U, V, η⁺, s_{1:J}, θ_{1:J}, and β given the hyper-parameters ρ, λ_u, λ_v, λ_r, and λ_e:

    \mathcal{L} = \rho \sum_{l_{j,j'}=1} \log \sigma\!\left( \eta^T (s_j \circ s_{j'}) + \nu \right)
        - \frac{\lambda_u}{2} \sum_i u_i^T u_i
        - \frac{\lambda_v}{2} \sum_j (v_j - \theta_j)^T (v_j - \theta_j)
        - \frac{\lambda_r}{2} \sum_j (s_j - v_j)^T (s_j - v_j)
        - \frac{\lambda_e}{2} (\eta^+)^T \eta^+
        + \sum_j \sum_n \log \sum_k \theta_{jk} \beta_{k, w_{jn}}
        - \sum_{i,j} \frac{c_{ij}}{2} \left( r_{ij} - u_i^T v_j \right)^2.    (4)

A constant is omitted, and the hyper-parameter α of the topic model is set to 1 as in CTR. This objective function can be optimized by coordinate ascent. Since L is not jointly convex in all the variables, we design an alternating algorithm to learn the parameters. More specifically, each time we optimize one parameter with all the other parameters fixed.

For u_i and v_j, setting the gradient to zero gives the following update rules:

    u_i \leftarrow (V C_i V^T + \lambda_u I_K)^{-1} V C_i R_i,
    v_j \leftarrow (U C_j U^T + \lambda_v I_K + \lambda_r I_K)^{-1} (U C_j R_j + \lambda_v \theta_j + \lambda_r s_j),

where C_i is a diagonal matrix with {c_ij | j = 1, ..., J} as its diagonal entries, and R_i = (r_ij | j = 1, 2, ..., J) is a column vector containing all the feedback of user i (C_j and R_j are defined analogously over the users for item j). Note that c_ij is defined in (2) and reflects the confidence controlled by a and b, as discussed in [21].

[Fig. 2. The graphical model of RCTR. The component in the dashed rectangle is what differentiates RCTR from CTR.]
[Fig. 3. The graphical model of the degenerated RCTR. The component in the dashed rectangle is what differentiates RCTR from CTR.]
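For concreteness, here is a minimal NumPy sketch of the two closed-form updates above. It is an illustrative sketch under our own naming (update_user, update_item, and the dense-matrix treatment are assumptions); practical implementations exploit the diagonal structure of C_i and the sparsity of the feedback to avoid dense products, as discussed in [21].

    # A sketch, not the authors' code: closed-form coordinate-ascent
    # updates for u_i and v_j in RCTR.
    import numpy as np

    def update_user(V, c_i, r_i, lambda_u):
        """u_i <- (V C_i V^T + lambda_u I_K)^{-1} V C_i R_i.
        V: K x J item factors; c_i: length-J confidences of user i;
        r_i: length-J binary feedback vector of user i."""
        K = V.shape[0]
        A = (V * c_i) @ V.T + lambda_u * np.eye(K)   # V C_i V^T, C_i diagonal
        return np.linalg.solve(A, (V * c_i) @ r_i)

    def update_item(U, c_j, r_j, theta_j, s_j, lambda_v, lambda_r):
        """v_j <- (U C_j U^T + (lambda_v + lambda_r) I_K)^{-1}
                  (U C_j R_j + lambda_v theta_j + lambda_r s_j).
        U: K x I user factors; c_j, r_j: length-I confidences and
        feedback of item j."""
        K = U.shape[0]
        A = (U * c_j) @ U.T + (lambda_v + lambda_r) * np.eye(K)
        b = (U * c_j) @ r_j + lambda_v * theta_j + lambda_r * s_j
        return np.linalg.solve(A, b)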
For s_j and η⁺, since we cannot directly take the gradients of L with respect to s_j or η⁺ and set them to zero, gradient ascent is used to update these variables. The gradient of L with respect to s_j is

    \nabla_{s_j} \mathcal{L} = \rho \sum_{l_{j,j'}=1} \left( 1 - \sigma(\eta^T (s_j \circ s_{j'}) + \nu) \right) (\eta \circ s_{j'}) - \lambda_r (s_j - v_j).

The gradient of L with respect to η⁺ is

    \nabla_{\eta^+} \mathcal{L} = \rho \sum_{l_{j,j'}=1} \left( 1 - \sigma((\eta^+)^T \pi^+_{j,j'}) \right) \pi^+_{j,j'} - \lambda_e \eta^+,

where we denote π⁺_{j,j'} = ⟨s_j ∘ s_{j'}, 1⟩.

For θ_j, we first define q(z_jn = k) = φ_jnk as in [45], and then apply Jensen's inequality after the terms containing θ_j are separated:

    \mathcal{L}(\theta_j) \geq -\frac{\lambda_v}{2} (v_j - \theta_j)^T (v_j - \theta_j) + \sum_n \sum_k \phi_{jnk} \left( \log \theta_{jk} \beta_{k, w_{jn}} - \log \phi_{jnk} \right) = \mathcal{L}(\theta_j, \phi_j).

Here φ_j = (φ_jnk)_{n=1,k=1}^{N,K}. Obviously, L(θ_j, φ_j) is a tight lower bound of L(θ_j), and we can use projection gradient to optimize θ_j. The optimal φ_jnk satisfies

    \phi_{jnk} \propto \theta_{jk} \beta_{k, w_{jn}}.

As for the parameter β, we follow the same M-step update as in LDA:

    \beta_{kw} \propto \sum_j \sum_n \phi_{jnk} \, 1[w_{jn} = w].

3.3 Prediction

Let D be the observed test data. Similar to [45], we use point estimates of u_i, θ_j, and ε_j to calculate the predicted feedback:

    E[r_{ij} \mid D] \approx E[u_i \mid D]^T \left( E[\theta_j \mid D] + E[\epsilon_j \mid D] \right),

where E(·) denotes the expectation operator.

For in-matrix prediction,

    r^*_{ij} \approx (u^*_i)^T (\theta^*_j + \epsilon^*_j) = (u^*_i)^T v^*_j.

For out-of-matrix prediction, since E[ε_j] = 0, we have

    r^*_{ij} \approx (u^*_i)^T \theta^*_j.

3.4 Time Complexity

According to the update rules in the RCTR learning procedure, for each iteration the time complexity of updating η⁺ is O(KL), where K is the dimensionality of the latent factor space and L is the total number of relations (links) in the relation network. The cost of updating the item relational matrix S = {s_j | j = 1, 2, ..., J} is also O(KL) per iteration. The complexity of updating the other variables is the same as in CTR: for U the time complexity is O(IK^3 + IJK^2), and for V it is O(JK^3 + IJK^2), where I is the number of users and J is the number of items. In each iteration, RCTR adds only O(KL) extra time complexity to CTR. Since the relation network is typically sparse, which means L can be treated as a constant multiple of J, the extra time cost of RCTR is minimal.

From our experiments, we find that RCTR needs fewer learning iterations than CTR to achieve satisfactory accuracy. As a consequence, the total measured training time of RCTR is lower than that of CTR even though the time complexity of each iteration of RCTR is slightly higher than that of CTR. This is verified by the experimental results in Section 4.7.

3.5 Discussion on Link Probability Function

Another key property of RCTR is the family of link probability functions, which is inspired by the relational topic model (RTM) [14]. The authors of RTM [14] find that different link probability functions achieve different prediction performance. In RCTR, we use a single parameter ρ to control the choice of the link probability function. Since ρ is a non-negative real number, the family of link probability functions actually contains an infinite number of candidate link probability functions, whereas only two link probability functions are proposed in [14]. Hence, our new family of link probability functions can increase the modeling capacity of RCTR, and consequently better performance can be expected.

[Fig. 4. A comparison of link probability functions with different ρ. The curves plot the probability of l_{j,j'} = 1 as a function of the inner product of s_j and s_{j'}. η is set to 1 and ν is adapted so that all functions have the same starting point.]
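To make (3) concrete, the sketch below (our own illustration in NumPy, not from the paper) evaluates the link probability for several values of ρ. Following the setup of Fig. 4, η is set to all ones; here ν is simply fixed to 0 rather than adapted.

    # Illustrative only: the link probability family of Eq. (3),
    # psi(l = 1 | s_j, s_j', eta+) = sigmoid(eta^T (s_j o s_j') + nu)^rho.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def link_prob(s_j, s_jp, eta, nu, rho):
        return sigmoid(eta @ (s_j * s_jp) + nu) ** rho  # o = element-wise product

    K = 8
    rng = np.random.default_rng(0)
    s_j, s_jp = rng.normal(size=K), rng.normal(size=K)
    eta, nu = np.ones(K), 0.0    # with eta = 1, eta^T (s_j o s_j') = s_j^T s_j'
    for rho in (0.5, 1.0, 2.0):  # rho = 1 recovers an RTM-style link function
        print(rho, link_prob(s_j, s_jp, eta, nu, rho))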
From the perspective of optimization, ρ can simply be regarded as a necessary regularization hyper-parameter that controls the tradeoff between the relations and the other observations, as can easily be seen from (4). A comparison of link probability functions with different ρ is shown in Fig. 4, from which we can see that our link probability functions are flexible enough to model different cases.

Note that if ρ = 1, our link probability function degenerates to one of the link probability functions mentioned in RTM [14]. Moreover, when ρ = 1, ν = 0, and η is a vector with all elements being one, ψ(l_{j,j'} = 1 | s_j, s_{j'}, η⁺) = σ(s_j^T s_{j'}), which is the link probability function in CTR-SMF [31].
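Finally, connecting back to the prediction rules of Section 3.3, the point-estimate predictions reduce to a few lines of code; the function names below are our own, not the authors'.

    # A sketch of RCTR/CTR point-estimate prediction (Section 3.3).
    import numpy as np

    def predict_in_matrix(u_i, theta_j, eps_j):
        # r*_ij ~= u_i^T (theta_j + eps_j) = u_i^T v_j
        return u_i @ (theta_j + eps_j)

    def predict_out_of_matrix(u_i, theta_j):
        # E[eps_j] = 0 for items without feedback, so only the
        # content-based topic proportions contribute: r*_ij ~= u_i^T theta_j.
        return u_i @ theta_j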