Ranked Tag recommendation Systems Based on Logistic Regression J. R Quevedo, E Montanes, J. Ranilla, and I. Diaz University of Oviedo i quevedo, montaneselena, ranilla, sirene @uniovies Abstract. This work proposes an approach to tag recommendation based on a logistic regression based system. The goal of the method is to support users of current social network systems by providing a rank of new meaningful tags for a resource. This system provides a ranked tag set and it feeds on different posts depending on the resource for which the user requests the recommendation. The performance of this approach is tested according to several evaluation measures, one of them proposed in this paper(Ft). The experiments show that this learning system outperforms certain benchmark recommenders 1 Introduction Tagging can be defined as the process of assigning short textual descriptions called tags, to information resources, which allow the user to organize the con tent. This becomes very popular and helpful for large-scale systems such as Folksonomies. A Folksonomy 3 is a collection of resources entered by users in posts. Each post consists of a resource and a set of keywords, so-called tags, at tached to it by a user. Generally, the resource is specific to the user who added it to the system, but all users are invited to label it with tags This paper proposes an approach to tag recommendation based on a learning process. The work starts from the hypothesis that a learning process improves the performance of the recommendation task. It explores several information to feed on the learner. It also analyzes the fact that recent posts are more useful and suitable to recommend new tags The remainder of the paper is structured as follows. Section 2 presents back- ground information about tag recommendation in social networks. Our approach is put in context in Section 3 while the proposed method is provided in Section 4. Section 5 details some novel performance evaluation metrics. Finally Section 6 shows the results and Section 7 draws the work conclusions This work was supported by the spanish Ministerio de Educacion y Ciencia and the European Regional Development Fund TIN2007-61273 M. Grana Romay et al.(Eds ) HAIS 2010, Part I, LNAI 6076, pp 237-244, 2010
Ranked Tag Recommendation Systems Based on Logistic Regression J.R. Quevedo, E. Monta˜n´es, J. Ranilla, and I. D´ıaz Computer Science Department University of Oviedo {quevedo,montaneselena,ranilla,sirene}@uniovi.es Abstract. This work proposes an approach to tag recommendation based on a logistic regression based system. The goal of the method is to support users of current social network systems by providing a rank of new meaningful tags for a resource. This system provides a ranked tag set and it feeds on different posts depending on the resource for which the user requests the recommendation. The performance of this approach is tested according to several evaluation measures, one of them proposed in this paper (F + 1 ). The experiments show that this learning system outperforms certain benchmark recommenders. 1 Introduction Tagging can be defined as the process of assigning short textual descriptions, called tags, to information resources, which allow the user to organize the content. This becomes very popular and helpful for large-scale systems such as Folksonomies. A Folksonomy [3] is a collection of resources entered by users in posts. Each post consists of a resource and a set of keywords, so-called tags, attached to it by a user. Generally, the resource is specific to the user who added it to the system, but all users are invited to label it with tags. This paper proposes an approach to tag recommendation based on a learning process. The work starts from the hypothesis that a learning process improves the performance of the recommendation task. It explores several information to feed on the learner. It also analyzes the fact that recent posts are more useful and suitable to recommend new tags. The remainder of the paper is structured as follows. Section 2 presents background information about tag recommendation in social networks. Our approach is put in context in Section 3 while the proposed method is provided in Section 4. Section 5 details some novel performance evaluation metrics. Finally Section 6 shows the results and Section 7 draws the work conclusions. This work was supported by the Spanish Ministerio de Educaci´on y Ciencia and the European Regional Development Fund [TIN2007-61273]. M. Gra˜na Romay et al. (Eds.): HAIS 2010, Part I, LNAI 6076, pp. 237–244, 2010. c Springer-Verlag Berlin Heidelberg 2010
238 J.R. Quevedo et al 2 Related work Different approaches have been proposed to support users during the tag process depending on the purpose for which they were built. Some of makes recommendations by analyzing tag co-occurrences or studies graph Im Jaschke et al. 5 adapt a user-based collaborative filtering as well as a graph- based recommender built on top of FolkRank Basile et al. 1 propose a smart TRS able to learn from past user interaction as well as the content of the re- sources to annotate. Katakis et al. [6 model the automated tag suggestion prob- lem as a multi-label text classification task. Sigurbjornsson et al. [8 present the res ults by means of a tag characterization focusing on how users tags photos of Flickr and what information is contained in the tagging Most of these systems require information associated with the content of the resource itself 1. Others simply suggest a sets of tags as a consequence of classification rather than providing a ranking of them [6. Some of them require a large quantitative of supporting data [8. The proposal of this work avoids these drawbacks through a novel approach which establishes a tag ranking by a machine learning approach based on Logistic Regression 3 Tag Recommender Systems(tRs A folksonomy is a tuple F: =(l, T, R, ]) wherell, T and R are finite sets, whose elements are respectively called users, tags and resources, and y is a ternary relation between them, i. e, yCuxTxR, whose elements are tag assignments (posts). When a user adds a new or existing resource to a folksonomy, it could be helpful to recommend him/her relevant tags TRS usually take the users, resources and the ratings of tags into account to suggest a list of tags to the user. According to 7 a TRS can briefly be formulated as a system that takes as input a given user u E u and a resource r ER and produces a set T(u,r)CT of tags as output. Jaschke et al in 5 define a pos of a folksonomy as a user, a resource and all tags that this user has assigned to that resource. This work slightly modifies this definition in the sense that it restricts the set of tags to the tags used simultaneously to tag a resource by a 2 There exist some simple but frequently used TRS [5] based on providing a list of ranked tags extracted from the set of posts connected with the current MPT(Most Popular Tags ) For each tag ti, the posts with ti are counted and the top tags (ranked by occurrence count) are utilized as recommendations st often together with ri are then proposed as recommendations MPTU (Most Popular Tags by User): For a resource ri it is counted for each tag in how many posts they occur together with Ti. In addition, for a
238 J.R. Quevedo et al. 2 Related Work Different approaches have been proposed to support users during the tagging process depending on the purpose for which they were built. Some of them makes recommendations by analyzing tag co-occurrences or studies graph-based approaches. J¨aschke et al. [5] adapt a user-based collaborative filtering as well as a graphbased recommender built on top of FolkRank. Basile et al. [1] propose a smart TRS able to learn from past user interaction as well as the content of the resources to annotate. Katakis et al. [6] model the automated tag suggestion problem as a multi-label text classification task. Sigurbjornsson et al. [8] present the results by means of a tag characterization focusing on how users tags photos of Flickr and what information is contained in the tagging. Most of these systems require information associated with the content of the resource itself [1]. Others simply suggest a sets of tags as a consequence of a classification rather than providing a ranking of them [6]. Some of them require a large quantitative of supporting data [8]. The proposal of this work avoids these drawbacks through a novel approach which establishes a tag ranking by a machine learning approach based on Logistic Regression. 3 Tag Recommender Systems (TRS) A folksonomy is a tuple F := (U, T , R, Y) where U, T and R are finite sets, whose elements are respectively called users, tags and resources, and Y is a ternary relation between them, i. e., Y ⊆ U×T ×R, whose elements are tag assignments (posts). When a user adds a new or existing resource to a folksonomy, it could be helpful to recommend him/her relevant tags. TRS usually take the users, resources and the ratings of tags into account to suggest a list of tags to the user. According to [7] a TRS can briefly be formulated as a system that takes as input a given user u ∈ U and a resource r ∈ R and produces a set T (u, r) ⊂ T of tags as output. J¨aschke et al. in [5] define a post of a folksonomy as a user, a resource and all tags that this user has assigned to that resource. This work slightly modifies this definition in the sense that it restricts the set of tags to the tags used simultaneously to tag a resource by a user. There exist some simple but frequently used TRS [5] based on providing a list of ranked tags extracted from the set of posts connected with the current annotation. – MPT (Most Popular Tags): For each tag ti, the posts with ti are counted and the top tags (ranked by occurrence count) are utilized as recommendations. – MPTR (Most Popular Tags by Resource): For a resource ri it is counted for every tag in every posts they occur together with ri. The tags occurring most often together with ri are then proposed as recommendations. – MPTU (Most Popular Tags by User): For a resource ri it is counted for each tag in how many posts they occur together with ri. In addition, for a
Ranked Tag Recommendation Systems Based on Logistic Regression 239 user ui it is counted for every tag in every posts they occur together with li. The tags occurring most often together with either ri or ui are taken as recommendations MPTRU (Most Popular Tags by Resource or User ) For a resource ri it is counted for each tag in how many posts they occur together with ri. In addition, for a user ui it is counted for every tag in how many posts they occur together with ui. The tags occurring most often together with eithe Ti or ui are taken as The introduction of a learning system is expected to improve the performance of the 4 Learning to Recommend This section depicts the whole procedure followed in order to provide a user and a resource with a set of ranked tags. These recommendations are based on a learning process that learns how the users has previously tagged resources. The core of the method is a supervised learning algorithm based on logistic regression 2. This paper studies different training sets built according to the user and resource for whom the recommendations are provided, rding to The key points of the system are the following The training set depends on each test set and it is built specifically for each f them Several training sets are built according to different criteria and afterwards compared and evaluated The learning system adopted was LIBLINEAR 2, which provides a proba- bilistic distribution before the classification. This probability distribution is exerted to rank the tags, taking as most suitable tag the one with highest probability value he tags of the ranking will be all that appear in the categories of each training set. This entails that some positive tags of a test post might not be 4.1 Definition of the Test Set The traditional approach splits the data into training and test sets at the begin- fterwards. a model is inferred aining set and it is validated thanks to the test set [6. In this paper, the methodology used is quite different in the sense that the training and test sets are not fixed. The test set is randomly elected and afterwards an ad hoc training set is provided to each test post According to the definition of a folksonomy in Section 3, it is composed by a set of posts. Each post is formed by a user, a resource and a set of tags, i.e P1=(u,r;,{t1…,t1})
Ranked Tag Recommendation Systems Based on Logistic Regression 239 user ui it is counted for every tag in every posts they occur together with ui. The tags occurring most often together with either ri or ui are taken as recommendations. – MPTRU (Most Popular Tags by Resource or User): For a resource ri it is counted for each tag in how many posts they occur together with ri. In addition, for a user ui it is counted for every tag in how many posts they occur together with ui. The tags occurring most often together with either ri or ui are taken as recommendations. The introduction of a learning system is expected to improve the performance of them. 4 Learning to Recommend This section depicts the whole procedure followed in order to provide a user and a resource with a set of ranked tags. These recommendations are based on a learning process that learns how the users has previously tagged resources. The core of the method is a supervised learning algorithm based on logistic regression [2]. This paper studies different training sets built according to the user and resource for whom the recommendations are provided. The key points of the system are the following – The training set depends on each test set and it is built specifically for each of them. – Several training sets are built according to different criteria and afterwards compared and evaluated. – The learning system adopted was LIBLINEAR [2], which provides a probabilistic distribution before the classification. This probability distribution is exerted to rank the tags, taking as most suitable tag the one with highest probability value. – The tags of the ranking will be all that appear in the categories of each training set. This entails that some positive tags of a test post might not be ranked. 4.1 Definition of the Test Set The traditional approach splits the data into training and test sets at the beginning. Afterwards, a model is inferred using the training set and it is validated thanks to the test set [6]. In this paper, the methodology used is quite different in the sense that the training and test sets are not fixed. The test set is randomly selected and afterwards an ad hoc training set is provided to each test post. According to the definition of a folksonomy in Section 3, it is composed by a set of posts. Each post is formed by a user, a resource and a set of tags, i.e., pi = (ui, ri, {ti1 ,...,tik })
40 J. R. Quevedo et al Each post of a folksonomy is candidate to become a test post. Each test post is then turned into as many examples as tags used to label the resource. Therefore post pi is split into k test examples (v;,r,t1)…k=(u4,r1,t) 4.2 Definition of the Training Set Whichever learning system strongly depends on the training set used to learn posts posted before the test post. This parameter N is experimentally ired or This work dynamically selects an ad hoc training set from the N most recent This characteristic makes the tRS to suggest the on-fashion folksonomy tags nd it produces a more scalable system, since the number of posts in the training set does not increase according to the number of posts posted before the test post This characteristic makes solving the problem by the learning system feasible Therefore, the choice of the training set for a given test post is reduced to define the criteria the posts must satisfy to be included in the training set. This work studies different criteria Approach 1. TR Let Pi=(ui, Ti, ti, .. ti ] ) be a test post. Let Ru, be the subset of posts associated to a resource ri and Rr.=pi/pi E Ri and it was posted before th TR approach selects as training set the N most modern posts of Rri, being di the date when pi was posted. Approach 2. TU Let pi=(ui, Ti, tin,..., tiny) be a test post. Let Pu be the personomy(the subset of posts posted by a user constitutes the so-called per Pu,=pi/pi E Pu, and it was posted before t] TU approach selects as training set the N most modern posts of pd, being di the date when Pi was posted Approach 3. TRU The above training sets do not take into account that the learned model which is used after a resource is presented to the user. Hence, this approach proposes to go further and to add as training examples those concerning with the resource for which the recommendations are demanded. The examples whose tags have been previously assigned to the resource by the user to whom the recommendations are provided are removed because it has no sense to recommend a user the tags he had previously used to label the resource Therefore, the training set associated to pi is formed by URd r,=(pd urpi/p;=(ui, Ti, ( t1, .,tnD)1
240 J.R. Quevedo et al. Each post of a folksonomy is candidate to become a test post. Each test post is then turned into as many examples as tags used to label the resource. Therefore, post pi is split into k test examples e1 = (ui, ri, ti1 )...ek = (ui, ri, tik ) (1) 4.2 Definition of the Training Set Whichever learning system strongly depends on the training set used to learn. This work dynamically selects an ad hoc training set from the N most recent posts posted before the test post. This parameter N is experimentally fixed. This characteristic makes the TRS to suggest the on-fashion folksonomy tags and it produces a more scalable system, since the number of posts in the training set does not increase according to the number of posts posted before the test post. This characteristic makes solving the problem by the learning system feasible. Therefore, the choice of the training set for a given test post is reduced to define the criteria the posts must satisfy to be included in the training set. This work studies different criteria. Approach 1. TR Let pi = (ui, ri, {ti1 ,...,tik }) be a test post. Let Rui be the subset of posts associated to a resource ri and Rt ri = {pi/pi ∈ Ri and it was posted before t}. TR approach selects as training set the N most modern posts of Rdi ri , being di the date when pi was posted. Approach 2. TU Let pi = (ui, ri, {ti1 ,...,tik }) be a test post. Let Pui be the personomy (the subset of posts posted by a user constitutes the so-called personomy) associated to a user ui and Pt ui = {pi/pi ∈ Pui and it was posted before t} TU approach selects as training set the N most modern posts of Pdi ui , being di the date when pi was posted. Approach 3. TRU The above training sets do not take into account that the learned model which is used after a resource is presented to the user. Hence, this approach proposes to go further and to add as training examples those concerning with the resource for which the recommendations are demanded. The examples whose tags have been previously assigned to the resource by the user to whom the recommendations are provided are removed because it has no sense to recommend a user the tags he had previously used to label the resource. Therefore, the training set associated to pi is formed by URdi ui,ri = {Pd i ∪ Rd i }\{pj/pj = (ui, ri, {t1, ..., tn})}
Ranked Tag Recommendation Systems Based on Logistic Regression 4.3 Example Representation Once both the training and test sets are defined, it is necessary to transform them into a computable form understandable for a machine learning system. Therefore, we have to define the features which characterize the examples as well as the class of each example The features which characterize the examples are the tags previously used o tag the resource in the folksonomy. Hence, each example will be represented by a boolean vector V of size M (the number of tags of the folksonomy) where if and only if t, was used to tag the resource before and 0 otherwise, where jEl,.,M. The class of a example will be the tag with which the user has tagged the resource in this moment In addition, removing redundant or non-useful features which add noise to the system is usually helpful to increase both the effectiveness and efficiency of the classifiers. The example representation based on tags as features makes possible perform a simple feature selection in the training set consisting of just keeping those tags which represent the test(this approach will be called TRUTR) 4.4 Learning System The key point of this paper is to provide a ranked set of tags adapted to a user and a resource. Therefore, it could be beneficial a learning system able to rank the tags, indicating to the user which tag is the best and which one is the worst In this framework, LIBLINEAR [2] is an open source library based on SVM which is a recent alternative able to accomplish multi-category classification through logistic regression, providing a probabilistic distribution before the clas- sification. This probability distribution is exerted to rank the tags, taking as most suitable tag the one with highest probability value. In the same sense the most discordant tag will be the one with lowest probability. This work uses the default LIBLINEAR configuration after a slight modifi- ation of the output. The evaluation in this case takes place when a resource is presented to the user. Pe erformance evaluation So far, no consensus about an adequate metric to evaluate a recommender is stated 5. In(6)a random post for each user is picked up and afterwards they provide a set of tags for this post based on the whole folksonomy except such post. Then, they compute the Fi as T+(u, r)nT(u, r) D (u n IT(u, r)l+T+(u,r) Availableathttp://www.csie.ntu.edu.tw/cjlin/liblinear/
Ranked Tag Recommendation Systems Based on Logistic Regression 241 4.3 Example Representation Once both the training and test sets are defined, it is necessary to transform them into a computable form understandable for a machine learning system. Therefore, we have to define the features which characterize the examples as well as the class of each example. The features which characterize the examples are the tags previously used to tag the resource in the folksonomy. Hence, each example will be represented by a boolean vector V of size M (the number of tags of the folksonomy) where vj = 1 if and only if tj was used to tag the resource before and 0 otherwise, where j ∈ 1,...,M. The class of a example will be the tag with which the user has tagged the resource in this moment. In addition, removing redundant or non-useful features which add noise to the system is usually helpful to increase both the effectiveness and efficiency of the classifiers. The example representation based on tags as features makes possible perform a simple feature selection in the training set consisting of just keeping those tags which represent the test (this approach will be called TRUTR). 4.4 Learning System The key point of this paper is to provide a ranked set of tags adapted to a user and a resource. Therefore, it could be beneficial a learning system able to rank the tags, indicating to the user which tag is the best and which one is the worst for the resource. In this framework, LIBLINEAR [2] is an open source library1 based on SVM which is a recent alternative able to accomplish multi-category classification through logistic regression, providing a probabilistic distribution before the classification. This probability distribution is exerted to rank the tags, taking as most suitable tag the one with highest probability value. In the same sense the most discordant tag will be the one with lowest probability. This work uses the default LIBLINEAR configuration after a slight modifi- cation of the output. The evaluation in this case takes place when a resource is presented to the user. 5 Performance Evaluation So far, no consensus about an adequate metric to evaluate a recommender is stated [5]. In ([6]) a random post for each user is picked up and afterwards they provide a set of tags for this post based on the whole folksonomy except such post. Then, they compute the F1 as F1 = 1 |D| (u,r)∈D 2|T +(u, r) ∩ T (u, r)| |T (u, r)| + |T +(u, r)| (2) 1 Available at http://www.csie.ntu.edu.tw/∼cjlin/liblinear/