EXPLORING INFORMATION HIDDEN IN TAGS: A SUBJECT-BASED ITEM RECOMMENDATION APPROACH

Jing Peng 1, Daniel Zeng 2,1
1 Institute of Automation, Chinese Academy of Sciences
2 Department of Management Information Systems, The University of Arizona
jing.peng@ia.ac.cn, zeng@email.arizona.edu

Abstract

Collaborative tagging sites allow users to bookmark and annotate their favorite Web contents with tags. These tags provide a novel source of information for collaborative filtering (CF). Research on how to improve item recommendation quality by leveraging tags is emerging, yet the information hidden in tags is far from being fully exploited. In this paper, we aim at finding informative usage patterns in tags by consistent clustering on tags using nonnegative matrix factorization. The clustered subjects, represented by weighted tag vectors, can then be used to build a subject-centered user information seeking model for item recommendation. Experiments on two real-world datasets show that our subject-based algorithms substantially outperform the traditional CF methods as well as tag-enhanced recommendation approaches reported in the literature.

Keywords: Collaborative filtering, collaborative tagging, nonnegative matrix factorization, tag-enhanced

1. Introduction

Collaborative tagging or social bookmarking sites, such as Delicious (del.icio.us) and CiteULike (www.citeulike.org), allow users to bookmark their favorite Web contents (or items) online and provide tag-based annotations to facilitate future retrieval. Tagging data present interesting opportunities as well as challenges to CF, as tags are available as a novel source of data in addition to the typical user-item interaction information.

Recently, there has been a growing number of efforts aiming to improve Web content or item recommendation quality by leveraging tags. Zhao et al. (2008) proposed to compute the similarity of two users based on the semantic distance of their tag sets on common items they have bookmarked. Tso-Sutter et al. (2008) presented an interesting approach to integrating tags into traditional CF algorithms. They extended the item vectors for users and the user vectors for items with tags and then constructed the user/item neighborhoods based on the extended user/item profiles. Instead of exploring tags for similarity computation, the topic-based method (Peng et al. 2009) takes advantage of tags in a probabilistic framework, viewing each tag as an indicator of a topic and then estimating the probability of a user saving an item by summing up the transition probabilities through all tags.

However, none of the existing research has explicitly explored why and how a user chooses to bookmark an item and assign a certain set of tags. We hypothesize that exploiting such information embedded in tagging activities could lead to better recommendation performance. A record in tagging data is a tuple of three fields in the form of <user, item, tag>. Some preliminary work (Halpin et al. 2007; Lambiotte et al. 2006) has started to examine a tripartite graph structure of collaborative tagging, as shown in Figure 1(a). The topic-based method (Peng et al. 2009) assumes this structure implicitly. Nevertheless, this tripartite structure suffers from the following problems:

(i) A tag usually covers a wide range of topics rather than a specific subject.
(ii) Users may be linked to items that they are not interested in at all due to polysemy of tags.
(iii) Users may miss items they are actually interested in because synonyms are used.
(iv) There exists a large number of tags (e.g., "toread," "***") carrying no value to users other than their originators.
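To make the record layout concrete, here is a minimal Python sketch (the triples shown are hypothetical toy records, not drawn from the paper's data) of how <user, item, tag> tuples can be aggregated into the user-tag and item-tag frequency counts that Section 2 builds on.

```python
from collections import Counter

# Hypothetical toy tagging records: (user, item, tag) triples.
records = [
    ("alice", "url1", "python"),
    ("alice", "url1", "programming"),
    ("bob",   "url1", "python"),
    ("bob",   "url2", "recipes"),
]

# Aggregate <user,tag> and <item,tag> pair frequencies; these counts are the
# entries of the UT and IT co-occurrence matrices introduced in Section 2.
user_tag = Counter((u, t) for u, i, t in records)
item_tag = Counter((i, t) for u, i, t in records)

print(user_tag[("alice", "python")])  # 1
print(item_tag[("url1", "python")])   # 2
```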
2. Consistent Nonnegative Matrix Factorization for Tag Clustering

Figure 1. (a) Tripartite graph structure (b) Subject-centered model

In order to overcome the above problems of associating each tag with a distinct topic under the tripartite structure, we propose to cluster tags to gain a more compact and informative representation of the underlying subjects that users actually intend with these tags. Common feature-based hard clustering algorithms (such as k-means) are not suitable for clustering tags for two reasons. First, tags do not have a well-defined feature space associated with them. Second, many tags can be used to represent different subjects. In our work, we employ Nonnegative Matrix Factorization (NMF) to perform soft clustering on tags. NMF has proven to be an effective technique for extracting hidden structures from data that can be represented as matrices, e.g., in image processing (Lee et al. 1999) and text mining (Lee et al. 1999; Xu et al. 2003). Given a nonnegative matrix V, an NMF can be defined as V ≈ WH, where W and H are constrained to be nonnegative. The columns of W are often restricted to unit length so as to make the factorization result unique. From the clustering perspective, the factor matrix W contains the cluster centroids and H contains the cluster membership indicators. For notational convenience, a variant of the standard NMF is adopted in this paper, namely V^T ≈ H^T W^T, where each row of W^T indicates a cluster centroid and each row of H^T holds the cluster membership of that row's corresponding object.

Figure 2. (a) NMFs on user-tag and item-tag matrices (b) Consistent NMF on user-item-tag matrix

As shown in Figure 2(a), two types of NMF can be applied to find the cluster centroids of tags, with each centroid representing a subject. UT and IT represent the user-tag and item-tag co-occurrence matrices, respectively, with each entry being the frequency of the corresponding <user, tag> or <item, tag> pair in the training set. Each row of STBM indicates a clustered subject, and the entire matrix constitutes the bases of the subject space. USCM/ISCM holds the coordinate values of users/items in the subject space.
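As a concrete illustration of the single-matrix factorization in Figure 2(a), the following sketch runs NMF on a toy item-tag matrix and prints the heaviest tags of each extracted subject. This is not the authors' implementation (the paper's experiments use Matlab's nnmf; scikit-learn's NMF stands in here), and the random matrix is a stand-in for real tagging counts.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Hypothetical item-tag frequency matrix IT (items x tags); real data is sparse.
IT = rng.integers(0, 4, size=(30, 12)).astype(float)
tags = [f"tag{j}" for j in range(IT.shape[1])]

k = 4  # number of subjects to extract
model = NMF(n_components=k, init="nndsvd", max_iter=500)
IS_cm = model.fit_transform(IT)  # item coordinates in the subject space (ISCM)
ST_bm = model.components_        # subject bases over tags (STBM)

# Each row of ST_bm is a weighted tag vector; its heaviest tags describe the subject.
for s, row in enumerate(ST_bm):
    top = [tags[j] for j in np.argsort(row)[::-1][:3]]
    print(f"subject {s}: {top}")
```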
Nevertheless, this NMF approach is only able to capture the cluster membership of one entity (either user or item) on the clustered subjects. We argue, however, that from both modeling and computational standpoints it is critical to consider these two types of cluster membership together. To this end, we propose a new method called Consistent NMF (CONMF), as shown in Figure 2(b), to capture both memberships as well as obtain a consistent clustering result on tags. The computational definition of CONMF will be presented in Section 3.2. Here we provide the basic intuition. NMF takes one matrix as input and produces two matrices whose product approximates the input matrix. CONMF takes two matrices as input and produces two correlated factorizations to approximate the two input matrices. The imposed correlation guarantees the consistency needed to interpret the two resulting factorizations in a meaningful manner in the tagging domain.

Based on the CONMF results, we are able to build a computational model for tagging data as shown in Figure 1(b), where USCM, ISCM and STBM denote the user-subject, item-subject and subject-tag inter-relationships, respectively. We call this model a "subject-centered" model. It can be viewed as a generalization of the tripartite graph: if no tag clustering is performed, the subjects are an identity mapping to tags and the subject-centered model degenerates into the tripartite graph. In our subject-centered model, the subject becomes the driving force motivating a user to save an item and assign tags. In other words, a user bookmarks an item because the item falls into her subject interests, and she assigns a tag because the tag is able to describe her subject interest in the item being saved. The solid edges in Figure 1(b) indicate the presumed structure of tagging data, whereas the dashed edges indicate the real-world observation of users' tagging behavior. Since the clustered subjects are represented by weighted tag vectors, the subject-centered model can help solve some of the key problems the tripartite graph-based method is facing. More specifically, (i) the range of subjects is meaningfully reduced; (ii) tag semantic ambiguity is lessened; (iii) synonymous tags are grouped together to describe the same subject; and (iv) the noise from meaningless tags is discounted.

3. Subject-based Recommendation

In this section, we present how to make recommendations based on the subject-centered model. We first extract the hidden subjects from users' tagging activities via CONMF, and then estimate the probabilities of items being saved by users based on the constructed subject-centered model.

3.1 Notation

We view users, items, tags, and subjects as four random variables and denote them by U, I, T, S, respectively. The joint distribution of any two random variables X and Y is denoted as F(X,Y). The co-occurrence matrix of two entities is represented by the combination of their corresponding letters; for example, UT stands for the user-tag co-occurrence matrix. We use subscripts to distinguish different types of matrices. Subscript "PM" indicates a transition probability matrix (e.g., USPM stands for the probabilities of users getting interested in different subjects); subscript "BM" indicates a basis matrix (e.g., STBM stands for the basis matrix of the subject space); and subscript "CM" indicates a coordinate matrix (e.g., USCM holds the coordinate values of users in the subject space). In addition, the symbols furl(M) and furs(M) mean normalizing matrix M to unit row length and unit row sum, respectively.
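These two normalizations are simple row scalings. A minimal numpy sketch, with names following the paper's notation (the guard against all-zero rows is our own addition):

```python
import numpy as np

def f_url(M):
    # Unit row length: used to weigh users/items equally before factorization.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(norms == 0, 1, norms)

def f_urs(M):
    # Unit row sum: turns nonnegative rows into probability distributions.
    sums = M.sum(axis=1, keepdims=True)
    return M / np.where(sums == 0, 1, sums)
```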
3.2 Subject Extraction

CONMF is employed to discover the subjects hidden in tagging data. In Figure 2(b), the weights for matrices UT and IT are assumed to be equal, although different weights can be used as well. In our approach, UT and IT are normalized to unit row length so that all users carry equal weight within UT and all items carry equal weight within IT. We choose the matrix-level weights such that the weights for UT and IT sum to 2. CONMF can then be formally written as:

\begin{bmatrix} c \cdot f_{url}(\mathbf{UT}) \\ (2-c) \cdot f_{url}(\mathbf{IT}) \end{bmatrix} \approx \begin{bmatrix} \mathbf{US}_{CM} \\ \mathbf{IS}_{CM} \end{bmatrix} \cdot \mathbf{ST}_{BM} \qquad (1)

where c (0 < c < 2) reflects the tradeoff between the contributions of UT and IT to the extracted subjects. In fact, combining UT and IT for factorization not only guarantees a consistent clustering result on tags, but also enables the clustered subjects to aggregate information from both matrices. In our experiments, CONMF is implemented with the Matlab library function nnmf. The desired number of subjects, k, is selected manually beforehand, and the optimal value of this parameter depends on the dataset. However, according to our experiments, quite stable results can be obtained over a large range of k (e.g., 50-250).
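The paper implements CONMF with Matlab's nnmf; the sketch below mirrors equation (1) in Python under the assumption that jointly factorizing the stacked, weighted matrices with scikit-learn is an acceptable stand-in for that routine. The toy matrix sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF

def f_url(M):
    # Normalize each row to unit Euclidean length, guarding all-zero rows.
    n = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(n == 0, 1, n)

rng = np.random.default_rng(1)
UT = rng.integers(0, 3, size=(20, 12)).astype(float)  # toy user-tag counts
IT = rng.integers(0, 3, size=(30, 12)).astype(float)  # toy item-tag counts

c, k = 1.0, 4  # tradeoff weight (0 < c < 2) and number of subjects

# Equation (1): stack the weighted matrices and factorize them jointly, so
# both halves share one subject basis.
V = np.vstack([c * f_url(UT), (2 - c) * f_url(IT)])
model = NMF(n_components=k, init="nndsvd", max_iter=500)
W = model.fit_transform(V)                       # stacked [USCM; ISCM]
ST_bm = model.components_                        # shared subject bases STBM
US_cm, IS_cm = W[:UT.shape[0]], W[UT.shape[0]:]  # split back per entity
```

Because both halves of the stacked matrix share the single basis STBM, the resulting tag clusters are consistent across users and items by construction, which is the property CONMF is designed to enforce.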
3.3 Recommendation Algorithm

The basic intuition of subject-based recommendation is that a user's interest in an item can be expressed through subjects. The probability of a user visiting an item is given by p(i|u) = Σ_s p(s|u) · p(i|s), where p(s|u) indicates the probability of user u's interest in subject s and p(i|s) represents the probability of saving item i when users are interested in subject s. This formula can be rewritten as UIPM = USPM · SIPM. According to this formula, estimating USPM and SIPM becomes another crucial step of our approach besides CONMF.

Since we already have the coordinate values of users in the subject space, USPM can be computed as USPM = furs(USCM), i.e., the probability that a user is interested in a subject is proportional to her coordinate value on this subject. Likewise, ISPM can be calculated from ISCM. However, simply transposing ISPM will not yield SIPM. Since ISPM and SIPM are actually marginal probability matrices of the joint distribution F(I,S) on items and subjects, respectively, we compute the joint distribution matrix of F(I,S) from ISPM and then estimate SIPM from F(I,S). The details of computing SIPM are illustrated in Figure 3:

1. Compute ISPM by normalizing ISCM to unit row sum.
2. Multiply each row of ISPM by the frequency of this item being saved in the training set, which is equivalent to multiplying the diagonal matrix of items' frequencies, diag(I), by ISPM.
3. Transpose IS so as to get SI.
4. Obtain SIPM by normalizing SI to unit row sum.

Figure 3. Estimation of transition probability matrix SIPM

In short, SIPM is calculated using equation (2):

\mathbf{SI}_{PM} = f_{urs}\left( \left( \mathrm{diag}(\mathbf{I}) \cdot f_{urs}(\mathbf{IS}_{CM}) \right)^{T} \right) \qquad (2)

Finally, we can compute UIPM based on USPM and SIPM and then recommend the items with the highest probabilities to each user (provided the items have not been bookmarked before).
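Putting the pieces of this subsection together, here is a compact sketch of the prediction stage. It assumes US_cm and IS_cm from the CONMF sketch above, a binary user-item bookmark matrix UI, and a vector of per-item save frequencies; it is our reading of the algorithm, not the authors' code.

```python
import numpy as np

def f_urs(M):
    # Normalize each row to unit sum, turning rows into probability vectors.
    s = M.sum(axis=1, keepdims=True)
    return M / np.where(s == 0, 1, s)

def recommend(US_cm, IS_cm, UI, item_freq, n=5):
    US_pm = f_urs(US_cm)                                  # p(s|u) from USCM
    SI_pm = f_urs((np.diag(item_freq) @ f_urs(IS_cm)).T)  # equation (2): p(i|s)
    UI_pm = US_pm @ SI_pm                                 # p(i|u) = sum_s p(s|u) p(i|s)
    scores = np.where(UI > 0, -np.inf, UI_pm)             # skip already-saved items
    return np.argsort(-scores, axis=1)[:, :n]             # top-n item indices per user
```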
3.4 Tag Generalization

As discussed in Section 1, some users tend to use tags that are meaningless to other users. These tags are noise disturbing the matrices UT and IT. Since UT records the tag frequencies of individual users, it contains considerable noise. IT, on the other hand, aggregates all users' tagging behavior on items and is thus much more resistant to noisy tags. Because of this difference in noise levels between UT and IT, combining them directly for factorization may lead to poor results. In order to generate a more reliable UT, one possible approach (not without loss of user-specific information) is as follows: for a user and an item bookmarked by this user, we reset her tag vector to the corresponding tag vector from IT, reflecting crowd wisdom. In matrix form, UT = UI · furl(IT). We call this idea tag generalization and test its effectiveness in Section 4.
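In code, tag generalization reduces to a single matrix product. A minimal sketch under the same notation (UI is again the binary user-item bookmark matrix):

```python
import numpy as np

def f_url(M):
    # Normalize each row to unit Euclidean length, guarding all-zero rows.
    n = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(n == 0, 1, n)

def generalize_tags(UI, IT):
    # Replace each user's own (noisy) tag counts with the sum of the
    # row-normalized crowd tag vectors of the items she bookmarked.
    return UI @ f_url(IT)  # UT = UI . f_url(IT)
```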
4. An Empirical Study

The datasets used in our study were crawled from Delicious, the largest social bookmarking site, from June 2008 to December 2008. Two raw datasets consisting of bookmarking data from 5000 users each were used for our reported experiments. To reduce the size of the raw data, we removed users that had posted fewer than 15 URLs and items that had been saved by fewer than 15 users. To create an experimental testbed that can highlight differences between recommendation algorithms, we also removed the URLs that had been bookmarked by more than 75 users, which accounted for about 1-2% of all URLs. This removal is based on the observation that recommending extremely popular items is very likely to be correct, which makes algorithm comparison less sensitive. After these preprocessing steps, two datasets, named Small and Large, were obtained. Table 1 provides key statistics of these two datasets.

Table 1. Dataset characteristics

                                     Small    Large
  Number of users                      485     1097
  Number of URLs                       999     1872
  Number of tags                     11307     9608
  Number of transactions             24612    44599
  Density level (%)                   5.08     2.17
  Average number of URLs per user    50.75    40.66
  Average number of users per URL    24.64    23.82

We randomly divided each of the datasets into a training set and a test set. The split was based on a training-testing ratio between 20% and 80% and was done at the user level. In the prediction phase, all methods recommended 5 items for each user, and these were then compared with the items in the test set. The evaluation metrics adopted in our experiments were the commonly used ones for ranked-list prediction, including precision, recall, F-measure, and rank score.

Twelve algorithms were evaluated in our experiments. The RAND algorithm generates random recommendations for every user, while the POP algorithm recommends the most popular items to each user. The user-based (UB) and item-based (IB) algorithms, as well as their variants, namely tagging user-based (TUB) and tagging item-based (TIB) (Peng et al. 2009), were implemented as the baselines. In addition to the topic-based (TB) method, the SVD dimensionality reduction method (SVD), which finds hidden structure using singular value decomposition, and the fusion algorithm (FUS), which extends user/item profiles with tags, are also included for comparison. Furthermore, to investigate the benefits of tag generalization, we integrate it into the TB and SB algorithms and rename them GTB and GSB, respectively. The experiment was rerun 10 times and the final results are averaged over all runs. Table 2 summarizes the results on both datasets.

Table 2. Experimental results on the Small and Large datasets

  Dataset  Algorithm  Precision (%)  Recall (%)  F-measure (%)  Rank Score
  Small    RAND            4.27         0.43          0.79           4.27
           POP             9.81         1.00          1.81           9.84
           UB             22.00         2.23          4.05          22.18
           IB             13.52         1.37          2.49          13.52
           SVD            19.10         1.93          3.51          19.25
           TUB            21.02         2.14          3.88          21.20
           TIB            22.31         2.27          4.12          22.27
           FUS            24.68         2.51          4.55          24.76
           TB             22.99         2.34          4.24          23.15
           SB             23.87         2.42          4.40          23.98
           GTB            28.34         2.88          5.22          28.50
           GSB            30.00         3.05          5.53          30.28
  Large    RAND            1.74         0.22          0.39           1.73
           POP             3.82         0.49          0.86           3.82
           UB              4.81         0.61          1.08           4.85
           IB              2.75         0.35          0.62           2.76
           SVD             5.15         0.65          1.16           5.17
           TUB             7.38         0.94          1.66           7.45
           TIB             7.06         0.89          1.59           7.06
           FUS             7.71         0.98          1.73           7.75
           TB              7.62         0.96          1.71           7.67
           SB              7.80         0.99          1.75           7.87
           GTB             7.89         1.00          1.77           7.99
           GSB             8.33         1.05          1.87           8.39
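For reference, the precision, recall, and F-measure columns above follow standard top-n definitions; a minimal sketch of the per-user computation (the rank score formula is not given in this section and is therefore omitted):

```python
# Standard top-n list metrics for one user: the 5 recommended items are
# compared against that user's held-out test items, as in the protocol above.
def precision_recall_f1(recommended, test_items):
    hits = len(set(recommended) & set(test_items))
    p = hits / len(recommended)
    r = hits / len(test_items) if test_items else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

print(precision_recall_f1(["a", "b", "c", "d", "e"], ["b", "e", "x"]))
# approximately (0.4, 0.667, 0.5)
```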