Reasonable Tag-Based Collaborative Filtering For Social Tagging Systems Reyn Y Nakamoto Shinsuke Nakajima Jun Miyazaki Nara Institute of science and Nihonbashi Hakozaki Building Motoyama, Kamigamo, Kita Technology Kyoto, Japan, 603-8555 Hakozaki 24-1, Nihonbashi nakajima@ cse. kyoto-su ac jp Nara, Japan, 630-0192 Tokyo, Japan 103-0015 miyazaki@ is naist. jp reyn@ kizasi jp Shunsuke uemura Hirokazu Kato Youichi Inagaki lara Sangyo University Nara institute of science and kizasi Company, Inc. Tateno-Kita 3-12-1, Sango Technolog Nihonbashi Hakozaki Building Ikoma Takayama 8916-5, Ikoma Nara, Japan, 636-8503 Nara, Japan, 630-0192 Hakozaki 24-1 Nihonbashi uemurashunsuke@ nara. su ac jp kato@ is naist. jp Tokyo, Japan 103-0015 inagaki@ kizasi jp ABSTRACT 1. INTRODUCTION In this paper, we present a tag-based collaborative filter As the Internet continues to mature and becomes more ing recommendation method for use with recently popular accessible to the common user, the amount of information online social tagging systems. Combining the information available increases exponentially. Accordingly, finding useful provided by tagging systems with the effective recommen- and relevant information is becoming progressively difficult dation abilities given by collaborative filtering we provide Moreover, a lot of the information available-blogs, various a website recommendation system which provides relevant types of reviews, and so forth-are highly subjective and thus credible recommendations that match the user's changing hard to evaluate purely through content-based algorithms. interests as well as the user's bookmarking profile. Based For such subjective sources, one person may enjoy some- upon user testing, our system provides a higher level of rel- thing while the next may dislike the same-everyone has dif- evant recommendations over other commonly used search fering opinions on what is good. In these cases, people-more and recommendation methods. We describe this system as so than the current ability of content-based algorithms-are well as the relevant user testing results and its implication greatly effective in evaluating and filtering this information towards use in online social tagging systems. For this reason. Tag-based Contextual Collaborative Fil- Categories and Subject descriptors found to be effective in providing website recommendations for social bookmarking systems 7. This method combines H 4m Information Systems Applications]: Miscella- the strengths of both Collaborative Filtering (CF)as well neous as tagging information from social bookmarking systems to provide effective, personalized recommendations to the user. General terms By using CF techniques, we can match users with similar Algorithms, Experimentation preferences. By employing tagging, we can match only the Keywords Both methods employ users to organize and evaluate infor- mation, making them a good fit for each other. With strong collaborative filtering, so user involvement. there is higher confidence in the credibility as well as the quality of the information recommendations. We now extend our algorithm one step further to provide implicit recommendations by considering the user's current interest. Our new method-called Reasonable Tag-Based Collaborative Filtering or RCF-takes into account the user bookmarking profile and the currently viewed page to pro- Permission to make digital or hard copies of all or part of this work for vide relevant website recommendations. We assume that or classroom use is granted without fee provided that copies are these tags are synonymous with the reasons why they liked not made or distributed for profit or commercial advantage and that copies something, and thus, we derived the name for our algorithm bear this notice and the full citation on the first page. To copy otherwise, to Reasonable Tag-Based Collaborative Filtering. In this paper, we explore the effectiveness of our algorithm wICOw08, October 30, 2008, Napa Valley, California, USA. d perform user testing as well. In comparison to the other Copyright2008ACM978-1-605582597/08/10…5 recommendation methods we tested. our method was found
Reasonable Tag-Based Collaborative Filtering For Social Tagging Systems Reyn Y. Nakamoto kizasi Company, Inc. Nihonbashi Hakozaki Building 3F Hakozaki 24-1, Nihonbashi Tokyo, Japan 103-0015 reyn@kizasi.jp Shinsuke Nakajima Kyoto Sangyo University Motoyama, Kamigamo, Kita Kyoto, Japan, 603-8555 nakajima@cse.kyoto-su.ac.jp Jun Miyazaki Nara Institute of Science and Technology Takayama 8916-5, Ikoma Nara, Japan, 630-0192 miyazaki@is.naist.jp Shunsuke Uemura Nara Sangyo University Tateno-Kita 3-12-1, Sango, Ikoma Nara, Japan, 636-8503 uemurashunsuke@nara.su.ac.jp Hirokazu Kato Nara Institute of Science and Technology Takayama 8916-5, Ikoma Nara, Japan, 630-0192 kato@is.naist.jp Youichi Inagaki kizasi Company, Inc. Nihonbashi Hakozaki Building 3F Hakozaki 24-1, Nihonbashi Tokyo, Japan 103-0015 inagaki@kizasi.jp ABSTRACT In this paper, we present a tag-based collaborative filtering recommendation method for use with recently popular online social tagging systems. Combining the information provided by tagging systems with the effective recommendation abilities given by collaborative filtering, we provide a website recommendation system which provides relevant, credible recommendations that match the user’s changing interests as well as the user’s bookmarking profile. Based upon user testing, our system provides a higher level of relevant recommendations over other commonly used search and recommendation methods. We describe this system as well as the relevant user testing results and its implication towards use in online social tagging systems. Categories and Subject Descriptors H.4.m [Information Systems Applications]: Miscellaneous General Terms Algorithms, Experimentation Keywords collaborative filtering, social tagging, recommendation systems Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WICOW’08, October 30, 2008, Napa Valley, California, USA. Copyright 2008 ACM 978-1-60558-259-7/08/10 ...$5.00. 1. INTRODUCTION As the Internet continues to mature and becomes more accessible to the common user, the amount of information available increases exponentially. Accordingly, finding useful and relevant information is becoming progressively difficult. Moreover, a lot of the information available–blogs, various types of reviews, and so forth–are highly subjective and thus, hard to evaluate purely through content-based algorithms. For such subjective sources, one person may enjoy something while the next may dislike the same–everyone has differing opinions on what is good. In these cases, people–more so than the current ability of content-based algorithms–are greatly effective in evaluating and filtering this information. For this reason, Tag-based Contextual Collaborative Filtering (TCCF) was proposed and described in [8]. It was found to be effective in providing website recommendations for social bookmarking systems [7]. This method combines the strengths of both Collaborative Filtering (CF) as well as tagging information from social bookmarking systems to provide effective, personalized recommendations to the user. By using CF techniques, we can match users with similar preferences. By employing tagging, we can match only the users that liked the same information for the same reasons. Both methods employ users to organize and evaluate information, making them a good fit for each other. With strong user involvement, there is higher confidence in the credibility as well as the quality of the information recommendations. We now extend our algorithm one step further to provide implicit recommendations by considering the user’s current interest. Our new method–called Reasonable Tag-Based Collaborative Filtering or RCF–takes into account the user’s bookmarking profile and the currently viewed page to provide relevant website recommendations. We assume that these tags are synonymous with the reasons why they liked something, and thus, we derived the name for our algorithm, Reasonable Tag-Based Collaborative Filtering. In this paper, we explore the effectiveness of our algorithm and perform user testing as well. In comparison to the other recommendation methods we tested, our method was found 11
to be the most effective for implicit live-updating recommen- steps to get satisfactory cluster sizes and document classi- dations. We will describe these results and its corresponding fication. Lastly, they explored creating a personalized tag implications search engine based upon their findings 2. RELATED WORK 2.3 TCCF Website Recommendation System TCCF is an website recommendation algorithm which com- 2.1 Collaborative Filtering Systems bines traditional CF systems and social tagging systems Collaborative Filtering(CF) is a process that uses the The essential idea is that CF provides personalization, and community of users to sort out relevant or important infor- tags provide clues of the reasons why users liked something ation from the non-relevant and non-important informa Unlike traditional CF models which use numeric ratings, the tion. The process is based upon the idea that users should TCCF model uses tags for calculating user similarity and enjoy the same items that their similar users like. From a score prediction. wider perspective, it is the idea of collaborating with other In a social bookmarking service, the act of bookmark- people and sharing knowledge to filter out the best infor- ing a website is a strong indicator of whether something is mation from the rest. Lastly, cF provides a high level of liked. Additionally, we also have tagging information. USu- information credibility since several users are evaluating and rating such information users perspective, and in most cases this is the reason why CF has been proven to work well under certain domains- they liked something. Thus, the key difference between tra- mainly entertainment domains-such as usenet recommenda ditional CF and TCCF: In addition to considering whether tions 9), movie recommendations 5l, product recommenda- users like something by using the tagging information rely f numerical ratings on resources by users 9. Once enough In the TCCF Website Recommendation System, users ratings are in place, similarity scores between users are cal bookmark websites they like using tags, and subsequently is then calculated using the similar users' ratings on other tags. Once the the user has added enough bookmarks, the items. Those resources with a score prediction above a cer- tain threshold are recommended The main drawback of CF is that it only considers if and lar user has rated. The score prediction value is basically the how much a user likes something. However, it does I level that the system thinks the user will like something-the into account the reasons why a user likes something igher score. the more the user should like the website. both of these steps consider bookmarking as well as the attached 2.2 Social Tagging Systems tags when generating recommendation candidates. In this paper, we extend this method and describe it in section 3 ther terms such as metadata, categorization, labels, and so forth. Tagging is the process of attaching natural language 3. REASONABLE TAG-BASED words as metadata to describe some resource like a movie COLLABORATIVE FILTERING FOR photo, book, etc. Tagging vocabulary can be controlled uncontrolled depending on the type and purpose of the ap SOCIAL TAGGING SYSTEMS We now explain our recommendation algorithm and its In recent years, the advent of Social Tagging Systems has use in our recommendation system. The basis of our system brought tagging back into the limelight. Currently, there is a website bookmarking system similar to del icio us. A are several popular online social tagging systems which are sample usage pattern would be as shown in figure 1 the subject of continuing research: they range from website Here, an arrow indicates a user has bookmarked a page bookmarking such as delicio us 3, photo sharing 4, re- and the tags above the arrow indicate the tags used while search paper searching [2 and so forth. All of these sites bookmarking. User A is bookmarking website I with the use tagging for many purposes, but in addition to that, they tags japaneseand'dictionary'. Similarly, user B is book focus on the social networking aspects of tagging to enhance marking website 2 with the tags and"news. Based the experience for end users. However, currently, tags are upon this type of social bookmarking system, our system only used for tag searching-user profile matching and sub- generates effective recommendations based upon the user's sequent recommendations are yet to be implemented. As bookmarking profile as well as their currently viewed page mentioned before, tags provide the clues as to why a user Since our recommendation algorithm is heavily based upon liked something. They are the who, what, when, where, and comparing websites using their topics, we first explain how why of the user's reason for tagging something [6. Because we generate topic domain vectors of this, as well as the similar use of social networ king,so- cial tagging systems provide an ideal choice for com 3.1 Generating Topic Domain Vectors with CF systems iteration of our algorithm, similarity cal- of the research for tagging systems has culation between websites was based on the cosine of the cused on user and tagging patterns. [10 uses a statistica website tag vectors. This tag vector was a feature vector approach towards deriving the emergent semantics of social where the parameters consisted of all the tags attached by tagging systems, namely delicio us. They use the Expec. users. For example, in figure 1, website 1's vector would tation Maximization algorithm(EM) to automatically clus. have a value of one for japanese'and'dictionary'. Web- ter documents, users, and tags. They explored the optimal site 2 would have values of two, two, one, one, and one for umber of domain clusters as well as the number of iterative apple, 'news, "games, 'reviews,, and'tech
to be the most effective for implicit live-updating recommendations. We will describe these results and its corresponding implications. 2. RELATED WORK 2.1 Collaborative Filtering Systems Collaborative Filtering (CF) is a process that uses the community of users to sort out relevant or important information from the non-relevant and non-important information. The process is based upon the idea that users should enjoy the same items that their similar users like. From a wider perspective, it is the idea of collaborating with other people and sharing knowledge to filter out the best information from the rest. Lastly, CF provides a high level of information credibility since several users are evaluating and rating such information. CF has been proven to work well under certain domains– mainly entertainment domains–such as usenet recommendations [9], movie recommendations [5], product recommendations [1], and so forth. Many CF systems rely upon a matrix of numerical ratings on resources by users [9]. Once enough ratings are in place, similarity scores between users are calculated. Based upon these similarity scores, score prediction is then calculated using the similar users’ ratings on other items. Those resources with a score prediction above a certain threshold are recommended. The main drawback of CF is that it only considers if and how much a user likes something. However, it does not take into account the reasons why a user likes something. 2.2 Social Tagging Systems Tagging has been around for sometime, albeit known by other terms such as metadata, categorization, labels, and so forth. Tagging is the process of attaching natural language words as metadata to describe some resource like a movie, photo, book, etc. Tagging vocabulary can be controlled or uncontrolled depending on the type and purpose of the application. In recent years, the advent of Social Tagging Systems has brought tagging back into the limelight. Currently, there are several popular online social tagging systems which are the subject of continuing research: they range from website bookmarking such as del.icio.us [3], photo sharing [4], research paper searching [2] and so forth. All of these sites use tagging for many purposes, but in addition to that, they focus on the social networking aspects of tagging to enhance the experience for end users. However, currently, tags are only used for tag searching–user profile matching and subsequent recommendations are yet to be implemented. As mentioned before, tags provide the clues as to why a user liked something. They are the who, what, when, where, and why of the user’s reason for tagging something [6]. Because of this, as well as the similar use of social networking, social tagging systems provide an ideal choice for combination with CF systems. Much of the research for tagging systems has been focused on user and tagging patterns. [10] uses a statistical approach towards deriving the emergent semantics of social tagging systems, namely del.icio.us. They use the Expectation Maximization algorithm (EM) to automatically cluster documents, users, and tags. They explored the optimal number of domain clusters as well as the number of iterative steps to get satisfactory cluster sizes and document classi- fication. Lastly, they explored creating a personalized tag search engine based upon their findings. 2.3 TCCF Website Recommendation System TCCF is an website recommendation algorithm which combines traditional CF systems and social tagging systems. The essential idea is that CF provides personalization, and tags provide clues of the reasons why users liked something. Unlike traditional CF models which use numeric ratings, the TCCF model uses tags for calculating user similarity and score prediction. In a social bookmarking service, the act of bookmarking a website is a strong indicator of whether something is liked. Additionally, we also have tagging information. Usually, the user will use tags to describe the resource from the user’s perspective, and in most cases this is the reason why they liked something. Thus, the key difference between traditional CF and TCCF: In addition to considering whether or not a user likes a resource, it also takes into account why users like something by using the tagging information . In the TCCF Website Recommendation System, users bookmark websites they like using tags, and subsequently, they can easily retrieve their bookmarks by searching with tags. Once the the user has added enough bookmarks, the first step is finding similar users. This is then followed by calculating a score prediction for every website that the similar user has rated. The score prediction value is basically the level that the system thinks the user will like something–the higher score, the more the user should like the website. Both of these steps consider bookmarking as well as the attached tags when generating recommendation candidates. In this paper, we extend this method and describe it in section 3. 3. REASONABLE TAG-BASED COLLABORATIVE FILTERING FOR SOCIAL TAGGING SYSTEMS We now explain our recommendation algorithm and its use in our recommendation system. The basis of our system is a website bookmarking system similar to del.icio.us. A sample usage pattern would be as shown in figure 1. Here, an arrow indicates a user has bookmarked a page and the tags above the arrow indicate the tags used while bookmarking. User A is bookmarking website 1 with the tags ‘japanese’ and ‘dictionary’. Similarly, user B is bookmarking website 2 with the tags ‘apple’ and ‘news’. Based upon this type of social bookmarking system, our system generates effective recommendations based upon the user’s bookmarking profile as well as their currently viewed page. Since our recommendation algorithm is heavily based upon comparing websites using their topics, we first explain how we generate topic domain vectors. 3.1 Generating Topic Domain Vectors In our previous iteration of our algorithm, similarity calculation between websites was based on the cosine of the website tag vectors. This tag vector was a feature vector where the parameters consisted of all the tags attached by users. For example, in figure 1, website 1’s vector would have a value of one for ‘japanese’ and ‘dictionary’. Website 2 would have values of two, two, one, one, and one for ‘apple’, ‘news’, ‘games, ‘reviews’, and ‘tech’. 12
Users websites In this case. n is the number of distinct tags attached to ebsite k by all users. P(tilD,)is the conditional proba a japanese, dictionary ility of tag term i(shown as ti) given domain j(shown as Di). tfi k idfi is as calculated by standard tf-idf, where tfi k is the number of times any user bookmarked website k with tag i divided by the total number of times that the apple, news website is bookmarked. Similarly, the idfi is determined by the total number of websites in the entire set divided by the number of websites bookmarked with tag i. So, instead of the website vector consisting of tag term counts, a websites domain vector parameters are now its topic domain weight to that particu- lar topic domain. An example calculation is shown in figure p(t D softwar Figure 1: Website Bookmarking System Overview This approach has the inherent problems that are associ- ted with natural language processing. For ex dle. vecto containing two synonyms, such asfunny'or humourous, DVk=(0.86,0.02,…) should be similar; however, when comparing only the base tag, these synonyms are considered as different terms, and thus, their vector cosine value will be low To deal with this Figure 2: Calculation of k apple coms website we applied a similar approach to that of [10], by using the topic domain vector EM algorithm to cluster the websites into topic domain Once this was done, we compared the websites based upon their topic vector rather than their tag vector In this case, apple. com has three tags attached by all users: Our data consists of three months worth of mining of apple,, 'mac', and 'software. Using example domain term delicio us RSS feeds. In total, we have over 100.000 users probability values, we show the calculations for hypothetical domains 1 and 2 for a website k 2.5 million webpages, 3.6 million bookmarks, and 870, 000 istinct tags. After mining the data, we subsequently stemmed Similarly, a bookmark topic domain vector created from r a bookmarking website k, DVA-k, would be calculated the tags using the well-known Porter stemming algorithm Then, the number of unique stemmed tags was about 780,000 the same fashion as shown in equation 3 Over this data, we ran the EM algorithm for 50 iterations DVA→k=(dm,A-k,du2,A→k,dtr3,A-k,…d100,A-k)(3) over 20,000 of the top bookmarked documents and the top 20,000 used tags and clustered them into 100 topic domains. The only exception is that the domain tag weight only The number of topic domains was chosen after a subjective considers the tags that user A used on website k as shown comparison of topic separation they produced. Finding the optimal number of topic domains is described in [10] and is beyond the scope of our research. d1,A-k=∑P(tl|D)·id4f,A→k Based on these topic domain clusters, website feature vec- tors were created for each document as shown in equation Since a user can only apply atmost one of the same tag term 1. We will from now refer to these as topic domain s to a bookmark, only the idf is used here. and abbreviate it 3.2 RCF Recommendations DVk =(dwi. k, dw2, k, dws, k., da100k) We now describe our Reasonable Tag-based Collaborative Here, dwj, k is the topic domain weight of a website k for Filtering(RCF) algorithm. The process to generate recom- a topic domain j. In other words, this is how strongly the mendations follows these steps: document belongs in a given topic. This is calculated by summing the product of the conditional probability and tag term weight for all tags attached to website k. This is cal 2. Finding recommendation candidates culated as shown in equation 2 3. Providing live-updating recommendations based upon dn,k=∑P(tlD)tkin We now describe these steps
A B C D 1 2 3 4 japanese, dictionary apple, news mac, rumors tennis games, reviews apple, tech, news tennis, sports Users Websites Figure 1: Website Bookmarking System Overview This approach has the inherent problems that are associated with natural language processing. For example, vectors containing two synonyms, such as ‘funny’ or ‘humourous’, should be similar; however, when comparing only the base tag, these synonyms are considered as different terms, and thus, their vector cosine value will be low. To deal with this, we applied a similar approach to that of [10], by using the EM algorithm to cluster the websites into topic domains. Once this was done, we compared the websites based upon their topic vector rather than their tag vector. Our data consists of three months worth of mining of del.icio.us RSS feeds. In total, we have over 100,000 users, 2.5 million webpages, 3.6 million bookmarks, and 870,000 distinct tags. After mining the data, we subsequently stemmed the tags using the well-known Porter stemming algorithm. Then, the number of unique stemmed tags was about 780,000. Over this data, we ran the EM algorithm for 50 iterations over 20,000 of the top bookmarked documents and the top 20,000 used tags and clustered them into 100 topic domains. The number of topic domains was chosen after a subjective comparison of topic separation they produced. Finding the optimal number of topic domains is described in [10] and is beyond the scope of our research. Based on these topic domain clusters, website feature vectors were created for each document as shown in equation 1. We will from now refer to these as ‘topic domain vectors’ and abbreviate it as DV . DVk = (dw1,k, dw2,k, dw3,k..., dw100,k) (1) Here, dwj,k is the topic domain weight of a website k for a topic domain j. In other words, this is how strongly the document belongs in a given topic. This is calculated by summing the product of the conditional probability and tag term weight for all tags attached to website k. This is calculated as shown in equation 2. dwj,k = Xn i=0 P(ti|Dj ) · tfi,k · idfi (2) In this case, n is the number of distinct tags attached to website k by all users. P(ti|Dj ) is the conditional probability of tag term i (shown as ti) given domain j (shown as Dj ). tfi,k · idfi is as calculated by standard tf-idf, where tfi,k is the number of times any user bookmarked website k with tag i divided by the total number of times that the website is bookmarked. Similarly, the idfi is determined by the total number of websites in the entire set divided by the number of websites bookmarked with tag i. So, instead of the website vector consisting of tag term counts, a website’s domain vector parameters are now its topic domain weight value, i.e. how much the document belongs to that particular topic domain. An example calculation is shown in figure 2. apple mac software website k's attached tags D mac (0.69) software (0.68) osx (0.66) tool (0.59) apple (0.58) D program(0.78) develop(0.56) refer(0.52) code(0.34) software(0.29) +0.58 * 0.81 +0.69 * 0.51 +0.58 * 0.07 tf * idf apple (0.81) mac (0.51) software (0.07) +0.29 * 0.07 dw = 0.86 1,k dw = 0.02 2,k DVk = (0.86, 0.02, ...) i 2 1 p(t |D )j Figure 2: Calculation of k = apple.com’s website topic domain vector In this case, apple.com has three tags attached by all users: ‘apple’, ‘mac’, and ‘software’. Using example domain term probability values, we show the calculations for hypothetical domains 1 and 2 for a website k. Similarly, a bookmark topic domain vector created from user A bookmarking website k, DVA→k, would be calculated in the same fashion as shown in equation 3. DVA→k = (dw1,A→k, dw2,A→k, dw3,A→k, ...dw100,A→k) (3) The only exception is that the domain tag weight only considers the tags that user A used on website k as shown in equation 4. dwj,A→k = Xn i=0 P(ti|Dj ) · idfi,A→k (4) Since a user can only apply atmost one of the same tag term to a bookmark, only the idf is used here. 3.2 RCF Recommendations We now describe our Reasonable Tag-based Collaborative Filtering (RCF) algorithm. The process to generate recommendations follows these steps: 1. Finding similar users. 2. Finding recommendation candidates. 3. Providing live-updating recommendations based upon the user’s current interest. We now describe these steps. 13
3.2.1 Finding Similar Users website liked. This is especially important when considering We start with finding similar users. Similar larger sites, such as Yahoo! or Slashdot, because they tend tional collaborative filtering, it is based upon to cover many topics, and thus, users may like the same site liked items-in this case, commonly bookmarked However, we also consider the used tags-which o be the reason why a user liked a resource. For example 3.2.2 Finding Recommendation Candidates from our previously shown system in figure 1, suppose we Next, we find recommendation candidates by calculating want to find similar users for user B. It would be calculated core prediction-the score which represents how much the s shown in figure 3. ke a resource. Again, similar to traditiona collaborative filtering we recommend websites that the sim- ilar users liked, except that now we also add the topic domain that the original user and the similar user have a common interest on. The goal being to recommend websites that match the topic that the two users commonly apple, news 2 For example, in the system shown in figure 1, we will find recommendation candidates for user B(user B has a sim- ilar user C-shown in figure 4). User B and C both liked website 2 for the reason that the website is about apple technology news. Thus, we find websites that match this topic. In this case, similar user C has two other bookmarks. Since only website 3s domain vector matches the commonly bookmarked website 2s domain vector-indicated by the dot. ted arrow-it has a high score prediction, and thus, becomes Figure 3: Example User Similarity Calculation a recommendation candidate(denoted by a star). On the other hand. C's other bookmark. website 4 does not match the commonly bookmarked website's topic domain-marked Since user B and C have bookmarked the same website by the crossed-out dotted line-and thus, its score prediction with similar tags-indicated by the dotted arrow-their simi- is lower larity score is higher. In this case, C becomes a similar user, indicated by the star. On the other hand, users B and A have bookmarked the same site, but they do not use similar target website, x tags-indicated by the crossed-out dotted arrow-and thus common bookmark. k their similarity score is lower. User similarity for a user A and a user B is calculated as apple, tech, I simre,(A,B)=a.-2Isim(DVA-k, DVB-k)) Figure 4: Example score prediction calculation Here, user similarity is the average of the cosine of A and B's domain vectors for each commonly bookmarked website, k In other words, if they bookmarked the same websites with The score prediction algorithm is shown in equation 7 similar tags, we assume that they liked the website for the Here, we are finding the score prediction for a target website same reasons, and thus, they are similar users. The first half r for a user A. We first take all bookmarks from all similar of the equation calculates the similarity between the topic users Sk with a user similarity above a certain user simi- domain vectors of each common bookmark for both users. larity threshold-in our case, 0.75. We then compare each The second half of the equation represents the number of of those bookmark domain vector DVsk-r with the book common bookmarks and increases as the number of common mark domain vector on the commonly bookmarked website bookmarks goes k, DVS,=k. We average these scores from all users and Also n is the number of bookmarks that user A and user then those webpages with scores above a certain threshold- main vector comparison, which for our experiments was set ditionally, we record the association between each commonly to 0.9. Additionally, sim(DVA-k, DVB-k)is the cosine of bookmarked webpage k and its generated recommendations user A's bookmark domain vector on website k and user B's This is used in the final score calculation described in the bookmark domain vector on website k as shown in equation following section Also, a is the weight given to the domain vector compar- ison, and in this case, it is 0.90. The left side calculates stm(DVA→k A→k‖!DVB (6) the similarity between the topic domain vectors of the com- mon bookmark and the target bookmark, and the right side Different from traditional CF, users must use similar tags of the equation represents the number of similar users that on the common website in addition to having the common have bookmarked the target website aT
3.2.1 Finding Similar Users We start with finding similar users. Similar to traditional collaborative filtering, it is based upon commonly liked items–in this case, commonly bookmarked websites. However, we also consider the used tags–which we assume to be the reason why a user liked a resource. For example, from our previously shown system in figure 1, suppose we want to find similar users for user B. It would be calculated as shown in figure 3. B C 2 apple, tech, news apple, news A games, reviews common bookmark, k Figure 3: Example User Similarity Calculation Since user B and C have bookmarked the same website with similar tags–indicated by the dotted arrow–their similarity score is higher. In this case, C becomes a similar user, indicated by the star. On the other hand, users B and A have bookmarked the same site, but they do not use similar tags–indicated by the crossed-out dotted arrow–and thus, their similarity score is lower. User similarity for a user A and a user B is calculated as shown in equation 5. simrcf (A, B) = α · 1 n Xn 1 {sim(DVA→k, DVB→k)} + (1 − α) · log2(1 + n) (5) Here, user similarity is the average of the cosine of A and B’s domain vectors for each commonly bookmarked website, k. In other words, if they bookmarked the same websites with similar tags, we assume that they liked the website for the same reasons, and thus, they are similar users. The first half of the equation calculates the similarity between the topic domain vectors of each common bookmark for both users. The second half of the equation represents the number of common bookmarks and increases as the number of common bookmarks goes up. Also, n is the number of bookmarks that user A and user B have in common. α is the weight given to the topic domain vector comparison, which for our experiments was set to 0.9. Additionally, sim(DVA→k, DVB→k) is the cosine of user A’s bookmark domain vector on website k and user B’s bookmark domain vector on website k as shown in equation 6. sim(DVA→k, DVB→k) = DVA→k · DVA→k |DVA→k||DVB→k| (6) Different from traditional CF, users must use similar tags on the common website in addition to having the common website liked. This is especially important when considering larger sites, such as Yahoo! or Slashdot, because they tend to cover many topics, and thus, users may like the same site for different reasons. 3.2.2 Finding Recommendation Candidates Next, we find recommendation candidates by calculating score prediction–the score which represents how much the user should like a resource. Again, similar to traditional collaborative filtering, we recommend websites that the similar users liked, except that now we also additionally match the topic domain that the original user and the similar user have a common interest on. The goal being to recommend websites that match the topic that the two users commonly liked. For example, in the system shown in figure 1, we will find recommendation candidates for user B (user B has a similar user C–shown in figure 4). User B and C both liked website 2 for the reason that the website is about apple technology news. Thus, we find websites that match this topic. In this case, similar user C has two other bookmarks. Since only website 3’s domain vector matches the commonly bookmarked website 2’s domain vector–indicated by the dotted arrow–it has a high score prediction, and thus, becomes a recommendation candidate (denoted by a star). On the other hand, C’s other bookmark, website 4, does not match the commonly bookmarked website’s topic domain–marked by the crossed-out dotted line–and thus, its score prediction is lower. 2 C 3 4 mac, rumors apple, tech, news tennis, sports common bookmark, k target website, x Figure 4: Example score prediction calculation The score prediction algorithm is shown in equation 7. Here, we are finding the score prediction for a target website x for a user A. We first take all bookmarks from all similar users Sk with a user similarity above a certain user similarity threshold–in our case, 0.75. We then compare each of those bookmark domain vector DVSk→x with the bookmark domain vector on the commonly bookmarked website k, DVSk→k. We average these scores from all users and then those webpages with scores above a certain threshold– 0.50 in our tests–become recommendation candidates. Additionally, we record the association between each commonly bookmarked webpage k and its generated recommendations. This is used in the final score calculation described in the following section. Also, α is the weight given to the domain vector comparison, and in this case, it is 0.90. The left side calculates the similarity between the topic domain vectors of the common bookmark and the target bookmark, and the right side of the equation represents the number of similar users that have bookmarked the target website x. 14
corepred(a, r)=a h-1[simrey(A, Sk).sim(DVSK-k, DVsk -1) +(1-a).log2(1+2Isimrer(A, Sk))(7) ∑k=1{ sTore/(A,Sk) all users attached tags common bookmark, k recommendation candidates.x pple, mac, software mac. rumors c tennis × Figure 5: Example Final Score Calculation 3.2.3 Providing Live-Updating Recommendations higher than a threshold of 0.50 are shown to the user in the Based Upon User's Current nterest descending order of the score The last step in the process of generating recommenda tions is determining which of the recommendation candi- 3.3 Live-Updating Website recommendation dates match the current interest. In our case. we assume Prototype System the user's current interest to be the topic of the website Using our RCF algorith, we created a prototy they are presently viewing. We then generate the topic do- main vector for the current ly viewed website by using the mendation system. It was created to provide live-updating ags from all users bookmarks from our mined data set website recommendations that automatically update based on the page that they are currently viewing. Thus, it was For example, in the system shown in figure 1, we provide designed to be viewable during the user's normal browsing recommendations for user B as shown in figure 5 Here, B is looking at a site which has the tags apple, activity. It is currently implemented as a Firefox sidebar mac, and software attached to it by bookmarks from all 6. While the user is browsing, the recommendations are ers. B also has two recommendation candidates, website updated whenever the viewed website changes, thereby pro- ystem), which were generated from section 3. 2.2. Website 3 viding recommendations relevant to the user's current inter est. Here, the generated recommendations are shown in the has been associated with the commonly bookmarked website sidebar on the left of the image. If they feel the recommen- 2. and website 6 has been associated with website 5. W then compare the user's currently viewed webpage's domain dations are useful, they can click on it and the browser will vector to all of B,s bookmarks. If we find one that is above be redirected to the recommended website a certain threshold. we take the recommendation candidates that are attached to it and calculate its final score as shown e全a in equation 8. All websites above a certain threshold are then recommended to the end user In this case. since bs bookmark domain vector matches the current websites domain vector, its attached website Pary as 52 180or is used for final score calculation if its score is above a certain threshold. it is recommended. B also has an arbi. trary recommendation candidate, website 6. However, since Bs bookmark on website 5 does not match the currently viewed website. its associated recommendation candidate is not recommended at this time. if the current website were to change to something about sports or tennis, website 6 would be recommended if its final score was above the se. ected threshold In equation 8, user A is viewing webpage cur, k is the commonly bookmarked website, and z is a recommenda- tion candidate. Lastly, B is the weight given to the part of the equation which determines how similar the recommen Figure 6: System Interface dation candidate should be to the current page. For our experiments, this was set to 1. All resultant final scores
scorepred(A, x) = α · Pn k=1{simrcf (A, Sk) · sim(DVSk→k, DVSk→x)} Pn k=1{simrcf (A, Sk)} + (1 − α) · log2(1 + Xn k=1 {simrcf (A, Sk)} (7) B 2 C apple, news common bookmark, k 3 mac, rumors cur E 5 tennis, sports 6 tennis all users attached tags apple, mac, software recommendation candidates, x Figure 5: Example Final Score Calculation 3.2.3 Providing Live-Updating Recommendations Based Upon User’s Current Interest The last step in the process of generating recommendations is determining which of the recommendation candidates match the current interest. In our case, we assume the user’s current interest to be the topic of the website they are presently viewing. We then generate the topic domain vector for the currently viewed website by using the tags from all user’s bookmarks from our mined data set. For example, in the system shown in figure 1, we provide recommendations for user B as shown in figure 5. Here, B is looking at a site which has the tags ‘apple’, ‘mac’, and ‘software’ attached to it by bookmarks from all users. B also has two recommendation candidates, website 3 and 6 (6 is an arbitary webpage not shown in the example system), which were generated from section 3.2.2. Website 3 has been associated with the commonly bookmarked website 2, and website 6 has been associated with website 5. We then compare the user’s currently viewed webpage’s domain vector to all of B’s bookmarks. If we find one that is above a certain threshold, we take the recommendation candidates that are attached to it and calculate its final score as shown in equation 8. All websites above a certain threshold are then recommended to the end user. In this case, since B’s bookmark domain vector matches the current website’s domain vector, its attached website, 3, is used for final score calculation. If its score is above a certain threshold, it is recommended. B also has an arbitrary recommendation candidate, website 6. However, since B’s bookmark on website 5 does not match the currently viewed website, its associated recommendation candidate is not recommended at this time. If the current website were to change to something about sports or tennis, website 6 would be recommended if its final score was above the selected threshold. In equation 8, user A is viewing webpage cur, k is the commonly bookmarked website, and x is a recommendation candidate. Lastly, β is the weight given to the part of the equation which determines how similar the recommendation candidate should be to the current page. For our experiments, this was set to 1. All resultant final scores higher than a threshold of 0.50 are shown to the user in the descending order of the score. 3.3 Live-Updating Website Recommendation Prototype System Using our RCF algorithm, we created a prototype recommendation system. It was created to provide live-updating website recommendations that automatically update based on the page that they are currently viewing. Thus, it was designed to be viewable during the user’s normal browsing activity. It is currently implemented as a Firefox sidebar plugin. The recommendation interface is shown in figure 6. While the user is browsing, the recommendations are updated whenever the viewed website changes, thereby providing recommendations relevant to the user’s current interest. Here, the generated recommendations are shown in the sidebar on the left of the image. If they feel the recommendations are useful, they can click on it and the browser will be redirected to the recommended website. Figure 6: System Interface 15