Tag Recommendation for Folksonomies Oriented towards Individual Users

Marek Lipczak
Faculty of Computer Science, Dalhousie University, Halifax, Canada, B3H 1W5
lipczak@cs.dal.ca

Abstract. Tagging has become a standard way of organizing information on the Web, particularly in folksonomies, data repositories freely created by communities of users. A few tags attached to each resource create a bridge between heterogeneous data and users accustomed to keyword-based search and browsing. To establish this connection, tagging requires users to manually define tags for each resource they enter into the system. This potentially time-consuming step can be eased by tag recommender systems, which propose terms that users may choose to use as tags. This paper suggests and evaluates potential sources of recommended tags, focusing on folksonomies oriented towards individual users. These suggestions are used to propose a three-step tag recommendation system. Basic tags are extracted from the resource title. In the next step, the set of potential recommendations is extended by related tags proposed by a lexicon based on co-occurrences of tags within a resource's posts. Finally, tags are filtered by the user's personomy, a set of tags previously used by the user.

1 Introduction

Folksonomy services allow users to store and share various types of Internet resources. The content of folksonomies is completely defined by communities of their users. The large number of creators and resources pushes folksonomies away from the traditional hierarchical data structure design, based on directories created by system editors (e.g., Open Directory Project¹), towards tag-based taxonomies defined jointly by service users (e.g., BibSonomy², del.icio.us³, Flickr⁴, Technorati⁵). While adding a resource to the system, users are asked to define a set of tags: keywords which describe it and relate it to other resources gathered in the system.
To ease this process, some folksonomy services recommend a set of potentially matching tags. Proposing a tag recommendation system was a task of the ECML PKDD discovery challenge 2008⁶. This paper presents a tag recommendation system submitted to the challenge.

¹ http://www.dmoz.org/about.html
² http://bibsonomy.org/help/about/
³ http://del.icio.us/about/
⁴ http://flickr.com/about/
⁵ http://technorati.com/about/
⁶ http://www.kde.cs.uni-kassel.de/ws/rsdc08/
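The three steps named in the abstract (title extraction, extension by a co-occurrence lexicon, filtering by the personomy) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the system submitted to the challenge: the stopword list, the way the lexicon counts co-occurrences, the parameters `top_related` and `limit`, and the strict membership filter are all hypothetical choices made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; the source does not specify one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def title_tags(title):
    """Step 1: lower-case the title, keep alphanumeric words, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def build_lexicon(resource_posts):
    """Count how often two tags are attached to the same resource.

    resource_posts maps a resource id to a list of tag sets, one per post.
    Pooling all posts of a resource is an assumption of this sketch.
    """
    lexicon = defaultdict(Counter)
    for tag_sets in resource_posts.values():
        tags = set().union(*tag_sets)
        for tag in tags:
            for other in tags:
                if other != tag:
                    lexicon[tag][other] += 1
    return lexicon


def recommend(title, lexicon, personomy, top_related=3, limit=5):
    """Steps 2 and 3: extend title tags via the lexicon, then filter."""
    basic = title_tags(title)
    candidates = list(basic)
    for tag in basic:
        related = lexicon.get(tag, Counter()).most_common(top_related)
        for other, _count in related:
            if other not in candidates:
                candidates.append(other)
    # Assumed filter: keep only candidates the user has used before.
    return [tag for tag in candidates if tag in personomy][:limit]
```

With a lexicon built from a few posts, `recommend("Tagging in the Semantic Web", lexicon, personomy)` returns the title words and related tags that also occur in the given personomy, in the order they were generated.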
The formal definition of folksonomy can be found in [6]. A folksonomy is a collection of resources entered by users in posts. Each post consists of a resource and a set of tags attached to it by a user. Generally, the resource is specific to the user who added it to the system. However, for some types of resources (e.g., bookmarks) identical resources can be added to the system by different users. In the latter case, by the set of resource tags we denote all tags attached to a given resource by various users.

Folksonomies can be classified into two types based on the objective of the tagging process. The first type, represented by BibSonomy and del.icio.us, treats resources (e.g., personal bookmarks) as an individual property of a user. Here, the aim of tags is to create a repository tailored to individual user interests. In this paper, this type is referred to as folksonomies oriented towards individual users. The second type of folksonomies, represented by Flickr and Technorati, is a shared repository of public resources (e.g., blog entries). In this case, tags are added keeping in mind a broad audience that in the future would like to search for the resource. In this paper, this type is referred to as folksonomies oriented towards a broad audience. As the reason for tagging a resource is fundamentally different, we may expect that a tag recommendation system that suits one folksonomy type would be inappropriate for the other. This paper focuses on the first type, proposing a tag recommender for individual users.

2 Related work

The attention of researchers is mostly directed to tag recommendation systems for broad audience folksonomies. TagAssist [12] is a system designed to recommend tags for blog posts. The recommendation is built on tags previously attached to similar resources. Before that, meaning disambiguation is performed based on the co-occurrence of tags in the complete repository.
Co-occurrence of tags was also used by Sigurbjörnsson and van Zwol [11] to propose tags that complement user-defined tags of photographs in Flickr.

The problem of tag recommendation in folksonomies oriented towards individual users was addressed by Jäschke et al. [7]. They compared a number of recommendation techniques including collaborative filtering, PageRank, and its modification suited for folksonomies, FolkRank. The evaluation showed that the FolkRank-based recommender outperforms other approaches; however, the tests were performed on a dense core of the folksonomy and thus might not be representative.

Most tag recommendation systems are based on the tags that are already present in the system. An exception to this rule is the system presented by Lee and Chun [9]. The system recommends tags retrieved from the content of a blog, using an artificial neural network. The network is trained on statistical information about word frequencies and lexical information about word semantics extracted from WordNet.

Schmitz et al. [10] proposed association rule mining as a technique that might be useful in the tag recommendation process. The intuition behind this concept was also used in the system presented in this paper.
3 Examined dataset

All presented experiments and the evaluation of the proposed tag recommendation system were performed on a snapshot of BibSonomy [5] containing 2,570 users, 242,175 resources and 274,139 posts (after preprocessing). The snapshot was provided by the organizers of the ECML PKDD discovery challenge 2008. The preprocessing phase included removing useless tags (e.g., "system:unfiled"), changing all letters to lower case and removing non-alphabetical and non-numerical characters from tags.

The statistical characteristics of folksonomies have been the subject of many research publications [2, 3, 8, 11]. In the following sections I present experiments particularly important from the perspective of the tag recommendation task.

3.1 General characteristics

The frequency distribution of tags from the BibSonomy snapshot shows that mid- and low-frequency tags follow Zipf's distribution (Fig. 1). Zipf's distribution does not hold for high-frequency tags. The frequency distribution of tags from Flickr, which represents folksonomies oriented towards a broad audience, shows important differences [11]. Flickr's low-frequency tags do not follow Zipf's distribution. A possible explanation of this fact is a smaller number of user-specific tags in comparison to folksonomies oriented towards individual users. In addition, Flickr's high-frequency tags follow Zipf's distribution and are too general to be used as recommendations. The list of the most frequent tags from BibSonomy ("software", "web20", "tools", "web", "blog") shows that tag recommenders for folksonomies oriented towards individual users should not ignore high-frequency terms.

The difference between the two folksonomy types may have an impact on the efficiency of applied tag recommendation methods. A commonly used collaborative filtering approach is based on the intuition that the best recommendation consists of tags attached to the resource by people similar to the user.
This approach proved its quality in many recommendation systems; however, the intuition behind it can be deceiving. Folksonomies like BibSonomy or del.icio.us are mainly designed as a collection of repositories of individual users. By adding posts, each user defines his/her own set of used tags, a personomy [6], which describes the resources from the user's point of view. As a result, users addressing similar resources do not have to use similar tags, and similar personomies do not have to be associated with similarity in tagged resources. In fact, there is no such correlation in the processed BibSonomy snapshot. The cosine similarity between users calculated based on tags seems to be uncorrelated with that calculated based on resources (Fig. 2). In this situation, recommending tags assigned to a resource by similar users (collaborative filtering) should give results similar to recommending the tags frequently attached to the resource by any user. This conclusion seems to be confirmed by the experiment presented by Jäschke et al. [7]. Minding the limitations of the collaborative approach, I decided to focus on a tag space that is directly related to a post.
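The comparison of user similarities described above (tags with tf-idf weights, resources with binary weights) could be computed along the following lines. This is a minimal sketch, not the code used for the experiment: the exact tf-idf variant is not given in the text, so the sketch assumes tf is the number of times a user applied a tag and idf is the logarithm of the number of users divided by the number of users who applied the tag. The function names and the `(user, resource, tags)` post format are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def user_vectors(posts):
    """Build per-user tag vectors (tf-idf) and resource vectors (binary).

    posts is an iterable of (user, resource, tags) triples.
    """
    tag_counts = {}
    resource_vecs = {}
    for user, resource, tags in posts:
        tag_counts.setdefault(user, Counter()).update(tags)
        resource_vecs.setdefault(user, {})[resource] = 1.0
    n_users = len(tag_counts)
    # Document frequency: number of users who applied each tag.
    df = Counter(tag for counts in tag_counts.values() for tag in counts)
    tag_vecs = {
        user: {t: c * math.log(n_users / df[t]) for t, c in counts.items()}
        for user, counts in tag_counts.items()
    }
    return tag_vecs, resource_vecs


def similarity_pairs(posts):
    """Return (user_a, user_b, tag_similarity, resource_similarity) per pair."""
    tag_vecs, resource_vecs = user_vectors(posts)
    return [
        (a, b, cosine(tag_vecs[a], tag_vecs[b]),
         cosine(resource_vecs[a], resource_vecs[b]))
        for a, b in combinations(sorted(tag_vecs), 2)
    ]
```

Plotting the third value against the fourth for every pair gives a scatter plot of the kind summarized in Fig. 2. Note that enumerating all pairs is quadratic in the number of users, which is manageable for a few thousand users but would need sampling for larger folksonomies.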
Fig. 1. The overall frequency distribution of tags (after preprocessing and removing posts classified as imported). [Plot; axes: frequency and rank, both ticked from 1 to 100000 in powers of ten.]

Fig. 2. Cosine similarity between each pair of users calculated based on tags (tf-idf weights) and resources (binary weights). The two values seem to be independent.

3.2 Characteristics based on individual posts

Considering only the direct surroundings of the post, the potential tag recommendations can be obtained from the resource itself, the set of tags attached to the resource in previous posts, or the set of tags that were already used by the user (the user's personomy).

Exploiting tags from the resource depends on the character of the folksonomy. In BibSonomy the resource can be a BibTeX entry or a web-page bookmark. The first contains bibliographic information about a research publication, including its title and abstract. The second contains the web-page title and URL. Preliminary experiments showed that using title words as tags outperforms using words from abstracts and URLs; the latter two contain fewer correct tags. The title is also the only element shared by both resource types, and it is common in other folksonomies, which are additional advantages. I decided to use the title as the representation of the resource.

To evaluate the three potential sources of tag recommendations, namely words from the resource title, resource tags and the user's personomy, I checked for each post if its tags can be found in any of these sources associated with all other posts in the folksonomy. The quality of sources was measured by precision (i.e., the number of correct tags retrieved divided by the total number of retrieved tags) and recall (i.e., the number of correct tags retrieved divided by the total number of correct tags). These are standard information retrieval metrics [4]. The value of recall was averaged over all tested posts.
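The two metrics and their averaging can be illustrated with a short sketch. It assumes each tested post is given as a pair of tag collections (tags retrieved from a source, tags actually used in the post); the function names are hypothetical. Following the averaging convention used in this section, recall is averaged over all tested posts, while precision is averaged only over posts for which the source returned any tags.

```python
def precision_recall(retrieved, correct):
    """Return (precision, recall) for a single post.

    Precision is None when the source retrieved no tags, so that such
    posts can be excluded from the precision average.
    """
    retrieved, correct = set(retrieved), set(correct)
    hits = len(retrieved & correct)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


def average_scores(results):
    """Average precision and recall over (retrieved, correct) pairs."""
    precisions, recalls = [], []
    for retrieved, correct in results:
        p, r = precision_recall(retrieved, correct)
        recalls.append(r)  # recall: averaged over all tested posts
        if p is not None:
            precisions.append(p)  # precision: only posts with retrieved tags
    avg_precision = sum(precisions) / len(precisions) if precisions else 0.0
    avg_recall = sum(recalls) / len(recalls) if recalls else 0.0
    return avg_precision, avg_recall
```

For example, a post with retrieved tags `["web", "tools", "news"]` and correct tags `["web", "blog"]` has precision 1/3 and recall 1/2; a second post for which the source returns nothing contributes a recall of 0 but is left out of the precision average.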
The averaged recall informs us how many of the correct tags can be found in a source. The value of precision was averaged only over posts for which the source returned any tags. Precision averaged this way is the ratio of correct tags among all tags retrieved. In addition, I present the total number of potential tags obtained from the sources, and the number of correct tags among them (Fig. 3).

The user's personomy is the richest source of correct tag recommendations. For the tested BibSonomy snapshot it gave access to 90% of tags from test posts. On
the other hand, correct tags from the personomy are accompanied by a large number of incorrect tags (precision around 0.001). Compared to tags retrieved from the personomy, the recommendation based on the resource title is much more precise; however, the number of correct tags found this way is lower. In addition, most of these tags can also be found in the user's personomy. Finally, both recall and precision values show that resource tags are not a good source of potential tag recommendations. The character of each tag recommendation source and its potential usability in a tag recommendation system are discussed in the following sections.

[Figure 3, left diagram. Sets: Title Tags, Resource Tags, Personomy Tags. Region values given as (avg. recall) nr of correct tags: (0.03) 26,054; (0.01) 10,162; (0.004) 4,476; (0.66) 529,357; (0.15) 113,707; (0.07) 74,926; (0.02) 22,746. Total nr of tested tags: 843,752. Tags not found: 62,324 (7%).]

[Figure 3, right diagram. Sets: Title Tags, Resource Tags, Personomy Tags. Region values given as (avg. precision) nr of proposed tags: (0.04) 713,025; (0.04) 218,420; (0.22) 19,637; (0.001) 565,047,072; (0.26) 444,295; (0.1) 762,316; (0.13) 177,977.]

Fig. 3. Venn diagrams presenting average recall, plus the number of correct tags found in three potential sources of tags (left) and average precision, plus the total number of tags retrieved from these sources (right).

Resource title. The resource title appears to be the most robust source of tag recommendations. Among all posts in the processed BibSonomy snapshot only 51 resource titles were unable to produce any tags (no letters or numbers in the title). In addition, among all discussed sources the title seems to be the most strongly related to the resource. The drawback of this source is low recall, which makes the title inappropriate as a stand-alone tag recommender. The title is a simplified natural language sentence, which should be cleaned of words with no informative value (e.g., stopwords).

Resource tags. Tags assigned to the resource by other folksonomy users are not a good source of tag recommendations.
One of the reasons is the sparsity of data; 92% of resources were added to the system only once. This fact significantly limits the possible recall of this source of tags. The other issue is the personal character of posts (discussed in Section 3.1), which hurts the precision of retrieved tags. The variety of tags attached by users creates, however, another application for resource tag sets. Mining relations between tags attached to the same resource can result in a simplified semantic lexicon. The lexicon would not give us the