STaR: a Social Tag Recommender System

Cataldo Musto, Fedelucio Narducci, Marco de Gemmis, Pasquale Lops, and Giovanni Semeraro
Department of Computer Science, University of Bari “Aldo Moro”, Italy
{musto,narducci,degemmis,lops,semeraro}@di.uniba.it

Abstract. The continuous growth of collaborative platforms that we have recently witnessed has made possible the passage from an ‘elitist’ Web, written by few and read by many, towards the so-called Web 2.0, a more ‘user-centric’ vision in which users become active contributors to Web dynamics. In this context, collaborative tagging systems are rapidly emerging: on these platforms users can annotate the resources they like with freely chosen keywords (called tags) in order to make information retrieval and serendipitous browsing easier. However, as tags are handled in a purely syntactic way, collaborative tagging systems suffer from typical Information Retrieval (IR) problems such as polysemy and synonymy. To reduce the impact of these drawbacks and, at the same time, to aid the so-called tag convergence, systems that assist the user in the task of tagging are required. The goal of these systems (called tag recommenders) is to suggest a set of relevant keywords for the resources to be annotated by exploiting different approaches. In this paper we present a tag recommender developed for the ECML-PKDD 2009 Discovery Challenge. Our approach is based on two assumptions. Firstly, if two or more resources share some common patterns (e.g. the same features in the textual description), we can exploit this information by supposing that they could be annotated with similar tags. Furthermore, since each user has a typical manner of labeling resources, a tag recommender might exploit this information to give more weight to the tags she has already used to annotate similar resources.

Key words: Recommender Systems, Web 2.0, Collaborative Tagging Systems, Folksonomies

1 Introduction

The coming of Web 2.0 has changed the role of Internet users and the shape of the services offered by the World Wide Web. Since web sites tend to be more interactive and user-centric than in the past, users are shifting from passive consumers of information to active producers. By using Web 2.0 applications, users are able to easily publish content such as photos, videos, political opinions and reviews, so they are identified as Web prosumers: producers + consumers of knowledge.

One of the forms of user-generated content (UGC) that has drawn the most attention from the research community is tagging, which is the act of annotating
resources of interest with free keywords, called tags, in order to help users in organizing, browsing and searching resources through the building of a socially constructed classification schema, called folksonomy [18]. In contrast to systems where information about resources is only provided by a small set of experts, collaborative tagging systems take into account the way individuals conceive the information contained in a resource [19]. Well-known examples of platforms that embed tagging activity are Flickr¹ to share photos, YouTube² to share videos, Del.icio.us³ to share bookmarks, Last.fm⁴ to share music listening habits and BibSonomy⁵ to share bookmarks and lists of literature. Although these systems provide heterogeneous content, they have a common core: once a user is logged in, she can post a new resource and choose some significant keywords to identify it. Besides, users can label resources previously posted by other users. This phenomenon represents a very important opportunity to categorize the resources on the Web, which would otherwise be hardly feasible. The tagging of resources by different users is the social aspect of this activity; in this way tags create a connection among users and items. Users that label the same resource with the same tags could have similar tastes, and items labeled with the same tags could have common characteristics.

Many would argue that the power of tagging lies in the ability for people to freely determine the appropriate tags for a resource without having to rely on a predefined lexicon or hierarchy [11]. Indeed, folksonomies are fully free and reflect the user's mind, but they suffer from the same problems as any unchecked vocabulary. Golder et al. [5] identified three major problems with current tagging systems: polysemy, synonymy, and level variation.

Polysemy refers to situations where tags can have multiple meanings: for example, a resource tagged with the term ‘turkey’ could indicate a news item about politics taken from an online newspaper or a recipe for Thanksgiving Day. When multiple tags share a single meaning we refer to it as synonymy. In collaborative tagging systems we can have simple morphological variations (for example, we can find ‘blog’, ‘blogs’ and ‘web log’ used to identify a common blog) but also semantic similarity (like resources tagged with ‘arts’ versus ‘cultural heritage’). The third problem, called level variation, refers to the phenomenon of tagging at different levels of abstraction. Some people may annotate a web page containing a recipe for roast turkey with the tag ‘roast turkey’, others with a simple ‘recipe’.

In order to avoid these problems, in recent years many tools have been developed to assist the user in the task of tagging and to aid tag convergence [4]: these systems are known as tag recommenders. When a user posts a resource on a Web 2.0 platform, a tag recommender suggests some significant keywords to label the item, following some criteria to filter out the noise from the complete tag space.

¹ http://www.flickr.com
² http://www.youtube.com
³ http://delicious.com/
⁴ http://www.last.fm/
⁵ http://www.bibsonomy.org/
This paper presents STaR (Social Tag Recommender system), a tag recommender system developed for the ECML-PKDD 2009 Discovery Challenge. The idea behind our work is that folksonomies create connections among users and items, so we tried to emphasize two concepts:
– Resources with similar content could be annotated with similar tags;
– A tag recommender needs to take into account the previous tagging activity of users, by giving more weight to tags already used to annotate similar resources.

In this work we identify two main aspects of the tag recommendation task. Firstly, each user has a typical manner of labeling resources (for example, using personal tags such as ‘beautiful’, ‘ugly’, ‘pleasant’, etc., which are not connected to the content of the item, or simply using general tags like ‘politics’, ‘sport’, etc.). Secondly, similar resources usually share common tags: when a user posts a resource r on the platform, our system takes into account how she (if she is already stored in the system) and the entire community previously tagged resources similar to r in order to suggest relevant tags. We developed this model and tested it on a dataset extracted from BibSonomy.

The paper is organized as follows. Section 2 analyzes related work. The general problem of tag recommendation is introduced in Section 3. Section 4 explains the architecture of the system and how the recommendation approach is implemented. The experimental evaluation is described in Section 5.1, while conclusions and future work are drawn in the last section.

2 Related Work

Previous work in the tag recommendation area can be broadly divided into three classes: content-based, collaborative and graph-based approaches.

In the content-based approach, a system exploits some textual source with Information Retrieval-related techniques [1] in order to extract relevant unigrams or bigrams from the text. Brooks et al. [3], for example, develop a tag recommender system that automatically suggests tags for a blog post by extracting the top three terms according to TF/IDF scoring [14]. The system presented by Lee and Chun [8] recommends tags retrieved from the content of a blog using artificial neural networks. The network is trained on statistical information about word frequencies and on lexical information about word semantics extracted from WordNet.

The collaborative approach to tag recommendation, instead, presents some analogies with collaborative filtering methods [2]. In the model proposed by Mishne and implemented in AutoTag [12], the system suggests tags based on the other tags associated with similar posts in a given collection. The recommendation process is performed in three steps: first, the tool finds similar posts and extracts their tags; then all the tags are merged, building a general folksonomy that is filtered and reranked; finally, the top-ranked tags are suggested to the user, who selects the most appropriate ones to attach to the post. TagAssist [16] improves on AutoTag's approach by performing a lossless compression over existing tag data. It finds similar blog posts and suggests a subset of the associated tags through a
Tag Suggestion Engine (TSE), which leverages previously tagged posts to provide appropriate suggestions for new content. In [10] the tag recommendation task is performed through a user-based collaborative filtering approach. The method seems to produce good results when applied to the user-tag matrix, which suggests that users with a similar tag vocabulary tend to tag alike.

The problem of tag recommendation through graph-based approaches was first addressed by Jäschke et al. in [7]. They compared several recommendation techniques, including collaborative filtering, PageRank and FolkRank. The key idea behind the FolkRank algorithm is that a resource which is tagged with important tags by important users becomes important itself. The same concept holds for tags and users, thus the approach uses a graph whose vertices mutually reinforce each other by spreading their weights. The evaluation showed that FolkRank outperforms the other approaches. Schmitz et al. [15] proposed association rule mining as a technique that might be useful in the tag recommendation process.

In the literature we can also find some hybrid methods that integrate two or more approaches (mainly content-based and collaborative ones) in order to reduce their typical drawbacks and emphasize their strengths. Heymann et al. [6] present a tag recommender that exploits social knowledge and textual sources at the same time. They suggest tags based on page text, anchor text and surrounding hosts, adding tags used by other users to label the URL. The effectiveness of this approach is also supported by the use of a large dataset crawled from del.icio.us for the experimental evaluation. A hybrid approach is also proposed by Lipczak in [9]. Firstly, the system extracts tags from the title of the resource. Afterwards, based on an analysis of co-occurrences, the set of candidate tags is expanded by adding tags that usually co-occur with terms in the title. Finally, tags are filtered and reranked by exploiting the information stored in a so-called “personomy”, the set of the tags previously used by the user.

Finally, in [17] the authors proposed a model based on both the textual content and the tags associated with the resource. They introduce the concept of conflated tags to indicate a set of related tags (like blog, blogs, etc.) used to annotate a resource. By modeling the existing tag space in this way, they are able to suggest various tags for a given bookmark exploiting both user and document models. They won the previous edition of the Tag Recommendation Challenge.

3 Description of the Task

STaR has been designed to participate in the ECML-PKDD 2009 Discovery Challenge⁶. In this section we first introduce a formal model for recommendation in folksonomies, then we analyze the specific requirements of the task proposed for the Challenge.

⁶ http://www.kde.cs.uni-kassel.de/ws/dc09
3.1 Recommendation in Folksonomies

A collaborative tagging system is a platform composed of users, resources and tags that allows users to freely assign tags to resources. Following the definition introduced in [7], a folksonomy can be described as a triple (U, R, T) where:
– U is a set of users;
– R is a set of resources;
– T is a set of tags.

We can also define a tag assignment function tas: U × R → 2^T, which maps each user-resource pair to a set of tags. The tag recommendation task for a given user u ∈ U and a resource r ∈ R can finally be described as the generation of a set of tags tas(u, r) ⊆ T according to some relevance model. In our approach these tags are generated from a ranked set of candidate tags, from which the top n elements are suggested to the user.

3.2 Description of the ECML-PKDD 2009 Discovery Challenge

The 2009 edition of the Discovery Challenge consists of three recommendation tasks in the area of social bookmarking. We compete in the first task, content-based tag recommendation, whose goal is to exploit content-based recommendation approaches in order to provide a relevant set of tags to the user when she submits a new item (bookmark or BibTeX entry) to BibSonomy. The organizers make available a training set with some examples of tag assignment: the dataset contains 263,004 bookmark posts and 158,924 BibTeX entries submitted by 3,617 different users. For each of the 235,328 different URLs and the 143,050 different BibTeX entries, some textual metadata (such as the title of the resource, the description, the abstract and so on) were also provided.

Each candidate recommender is evaluated by comparing the real tags (namely, the tags a user adopts to annotate an unseen resource) with the suggested ones. The accuracy is finally computed using classical IR metrics, such as Precision, Recall and F1-Measure (Section 5.1).

By analyzing the aforementioned requirements, we designed STaR with a prediction task in mind rather than a recommendation one.
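
To make this prediction-style evaluation concrete, the following minimal Python sketch computes Precision, Recall and F1-Measure for a single post by comparing the top n suggested tags with the tags the user actually assigned. The function name evaluate_post, the cut-off n = 5 and the sample tags are assumptions made purely for illustration; the exact protocol adopted by the Challenge organizers (for instance tag normalization or averaging over posts) is not reproduced here.

```python
from typing import Iterable, List, Tuple


def evaluate_post(suggested: List[str], real: Iterable[str], n: int = 5) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one post, using the top-n suggestions.

    `suggested` is assumed to be ranked, best candidate first.
    """
    top_n = set(suggested[:n])     # the n tags actually shown to the user
    real_tags = set(real)          # the tags the user really assigned
    hits = len(top_n & real_tags)  # correctly predicted tags

    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(real_tags) if real_tags else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical post: five ranked suggestions against three real tags.
suggested = ["recipe", "turkey", "cooking", "food", "thanksgiving"]
real = ["recipe", "turkey", "roast"]
precision, recall, f1 = evaluate_post(suggested, real)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this hypothetical case two of the five suggestions match the three real tags, so suggesting more tags can raise recall only at the cost of precision, which is why the choice of n matters for a prediction-oriented system.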
Consequently, we try to emphasize the previous tagging activity of the user, while also looking for connections and patterns among resources. All these decisions are thoroughly analyzed in the next section, which describes the architecture of STaR.

4 STaR: a Social Tag Recommender System

STaR (Social Tag Recommender) is a content-based tag recommender system developed at the University of Bari. The initial idea behind STaR is to improve the model implemented in systems like TagAssist [16] or AutoTag [12]. Although we agree with the idea that resources with similar content could be annotated with similar tags, in our opinion Mishne’s approach presents two important drawbacks: