Web Semantics: Science, Services and Agents on the World Wide Web 9 (2011) 1–15
journal homepage: www.elsevier.com/locate/websem
doi:10.1016/j.websem.2010.10.001

Categorising social tags to improve folksonomy-based recommendations

Iván Cantador a,c,∗, Ioannis Konstas b,c, Joemon M. Jose c

a Departamento de Ingeniería Informática, Universidad Autónoma de Madrid, Calle Francisco Tomás y Valiente, 11, 28049 Madrid, Spain
b School of Informatics, University of Edinburgh, EH8 9AB Edinburgh, United Kingdom
c Department of Computing Science, University of Glasgow, G12 8QQ Glasgow, United Kingdom
∗ Corresponding author at: Departamento de Ingeniería Informática, Universidad Autónoma de Madrid, Calle Francisco Tomás y Valiente, 11, 28049 Madrid, Spain. Tel.: +34 91 497 2358; fax: +34 91 497 2235. E-mail address: ivan.cantador@uam.es (I. Cantador).

Article history: Received 27 July 2010; received in revised form 11 October 2010; accepted 11 October 2010; available online 20 October 2010

Keywords: Social tagging; Recommender systems; Ontologies; Semantic Web; W3C Linking Open Data

Abstract

In social tagging systems, users have different purposes when they annotate items. Tags not only depict the content of the annotated items, for example by listing the objects that appear in a photo, or express contextual information about the items, for example by providing the location or the time in which a photo was taken, but also describe subjective qualities and opinions about the items, or can be related to organisational aspects, such as self-references and personal tasks. Current folksonomy-based search and recommendation models exploit the social tag space as a whole to retrieve those items relevant to a tag-based query or user profile, and do not take into consideration the purposes of tags. We hypothesise that a significant percentage of tags are noisy for content retrieval, and believe that distinguishing the personal intentions underlying the tags may improve the accuracy of search and recommendation processes.

We present a mechanism to automatically filter and classify raw tags into a set of purpose-oriented categories. Our approach finds the underlying meanings (concepts) of the tags, mapping them to semantic entities belonging to external knowledge bases, namely WordNet and Wikipedia, through the exploitation of ontologies created within the W3C Linking Open Data initiative. The obtained concepts are then transformed into semantic classes that can be uniquely assigned to content- and context-based categories. The identification of subjective and organisational tags is based on natural language processing heuristics.

We collected a representative dataset from the Flickr social tagging system, and conducted an empirical study to categorise real tagging data and to evaluate whether the resulting tag categories really benefit a recommendation model using the Random Walk with Restarts method. The results show that content- and context-based tags are considered superior to subjective and organisational tags, achieving performance equivalent to that of using the whole tag space.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Motivation

During the last few years, we have been witnessing an unexpected success and increasing popularisation of social tagging systems. In these systems, users create or upload content (items), annotate it with freely chosen words (tags), and share it with other users. The whole set of tags constitutes an unstructured knowledge classification scheme that is commonly known as folksonomy [32]. This implicit classification is then used to search and recommend items. The nature of tagged items is manifold: photos in Flickr,1 audio tracks in Last.fm,2 video clips in YouTube,3 and Web documents in Delicious,4 among others. A user can usually create (upload) items, and annotate them with tags he considers appropriate. In some folksonomies, the user can also tag items he did not create.

1 Flickr – Photo sharing, http://www.flickr.com.
2 Last.fm – Personal online radio, http://www.last.fm.
3 YouTube – Video sharing, http://www.youtube.com.
4 Delicious – Social bookmarking, http://delicious.com.

The main advantage of folksonomies is that users are not required to rely on an a priori agreed knowledge structure or shared vocabulary, and thus face no constraints in the tagging process and information management. Nevertheless, this freedom implies a number of limitations on the content retrieval mechanisms. Social tags may explicitly describe the content of an item, e.g. by listing physical objects that are shown in a photo or a video, or
by giving keywords that appear in a Web document or a song lyric. They may also provide contextual information about the annotated item, e.g. by identifying the place a photo was taken, or the date a video was recorded. Furthermore, they may express subjective opinions and qualities (nice picture, rock music, dark movie scene, incomprehensible text), or self-references and personal tasks (my wife, to read, work). This suggests that users have different intentions when tagging, and that not all the tags available in a folksonomy are related to the content of the annotated items [3].

Current folksonomy-based search and recommendation engines do not take into account the above distinction of tags, and run their content retrieval algorithms in the entire tag space. The problem is that, although subjective and organisational tags are useful for the purposes (intentions) of an individual, they may fail to be of benefit when recommending items to other users. As a result, mixing these with the remaining content- and context-based tags may not improve, or may even deteriorate, the overall quality of the recommendations. We hypothesise that distinguishing and considering purpose-oriented categories of tags could be extremely valuable to improve the accuracy of recommendation approaches. Hence, the corroboration of this assumption represents the main challenge addressed in the work presented herein.

In order to achieve such tag categorisation, we first have to understand the meaning of each social tag. For example, to determine that kilt can be categorised as a “content-based” tag, it has to be identified that a kilt is a Scottish piece of clothing, i.e. a “physical entity”. Similarly, to categorise glasgow as a “context-based” tag, it has to be identified that Glasgow is a city in Scotland, i.e. a “location”.
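The kilt and glasgow examples above can be read as a two-step lookup: a tag is first mapped to a semantic class, and the semantic class is then mapped to a purpose-oriented category. The following is only a minimal sketch of that idea. The two dictionaries are hypothetical toy data standing in for an external knowledge base, and the function name `categorise_tag` is an assumption made for illustration; none of it is taken from the authors' implementation.

```python
# Hypothetical toy stand-in for an external knowledge base:
# each known tag is mapped to a single semantic class.
TAG_TO_SEMANTIC_CLASS = {
    "kilt": "physical entity",
    "glasgow": "location",
}

# Hypothetical mapping from semantic classes to purpose-oriented categories.
SEMANTIC_CLASS_TO_CATEGORY = {
    "physical entity": "content-based",
    "location": "context-based",
}


def categorise_tag(tag: str) -> str:
    """Return the purpose-oriented category of a tag, or 'unknown'."""
    # Step 1: find the semantic class underlying the (lower-cased) tag.
    semantic_class = TAG_TO_SEMANTIC_CLASS.get(tag.strip().lower())
    if semantic_class is None:
        return "unknown"
    # Step 2: map the semantic class to a purpose-oriented category.
    return SEMANTIC_CLASS_TO_CATEGORY.get(semantic_class, "unknown")


if __name__ == "__main__":
    for tag in ("kilt", "Glasgow", "my wife"):
        print(tag, "->", categorise_tag(tag))
```

A tag that the toy knowledge base does not cover, such as my wife, falls through to "unknown" in this sketch; the paper handles subjective and organisational tags with separate natural language processing heuristics.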
It is our objective to study the role of various tag categories for item recommendation. However, categorising a set of general purpose tags is not trivial. In this context, we have identified the following research questions:

• RQ1: Is it possible to find out the underlying meanings of social tags in a general way? We should (1) identify the meanings of social tags independently of the domains covered by the folksonomies they belong to; and (2) be aware of contemporary terminology that continuously appears in our daily lives (web 2.0, podcast, diy).
• RQ2: Is it possible to automatically categorise social tags based on their intention? The transformation of semantic concepts into purpose-oriented categories could be done by exploiting external knowledge bases such as thesauri, taxonomies and ontologies.
• RQ3: Is a purpose-oriented categorisation of social tags useful for folksonomy-based recommendation strategies? To validate the utility of the purpose-oriented categories, these should be evaluated in a real folksonomy-based recommender system.

1.2. Contributions

In this work, we address the research questions listed above, and make the following contributions:

• We have developed a mechanism that automatically processes and maps social tags to semantic concepts depicted in external structured knowledge bases.
• Exploiting the semantic relations provided by the above knowledge structures, we have designed a novel strategy to automatically infer the semantic classes of a given concept, which allow the intention of the corresponding social tag to be determined.
• We have conducted an empirical study to evaluate the effect of various tag categories in photo recommendation. The experiments have been performed with a dataset obtained from Flickr, a multi-domain tagging system where photos are freely annotated by their owners. The results show there are certain tag categories that are superior to others in terms of recommendation performance, and even equivalent to that obtained when using the whole tag space.
• We have collected a dataset from the Flickr system, which we have made available to the research community.

1.3. Structure of the paper

The rest of the paper is organised as follows. Section 2 describes related work that has motivated our research. Section 3 presents an overview of our approach to automatically categorise social tags based on their purpose. Section 4 explains the approach in more detail, describing how the semantic concepts underlying social tags are identified, and how they are mapped to a set of predefined purpose-oriented categories. Section 5 describes the folksonomy-based recommendation model with which we have evaluated our tag categorisation proposal. Section 6 presents the conducted experiments, and Section 7 provides a discussion of the obtained results. Finally, Section 8 contains conclusions and future research lines.

2. Related work

2.1. Categorisation of social tags

A primary goal in many social tagging systems is to meet the needs of individual users, e.g. by allowing personal organisation of items and their subsequent retrieval. Nonetheless, social tags should help other people to browse and find items. Furthermore, being a mechanism of community-based item description, they should also facilitate information sharing and discovery (recommendation). Marlow et al. [22], and Ames and Naaman [3] discuss an exhaustive list of incentives expressing the range of potential motivations that influence tagging. Among them, content management and retrieval are shown as two of the most important incentives to tag resources. Our work is based on this observation, and aims to identify which social tags are more useful for content retrieval and recommendation.

To provide such functionalities, it is not obvious how social tags can be best exploited.
Suchanek et al. [35] show that user-generated tags present significantly more semantic noise than terms extracted from Web page contents or search queries. When tagging, people not only introduce misspellings (barcelona, barclona), and use different synonyms (car, automobile), acronyms (nyc, new york city) and morphologic derivations (blog, blogs, blogging) for a given concept [36], but also include tags that express personal assessments (funny, to print), or are even unintelligible to another person (#####) [35]. We deal with these issues by making use of the tag processing and filtering approaches presented in previous works [36,37], and by mapping the resultant tags to semantic concepts described in external knowledge bases (KB), similarly to [9], where social tags are linked to ontology classes and instances.

The purposes of tagging, and consequently the types of social tags, are manifold. Recent works have analysed this fact, aiming to identify which social tags are relevant for knowledge management and information retrieval. Apart from describing the content of the items, social tags may represent contextual information [3], subjective opinions and qualities [15,30], or self-presentation and organisation aspects [41]. We consider these purpose-oriented tag categories, and propose a more fine-grained categorisation within them, in order to study which types of tags are useful for content retrieval tasks.
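Some of the tag noise described above (unintelligible tags such as #####, case differences, and morphologic derivations such as blog, blogs, blogging) can be illustrated with a small pre-processing sketch. This is not the filtering approach of [36,37]; it is a deliberately naive illustration under stated assumptions. The helper names `normalise_tag`, `naive_stem` and `group_tags` are hypothetical, the suffix rules cover only the "-s" and "-ing" cases, and misspellings, synonyms and acronyms are left out because resolving them needs external resources.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def normalise_tag(tag: str) -> Optional[str]:
    """Lower-case a tag and collapse whitespace; drop unintelligible tags."""
    cleaned = " ".join(tag.strip().lower().split())
    # Assumption: a tag without any letter or digit (e.g. '#####') is noise.
    if not any(ch.isalnum() for ch in cleaned):
        return None
    return cleaned


def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling left by '-ing' (blogg -> blog).
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def group_tags(raw_tags: Iterable[str]) -> Dict[str, List[str]]:
    """Group raw tags that share the same normalised, stemmed form."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in raw_tags:
        tag = normalise_tag(raw)
        if tag is None:
            continue  # filtered out as unintelligible
        key = " ".join(naive_stem(word) for word in tag.split())
        groups[key].append(raw)
    return dict(groups)


if __name__ == "__main__":
    print(group_tags(["blog", "Blogs", "blogging", "#####", "glasgow"]))
```

In this sketch the three blog variants end up in one group and the ##### tag is discarded, which mirrors the intent of the filtering step without claiming to reproduce it.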
L Contador et al/Web Semantics: Science, Services and Agents on the World wide Web 9(2011)1-15 Motivated by the previous works, Bischoff et al. [7] manually Au Yeung et al. [6] describe a strategy that clusters the items lassify a number of tag collections obtained from different social tagged by the users. In the item-tag space, given a network of tagging systems( Flickr, Delicious, Last. fm)in several tag types, and items, a graph-based clustering algorithm to obtain sets of related study the distributions of tags assigned to each type, analysing items is applied. As the different clusters should contain items their usage implications on search tasks. The obtained results pro- that are related to similar topics, a cluster can be considered as vide insight into the use of different kinds of tags for improving corresponding to one of the interests of the user. Moreover, the search. Here we go a step beyond attempting to categorise the tags experiments presented in the paper show that the obtained groups utomatically. In this case, the evaluation of the tag categorise- of tags and items seem to correspond to the different meanings tion is assessed with a recommendation model [17] which does of ambiguous tags. In this work, we also use a graph-based algo- not depend on a specific domain. In this paper, we have conducted rithm on the item-tag space In our case, using a Random Walk s with a dataset obtained from flickr. w gy we aim to identify related tags and items relevant for the freely tagged by the owners in a multi-domain scenario. To achieve such tag categorisation, the meanings of social tags Similarly, Gemmell et al. explore in several works [14,31 ave to be found beforehand. We propose to map them to semantic strategies that cluster the entire space of tags to obtain sets of ncepts described in external KBs, such as thesauri and ontologies. (semantically)related tags. These clusters may represent coherent Halpin et al. 
[16] show that tagging distributions tend to stabili topic areas. By associating a user's interest to a particular cluster. into power law distributions, which is an essential aspect of what the user's interests in the topic are surmised. As discussed in the light be user consensus around the categorisation of information last section of the paper this type of clustering techniques could be riven by tagging. The authors state that it is quite plausible that incorporated into our approach in order to enhance the automatic folksonomies and ontologies are fundamentally compatible. We categorisation of ambiguous social tags, according to the context of follow this principle, and attempt to integrate social tags into YAGo the user profile in which a given tag appears. [34]. a large ontology that covers WordNet [24 and a significant Instead of implicit clusters, other personalisation and recom part of Wikipedia.5 mendation approaches aim to exploit explicit, and more structured Constructing and linking folksonomies with structured seman- representations of folksonomies. Quintarelli et al. [ 29] propose KBs is indeed a problem that has attracted much attention a personal multi-facet categorisation of tags, which allows the recently [28]. Mika [23 is recognised as one of the first authors to exploitation of taxonomic relations to enhance content retrieval tend the traditional bipartite resource-concept model of ontolo- In a series of previous works [8,9, 36] we have investigated rec- es with the social dimension. He presents a graph based approach ommendation approaches that make use of ontology-based user to construct a network of related tags, projected from either a profiles. Social tags are automatically transformed into ontology Iser-tag or resource-tag association graph. Applying clustering concepts(classes and instances)using semantic knowledge bases tags, and using their co-occurrence statistics, he like WordNet and Wikipedia. 
Arbitrary ontology relations between produces conceptual hierarchies. Specia and Motta[33] present a these concepts are exploited to expand the user profiles, and pe combination of pre-processing strategies and statistical techniques sonalise search and recom tion results. In this work, we together with knowledge provided by ontologies available on the attempt to map social tags to semantic concepts. As explained in Semantic Web to generate clusters of highly related tags that co subsequent sections, in this case, we propose to use YAGO onto respond to ontology concepts. As explained in subsequent sections, ogy, aiming to join and contribute to the w3C Linking Open Data we shall also make use of tag processing and filtering techniques, initiative similar to those presented in 33. Angeletou [4]proposes a seman- Recent works have focused on exploiting folksonomies tic enrichment of folksonomies by exploiting online ontologies, as sources of semantic information, integrating them with nesauri and other knowledge sources to make explicit the seman- content-based and collaborative filtering(CF)recommendation tic relations between social tags Instead of inferring such semantic approaches. relations between tags, we use those explicitly defined in YAGO De Gemmis et al. [11 present a hybrid strategy that learns the This is enough for our approach since our goal is to categorise the rofile of the user from both static content and tags associated with tags, and we only have to exploit hierarchical relations between items rated by him, instead of relying on tags only the authors pro- pose to include in the user profile not only his personal tags, but also the tags adopted by other users who rated the same items as him. Since the main problem lies in the fact that tags are freely 2. 2. 
Folksonomy-based recommender systems chosen by users, and their actual meaning is usually not very clear they suggest to semantically interpreting tags by means of Word- Collaborative tagging systems allow a user to search for the Net. Our tag categorisation also follows this idea, but extends the content that he has tagged using a personal vocabulary As users use of wordNet to wikipedia, allowing the consideration of social with similar interests tend to have a shared vocabulary tags cre- tags related to proper nouns and contemporary terms not available ted by one user may be useful to others, particularly those with in a dictionary such as WordNet. similar interests this is in Tso-Sutter et al. 38 describe a generic method that allows tags tems. In these systems, a user does not usually declare explicitly to be incorporated into standard heuristic-based CF algorithms his information needs(e.g, by means of a keyword-based query). such as user-and item-based CF, by reducing the three-dimensional In contrast, he is presented with items that may be interested for (user, item, tag) correlations to three two-dimensional correla him according to his profile(content-based approaches), or to the tions, and then applying a fusion method to re-associate these profiles of"similar"people(collaborative filtering approaches ) The correlations. The integration of folksonomy information into CF ader is referenced to [1 for an overview of the state of the art in is also studied by Zhen et al. [44 In this case, the authors pro- recommender systems. In the following, we focus our attention on pose to use the model-based CF algorithm based on probabilistic recommendation approaches that exploit folksonomy information. matrix factorization. Differently to these approaches, as explained in the paper, our tag-based recommendation model follows the CF paradigm by means of applying Random Walk algorithm on the global graph formed by users, items, tags and their explicit
I. Cantador et al. / Web Semantics: Science, Services and Agents on the World Wide Web 9 (2011) 1–15 3 Motivated by the previous works, Bischoff et al. [7] manually classify a number of tag collections obtained from different social tagging systems (Flickr, Delicious, Last.fm) in several tag types, and study the distributions of tags assigned to each type, analysing their usage implications on search tasks. The obtained results provide insight into the use of different kinds of tags for improving search. Here we go a step beyond attempting to categorise the tags automatically. In this case, the evaluation of the tag categorisation is assessed with a recommendation model [17], which does not depend on a specific domain. In this paper, we have conducted experiments with a dataset obtained from Flickr, where photos are freely tagged by the owners in a multi-domain scenario. To achieve such tag categorisation, the meanings of social tags have to be found beforehand. We propose to map them to semantic concepts described in external KBs, such as thesauri and ontologies. Halpin et al. [16] show that tagging distributions tend to stabilise into power law distributions, which is an essential aspect of what might be user consensus around the categorisation of information driven by tagging. The authors state that it is quite plausible that folksonomies and ontologies are fundamentally compatible. We follow this principle, and attempt to integrate social tags into YAGO [34], a large ontology that covers WordNet [24] and a significant part of Wikipedia.5 Constructing and linking folksonomies with structured semantic KBs is indeed a problem that has attracted much attention recently [28]. Mika [23] is recognised as one of the first authors to extend the traditional bipartite resource-concept model of ontologies with the social dimension. He presents a graph based approach to construct a network of related tags, projected from either a user-tag or resource-tag association graph. 
Applying clustering techniques to tags, and using their co-occurrence statistics, he produces conceptual hierarchies.

Specia and Motta [33] present a combination of pre-processing strategies and statistical techniques together with knowledge provided by ontologies available on the Semantic Web to generate clusters of highly related tags that correspond to ontology concepts. As explained in subsequent sections, we shall also make use of tag processing and filtering techniques similar to those presented in [33]. Angeletou [4] proposes a semantic enrichment of folksonomies by exploiting online ontologies, thesauri and other knowledge sources to make explicit the semantic relations between social tags. Instead of inferring such semantic relations between tags, we use those explicitly defined in YAGO. This is enough for our approach, since our goal is to categorise the tags and we only have to exploit hierarchical relations between them.

2.2. Folksonomy-based recommender systems

Collaborative tagging systems allow a user to search for the content that he has tagged using a personal vocabulary. As users with similar interests tend to have a shared vocabulary, tags created by one user may be useful to others, particularly those with similar interests. This is in fact the essence of recommender systems. In these systems, a user does not usually declare his information needs explicitly (e.g., by means of a keyword-based query). Instead, he is presented with items that may be of interest to him according to his profile (content-based approaches), or to the profiles of “similar” people (collaborative filtering approaches). The reader is referred to [1] for an overview of the state of the art in recommender systems. In the following, we focus our attention on recommendation approaches that exploit folksonomy information.

Footnote 5: Wikipedia encyclopaedia, http://wikipedia.org.

Au Yeung et al. [6] describe a strategy that clusters the items tagged by the users.
In the item-tag space, given a network of items, a graph-based clustering algorithm is applied to obtain sets of related items. As the different clusters should contain items that are related to similar topics, a cluster can be considered as corresponding to one of the interests of the user. Moreover, the experiments presented in the paper show that the obtained groups of tags and items seem to correspond to the different meanings of ambiguous tags. In this work, we also use a graph-based algorithm on the item-tag space. In our case, we use a Random Walk strategy to identify related tags and items relevant for the user.

Similarly, Gemmell et al. explore in several works [14,31] strategies that cluster the entire space of tags to obtain sets of (semantically) related tags. These clusters may represent coherent topic areas. By associating a user’s interest to a particular cluster, the user’s interests in the topic are surmised. As discussed in the last section of the paper, this type of clustering technique could be incorporated into our approach in order to enhance the automatic categorisation of ambiguous social tags, according to the context of the user profile in which a given tag appears.

Instead of implicit clusters, other personalisation and recommendation approaches aim to exploit explicit and more structured representations of folksonomies. Quintarelli et al. [29] propose a personal multi-facet categorisation of tags, which allows the exploitation of taxonomic relations to enhance content retrieval. In a series of previous works [8,9,36], we have investigated recommendation approaches that make use of ontology-based user profiles. Social tags are automatically transformed into ontology concepts (classes and instances) using semantic knowledge bases like WordNet and Wikipedia. Arbitrary ontology relations between these concepts are exploited to expand the user profiles, and personalise search and recommendation results.
In this work, we also attempt to map social tags to semantic concepts. As explained in subsequent sections, in this case we propose to use the YAGO ontology, aiming to join and contribute to the W3C Linking Open Data initiative.

Recent works have focused on exploiting folksonomies as sources of semantic information, integrating them with content-based and collaborative filtering (CF) recommendation approaches. De Gemmis et al. [11] present a hybrid strategy that learns the profile of the user from both static content and tags associated with items rated by him, instead of relying on tags only. The authors propose to include in the user profile not only his personal tags, but also the tags adopted by other users who rated the same items as him. Since the main problem lies in the fact that tags are freely chosen by users, and their actual meaning is usually not very clear, they suggest semantically interpreting tags by means of WordNet. Our tag categorisation also follows this idea, but extends the use of WordNet to Wikipedia, allowing the consideration of social tags related to proper nouns and contemporary terms not available in a dictionary such as WordNet.

Tso-Sutter et al. [38] describe a generic method that allows tags to be incorporated into standard heuristic-based CF algorithms, such as user- and item-based CF, by reducing the three-dimensional (user, item, tag) correlations to three two-dimensional correlations, and then applying a fusion method to re-associate these correlations. The integration of folksonomy information into CF is also studied by Zhen et al. [44]. In this case, the authors propose to use a model-based CF algorithm based on probabilistic matrix factorization. Unlike these approaches, as explained in the paper, our tag-based recommendation model follows the CF paradigm by applying a Random Walk algorithm on the global graph formed by users, items, tags and their explicit relations.
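To make the idea of a random walk over the user-item-tag graph more concrete, the following is a minimal sketch of a Random Walk with Restarts on a toy graph. It is not the authors' implementation: the function name, the restart probability of 0.15, the fixed number of iterations, the unweighted undirected edges and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def random_walk_with_restarts(edges, start, restart=0.15, iters=100):
    """Approximate visiting probabilities of a walk that restarts at `start`.

    `edges` is an iterable of (node, node) pairs, treated as undirected and
    unweighted (an assumption of this sketch). `start` must appear in `edges`.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = list(neighbours)

    # Begin with all probability mass on the start node.
    p = {n: 0.0 for n in nodes}
    p[start] = 1.0

    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            # With probability (1 - restart), move to a uniformly chosen neighbour.
            share = (1.0 - restart) * p[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        # With probability `restart`, jump back to the start node.
        nxt[start] += restart
        p = nxt
    return p


# Hypothetical toy folksonomy: two users, three photos, two tags.
EDGES = [
    ("u1", "photo1"), ("u1", "madrid"), ("photo1", "madrid"),
    ("u2", "photo2"), ("u2", "madrid"), ("photo2", "madrid"),
    ("u2", "photo3"), ("u2", "dog"), ("photo3", "dog"),
]

scores = random_walk_with_restarts(EDGES, start="u1")
# Rank the photos that u1 has not annotated by their visiting probability.
candidates = sorted(["photo2", "photo3"], key=scores.get, reverse=True)
```

In this toy graph, photo2 shares the tag madrid with the start user u1, so it is reached more often than photo3 and ranks first among the candidates.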
Table 1
Comparison of purpose-based categorisation of social tags.

Our categories | Xu et al. [41] | Sen et al. [30] | Golder and Huberman [15] | Bischoff et al. [7]
Content-based | Content-based | Factual | What or who is about | Topic
 | Attribute | | What it is | Type
 | | | Who owns it | Author/owner
Context-based | Context-based | | Refining other categories | Time, Location
Subjective | Subjective | Subjective | Qualities/characteristics | Opinions/qualities
Organisational | Organisational | Personal | Task organisation | Usage context
 | | | Self reference | Self reference

Adapted from [7].

Fig. 1. Purpose-oriented categorisation of social tags.

3. Overview of the approach

Our goal is to automatically categorise social tags based on their intention, considering the following four main categories:

• Content-based. Social tags that describe the content of the items, such as the objects and living things (animals, plants) that appear in a photo or video, or are mentioned in a text document or a song lyric. Some examples of tags belonging to this category are vehicle, dog and tree.
• Context-based. Social tags that provide contextual information about the items, such as the place where a photo was taken, the date or period of time when a video was recorded, etc. Examples of this kind of tags are madrid, mountain, summer and holidays.
• Subjective. Social tags that express opinions and qualities of the items. Some examples of these tags are happy, sunny and contemporary art.
• Organisational. Social tags that define personal usages and tasks, or indicate self-references. Examples within this category are to look at, scan from print, myself and our best friend.

These tag categories are similar to those identified in the literature. Bischoff et al. [7] compare several categorisation schemas. Table 1 summarises this comparison, and includes our categorisation, which fits with previous schemas.
In contrast to previous studies, we attempt to automatically determine the most suitable category for a given social tag. Fig. 1 depicts the whole categorisation process. Depending on the nature of the input tag, we distinguish two different cases. If the tag can be mapped (see footnote 6) to a semantic concept of an external KB, then it will be assigned to either content- or context-based categories (stages 1a, 2a and 3a in the figure). In this case, we assume a semantic concept corresponds to a physical or non-physical entity related to content or contextual information of an item: objects, living entities, locations, time references, etc. On the other hand, if the tag is not found in the available KBs, we employ Natural Language Processing (NLP) techniques and categorisation heuristics to determine whether the tag can be assigned to subjective or organisational categories (stages 1b, 2b and 3b in the figure). In the following, we briefly describe the above cases. More details are given in Section 4.

Footnote 6: In Section 4.2, we explain in detail how a tag is mapped to a semantic concept of an external KB. At this point, the reader is asked to assume the existence of an automatic mechanism that links a tag with “names” of taxonomy categories, ontology classes or instances, etc. available in the KB.

3.1. Content-based and context-based categories

In principle, social tags belonging to content- and context-based categories are nouns denoting physical and non-physical entities whose definition can be found in dictionaries, encyclopaedias or thesauri. Thus, the first step is to process and map an input tag to a concept existing in a KB (stage 1a in Fig. 1). Let us suppose that the input tag is nyc, which is the acronym for the city of New York, USA. Looking for this term in KBs, we could obtain references to semantic entities related to that concept.
For example, in Wikipedia, New York city is identified by the URLs http://en.wikipedia.org/wiki/NYC and http://en.wikipedia.org/wiki/New_York_City, among others. Let us say the identified concept for nyc is [New York city].

Once we have established the semantic concept underlying a social tag, and assuming the existence of taxonomic relations among concepts in the KB, we propose to exploit such relations, expanding the concept towards its taxonomic ancestors until reaching a “reference” ancestor (stage 2a in Fig. 1), which allows us to later categorise the concept as a content- or a context-based tag. In Section 4, we present and justify the considered “reference” concepts. Here, continuing with the example, we just mention that the concept [New York city] is an instance of the class USA cities, and that expanding USA cities we might find out that [New York city] also belongs to the classes New York state cities, Cities and Locations.
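The expansion towards a reference ancestor described above can be sketched as a breadth-first walk over parent links. This is only an illustration under stated assumptions: the toy taxonomy, the tag-to-concept dictionary, the choice of breadth-first order and the name categorise_tag are hypothetical, and the reference concepts shown are limited to two examples (Locations for the context-based category, Animal for the content-based one).

```python
from collections import deque

# Hypothetical mapping from raw tags to KB concepts (stage 1a).
TAG_TO_CONCEPT = {"nyc": "New York city", "dog": "Dog"}

# Hypothetical toy taxonomy: concept or class -> direct parent classes.
PARENTS = {
    "New York city": ["USA cities", "New York state cities"],
    "USA cities": ["Cities"],
    "New York state cities": ["Cities"],
    "Cities": ["Locations"],
    "Dog": ["Animal"],
}

# Reference concepts, each tied to exactly one purpose-based category.
REFERENCE_CONCEPTS = {
    "Locations": "Context-based",
    "Animal": "Content-based",
}


def categorise_tag(tag):
    """Return (reference concept, category) for a tag, or None if unmapped."""
    concept = TAG_TO_CONCEPT.get(tag)
    if concept is None:
        return None  # tag not found in the KB: handled by the NLP branch

    # Stage 2a: expand towards ancestors until a reference concept is reached.
    queue = deque([concept])
    seen = {concept}
    while queue:
        current = queue.popleft()
        if current in REFERENCE_CONCEPTS:
            # Stage 3a: the reference concept determines the category.
            return current, REFERENCE_CONCEPTS[current]
        for parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


result = categorise_tag("nyc")
```

With this toy data, the walk for nyc passes through USA cities and Cities before stopping at Locations, which yields the context-based category.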
Table 2
Proposed purpose-oriented categories and semantic subcategories, with examples of real Flickr tags categorisations.

Category | Subcategory | Flickr tag examples
Content-based | Physical entity | food, glue, heart, ice
 | Artefact | comb, finger, helicopter, table
 | Living entity | cell, clone, life, mushroom
 | Animal | caterpillar, frog, pigeon, pet
 | Person | boy, daniel, friend, sister
 | Plant | cactus, cereal, flower, tree
 | Non-physical entity | cloud, feminism, noise, tennis
 | Organisation | bmw, ibm, religion, rolling stones
Context-based | Location | california, rome, spain, wedding
 | Time | halloween, march, sixties, winter
Subjective | Opinion | oh damn, so cute, unforgettable
 | Quality | golden picture, geometric elegance
Organisational | Self-reference | i love you, her, missing you
 | Task | time for change, do not want to know
 | Action | avoid, hiking, explore page, sit

In Wikipedia, this kind of classification is given by its “Wiki categories” (see footnote 7). In the example, the semantic expansion is stopped at the Locations class. This is a reference concept since it is uniquely associated with the context-based category (stage 3a in Fig. 1).

3.2. Subjective and organisational categories

When a social tag is not a noun, we apply NLP techniques and a number of categorisation heuristics to determine whether it can be categorised as a subjective or an organisational tag.

The first step is to tokenise the tags, and determine the part of speech (PoS), i.e. noun, verb, adjective, adverb, preposition, etc., of each of the obtained tokens (stage 1b in Fig. 1). For example, let us suppose that the input tag is to read. This is tokenised as (to, read), and transformed into the tuple (to <preposition>, read <verb>).

Next, we analyse the PoS tuple in order to find a subset of tokens that satisfies one of a set of patterns predefined for each category.
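The tokenise, tag and match steps of this branch can be sketched as follows. This is a minimal illustration and not the paper's heuristics: the hand-made PoS lexicon stands in for a real PoS tagger, the function names are hypothetical, and of the two patterns listed only the preposition-plus-verb task pattern is taken from the text; the adverb-plus-adjective opinion pattern is an assumption added for the example.

```python
# Hypothetical hand-made lexicon; a real system would call a PoS tagger.
POS_LEXICON = {
    "to": "preposition", "at": "preposition",
    "read": "verb", "look": "verb",
    "so": "adverb", "cute": "adjective",
}

# (PoS pattern, subcategory, main category).
PATTERNS = [
    (("preposition", "verb"), "Task", "Organisational"),  # named in the text
    (("adverb", "adjective"), "Opinion", "Subjective"),   # assumed for illustration
]


def tag_pos(tag):
    """Stage 1b: tokenise a tag and attach a PoS label to every token."""
    tokens = tag.lower().replace("-", " ").split()
    return [(token, POS_LEXICON.get(token, "unknown")) for token in tokens]


def categorise_by_pattern(tag):
    """Stages 2b and 3b: find a matching PoS pattern and return its category."""
    pos_sequence = tuple(pos for _, pos in tag_pos(tag))
    for pattern, subcategory, category in PATTERNS:
        width = len(pattern)
        # Search for the pattern as a contiguous run of PoS labels.
        for i in range(len(pos_sequence) - width + 1):
            if pos_sequence[i:i + width] == pattern:
                return subcategory, category
    return None


tagged = tag_pos("to read")
outcome = categorise_by_pattern("to read")
```

Matching on a contiguous subset of the PoS tuple means that a longer tag such as to look at still matches the task pattern through its first two tokens.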
In the example, the pattern [<preposition> + <verb>] may represent a task (stage 2b in Fig. 1).

Finally, through a heuristic approach, we assign the found pattern to a category (stage 3b in Fig. 1). Continuing with the example, a task is assigned to the organisational category. In the next section, we will describe our tag categorisation approach in more detail.

Footnote 7: Wikipedia categories, http://en.wikipedia.org/wiki/Special/Categories.

4. Categorisation of social tags

4.1. Tag categories

For each of the four purpose-based categories proposed in this work, we define a set of subcategories encompassing the types of entities that can be assigned to those categories. Table 2 shows these subcategories and examples of Flickr social tags automatically categorised by our approach.

Among content-based tags, we find physical and non-physical entities. Physical entities comprise artefacts and living entities. Living entities can be split into animals and plants. Persons are considered as animals, and similarly, organisations are non-physical entities. Related to the context of an item, we find location and time entities. Within the subjective category, we distinguish between personal opinions and qualities. Finally, organisational entities are divided into self-references, tasks and actions.

Instead of directly assigning a social tag to a purpose-based category, we first identify the most suitable subcategory, and then obtain the corresponding main category. We detail this process in Sections 4.2 and 4.3.

4.2. Categorising content- and context-based social tags

The categorisation of content- and context-based tags is based on their mapping to semantic concepts described in an external KB. A major requirement imposed on the KB is that it has to provide a classification hierarchy (taxonomy) among its concepts. As described in Section 3, given a social tag, we should not only be able to map it to a semantic concept, but also to determine one of