Chapter 2. Related Work 21

recommendation algorithm is suited to each information seeking task, as we argued before. Music recommendation depends largely on the user’s personal taste, but more complicated tasks could require more complicated external criteria, which might be better served by different algorithms (McNee, 2006).

2.2 Social Tagging

Under the Web 2.0 label, the past decade has seen a proliferation of social websites focusing on facilitating communication, user-centered design, information sharing, and collaboration. Examples of successful and popular social websites include wikis, social networking services, blogs, and websites that support content sharing, such as social bookmarking. An important component of many of these services is social tagging: giving users the power to describe and categorize content for their own purposes using tags. Tags are keywords that describe characteristics of the object they are applied to, and can be made up of one or more words. Tags are not imposed upon users in a top-down fashion by making users choose only from a pre-determined set of terms; instead, users are free to apply any type and any number of tags to an object, resulting in true bottom-up classification. Users cannot make wrong choices when deciding to apply certain tags, since their main motivation is to tag for their own benefit: making it easier to manage their content and re-find specific items in the future.

Many other names have been proposed for social tagging, including collaborative tagging, folk classification, ethno-classification, distributed classification, social classification, open tagging, and free tagging (Mathes, 2004; Hammond et al., 2005). We use the term social tagging because it is common in the literature, and because it involves the activity of labeling objects on social websites. We do see a difference between collaborative tagging and social tagging, and explain this in more detail in Subsection 2.2.2.
Although there is no inherent grouping or hierarchy in the tags assigned by users, some researchers have classified tags into different categories. A popular classification is that by Golder and Huberman (2006), who divide tags into seven categories based on the function they perform. Table 2.1 lists the seven categories. The first four categories are extrinsic to the tagger and describe the actual item they annotate, with significant overlap between individual users. The bottom three categories are user-intrinsic: the information they provide is relative to the user (Golder and Huberman, 2006).

Table 2.1: A tag categorization scheme according to tag function, taken directly from Golder and Huberman (2006). The first four categories are extrinsic to the tagger; the bottom three categories are user-intrinsic.

Function                   Description
Aboutness                  Aboutness tags identify the topic of an item, and often include
                           common and proper nouns. An example could be the tags crisis
                           and bailout for an article about the current economic crisis.
Resource type              This type of tag identifies what kind of object an item is.
                           Examples include recipe, book, and blog.
Ownership                  Ownership tags identify who owns or created the item. An
                           example would be the tags golder or huberman for the article
                           by Golder and Huberman (2006).
Refining categories        Golder and Huberman (2006) argue that some tags do not stand
                           alone, but refine existing categories. Examples include years
                           and numbers such as 2009 or 25.
Qualities/characteristics  Certain tags represent characteristics of the bookmarked
                           items, such as funny, inspirational, or boring.
Self-reference             Self-referential tags identify content in terms of its
                           relation to the tagger, such as myown or mycomments.
Task organizing            Items can also be tagged according to tasks they are related
                           to. Popular examples are toread and jobsearch.

The foundations for social tagging can be said to have been laid by Goldberg et al. (1992) in their TAPESTRY system, which allowed users to annotate documents and e-mail messages. These annotations could range from short keywords to longer textual descriptions, and could be shared among users. The use of social tagging as we know it now was pioneered by Joshua Schachter when he created the social bookmarking site Delicious in September 2003 (Mathes, 2004). Other social websites, such as the photo sharing website Flickr, adopted social tagging soon afterwards.⁷ Nowadays, most social content sharing websites support tagging in one way or another. Social tagging has been applied to many different domains, such as bookmarks (http://www.delicious.com), photos (http://www.flickr.com), videos (http://www.youtube.com), books (http://www.librarything.com), scientific articles (http://www.citeulike.org), movies (http://www.movielens.org), music (http://www.last.fm/), slides (http://www.slideshare.net/), news articles (http://slashdot.org/), museum collections (http://www.steve.museum/), activities (http://www.43things.com), people (http://www.consumating.com), blogs (http://www.technorati.com), and in enterprise settings (Farrell and Lau, 2006).

⁷ Although Delicious was the first to popularize the use of social tagging to describe content, there are earlier examples of websites that allowed user-generated annotation of content. See Section 2.3 for some examples.

Early research into social tagging focused on comparing tagging to the traditional methods of cataloguing by library and information science professionals.
We discuss these comparisons in Subsection 2.2.1, and describe under what conditions social tagging and other cataloguing methods are the best choice. Then, in Subsection 2.2.2, we distinguish two types of social tagging that result from the way social tagging is typically implemented on websites. These implementation choices can influence the network structure of users, items, and tags that emerges, and thereby the recommendation algorithms that operate on it. We complete the current section in Subsection 2.2.3 by providing some insight into the use of a social graph for representing social tagging.

2.2.1 Indexing vs. Tagging

Early academic work on social tagging focused mostly on the contrast between social tagging and other subject indexing schemes, i.e., describing a resource by index terms to indicate what the resource is about. Mathes (2004) distinguishes between three different groups that can be involved in this process: intermediaries, creators, and users. Intermediary indexing by professionals has been an integral part of the field of library and information science since its inception. It is aimed at classifying and indexing resources by using thesauri or hierarchical classification systems. By using such controlled vocabularies, i.e., sets of
pre-determined, allowed descriptive terms, indexers can first control for ambiguity by selecting which terms are the preferred ones and are appropriate to the context of the intended user, and then link synonyms to their favored variants (cf. Kipp, 2006). According to Lancaster (2003), indexing typically involves two steps: conceptual analysis and translation. The conceptual analysis stage is concerned with determining the topic of a resource and which parts are relevant to the intended users. In the translation phase, the results of the conceptual analysis are translated into the appropriate index terms, which can be difficult to do consistently, even between professional indexers (Lancaster, 2003; Voß, 2007). Another problem of intermediary indexing is scalability: the explosive growth of content means it is intractable for the relatively small group of professional indexers to describe all content.

Index terms and keywords can also be assigned by the creators of a resource. This is common practice in the world of scientific publishing, and popular initiatives, such as the Dublin Core Metadata Initiative⁸, have also been used with some success (Mathes, 2004). In general, however, creator indexing has not received much attention (Kipp, 2006). By shifting the annotation burden away from professional indexers to the resource creators themselves, scalability problems can be reduced, but the lack of communication between creators of different resources makes it more difficult to select consistent index terms. A second problem is that the amount and quality of creator-supplied keywords are highly dependent on the domain. For instance, Web pages have been shown to lack useful metadata on many occasions (Hawking and Zobel, 2007) in the context of supporting retrieval.

⁸ http://dublincore.org/

With social tagging, the responsibility of describing resources is placed on the shoulders of the users. This increases scalability even further over creator indexing, as each user is made responsible for describing his own resources. It also ensures that the keywords assigned to resources by a user are directly relevant to that user.

One of the key differences between the three different indexing schemes is the level of coordination between the people doing the indexing, i.e., what terms are allowed for describing resources. Intermediary indexing requires the highest degree of coordination, whereas social tagging requires no explicit coordination between users. The level of coordination required for creator indexing lies somewhere in between. These differences in coordination were confirmed by Kipp (2006), who investigated the distribution of index terms and tags for the three different indexing methods. For a small collection of scientific articles, Kipp (2006) compared tags, author-assigned keywords, and index terms assigned by professional indexers. She showed that the distribution of terms of the latter two indexing approaches was different from the tag distribution, which showed a considerably longer tail of tags that were assigned only once. The larger variety of tags is a direct result of the lower level of coordination. Her findings hinted at the presence of a power law in the tag distribution, which was later confirmed by, among others, Shen and Wu (2005) and Catutto et al. (2007).

It is important to remark that, despite its growing popularity, social tagging is not necessarily the most appropriate method of describing resources in every situation. In certain situations, indexing by intermediaries is still the preferred approach according to Shirky (2005). Such situations are characterized by a small, stable collection of objects with clear,
formal categories. The cataloguers themselves have to be experts on the subject matter, and users have to be experts in using the classification system as well. Examples of such collections include library collections or the periodic table of elements. In contrast, social tagging works best for large, dynamic, heterogeneous corpora where users cannot be expected to gain expertise in a coordinated classification scheme (Shirky, 2005). The World Wide Web is a good example of such a scenario, with billions of Web pages that vary wildly in topic and quality.

2.2.2 Broad vs. Narrow Folksonomies

The aggregation of the tagging efforts of all users is commonly referred to as a folksonomy, a portmanteau of ‘folk’ and ‘taxonomy’, implying a classification scheme made by a group of users. Like the hierarchical classification schemes designed by professional indexers to organize knowledge, a folksonomy allows any user to navigate freely between items, tags, and other users. The term was coined by Vander Wal (2005b), who defines a folksonomy as the result of “personal free tagging of information and objects (anything with a URL) for one’s own retrieval”. Different variations on this definition have been proposed in the past. In some definitions, only tags that are assigned by a user for his own benefit are considered to be a part of the folksonomy. Consequently, tagging for the benefit of others is excluded from the folksonomy according to such definitions (Lawley, 2006). We refer the reader to Chopin (2007) for an extensive overview and discussion of the different definitions. For the recommendation experiments described in this thesis, we do not take into account the different motivations that users might have when tagging items. We define a folksonomy as “an aggregated network of users, items, and tags from a single system supporting social tagging”.
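
To make the working definition above more concrete, the following minimal sketch stores a folksonomy as a list of (user, item, tag) assignments and builds a few lookup indexes for navigating between users, items, and tags. The class name `TagAssignment`, the function `index_folksonomy`, and the toy data are hypothetical and chosen purely for illustration; they are not taken from any of the systems discussed here.

```python
from collections import defaultdict
from typing import NamedTuple


class TagAssignment(NamedTuple):
    """One act of tagging: a user applies a tag to an item (hypothetical type)."""
    user: str
    item: str
    tag: str


# Hypothetical toy data, not drawn from any real system.
assignments = [
    TagAssignment("alice", "item1", "python"),
    TagAssignment("alice", "item1", "toread"),
    TagAssignment("bob", "item1", "python"),
    TagAssignment("bob", "item2", "recipe"),
]


def index_folksonomy(assignments):
    """Build simple lookup tables for navigating the folksonomy."""
    items_by_tag = defaultdict(set)
    users_by_tag = defaultdict(set)
    tags_by_user = defaultdict(set)
    for a in assignments:
        items_by_tag[a.tag].add(a.item)
        users_by_tag[a.tag].add(a.user)
        tags_by_user[a.user].add(a.tag)
    return items_by_tag, users_by_tag, tags_by_user


items_by_tag, users_by_tag, tags_by_user = index_folksonomy(assignments)
```

Starting from a tag, `items_by_tag` leads to the items it annotates and `users_by_tag` to the users who applied it, which mirrors the free navigation between items, tags, and users described above.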
Vander Wal (2005a) distinguishes between two types of folksonomies, depending on how social tagging is implemented on the website: broad folksonomies and narrow folksonomies. Figure 2.4 illustrates these two types of folksonomies. The essential difference between a broad and a narrow folksonomy is who is allowed to tag a resource: every user interested in the resource, or only the creator of the resource. This dichotomy is caused by the nature of the resources being tagged in the system.

Broad folksonomies emerge in social tagging scenarios where the resources being tagged are publicly available, and were not necessarily created by the people who tagged and added them. Web page bookmarks are a good example: any user can bookmark a Web page and many pages will be useful to more than one user. More often than not, the bookmarked Web pages were not created by the user who added them to his profile. This means that interested users will add their own version of the bookmarked URL with their own metadata and tags. Figure 2.4(a) illustrates this case for a single example resource. The ‘initiator’, the first user to add the resource to the system, has tagged the resource with tags A and C when he added it to the system. Users in group 1 added the post with tags A and B, group 2 with tags A and C, and group 3 with tags C and D. Notice that tags B and D were not added by the initiator, although the initiator may later retrieve the resource with tag D. The two users in group 4 never add the resource, but find it using tags B and D later. Figure 2.4(a) illustrates the collaborative nature of tagging the resource: a tag can
be applied multiple times to the same resource. This is why we refer to this scenario as collaborative tagging.

[Figure 2.4 panels: (a) Broad folksonomy, with an initiator, a resource, tags A to D, and user groups 1 to 4; (b) Narrow folksonomy, with a creator, a resource, tags A to C, and user groups 1 to 4.]

Figure 2.4: Two types of folksonomies: broad and narrow. The figure is slightly adapted from Vander Wal (2005a). The four user groups on the right side of each figure denote groups of users that share the same vocabulary. An arrow pointing from a user (group) to a tag means that the tag was added by that user (group). An arrow pointing from a tag to a user (group) means that the tag is part of the vocabulary of that user (group) for retrieving the resource. The creator/initiator is the user who was the first to create or add the resource to the system.

In contrast, a narrow folksonomy, as illustrated in Figure 2.4(b), emerges when only the creator of a resource can tag the item. For example, in the case of the video sharing website YouTube, a user can create a video and upload it to the website. This user is the original creator of the video and adds tags A, B, and C to the video to describe it after uploading. As a consequence, each tag is applied only once to a specific resource in a narrow folksonomy. We therefore refer to this scenario as individual tagging. All other users are dependent on the creator’s vocabulary in a narrow folksonomy: users in group 1 can use tag A to locate the resource, users in group 2 may use tags A and B, and group 4 users may use tag C. User 3, however, cannot find the resource because his vocabulary does not overlap with that of the creator.

In this thesis we look at recommendation of bookmarks and scientific articles. These items are typically added and tagged by many different users, and result in broad folksonomies. We do not focus on recommendation for individual tagging systems, only for collaborative tagging.
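
The difference between the two scenarios can be sketched with the tags from Figure 2.4. The snippet below is only an illustration under stated assumptions: each user group is treated as a single tagger, and the retrieval vocabulary assigned to group 3 in the narrow case (tag D) is hypothetical, since the text only states that it does not overlap with the creator's tags.

```python
from collections import Counter

# Broad folksonomy (Figure 2.4a): several users tag the same resource.
# Assumption: each group is counted as one tagger.
broad_posts = {
    "initiator": {"A", "C"},
    "group1": {"A", "B"},
    "group2": {"A", "C"},
    "group3": {"C", "D"},
}
# How often each tag was applied to the resource.
broad_counts = Counter(tag for tags in broad_posts.values() for tag in tags)

# Narrow folksonomy (Figure 2.4b): only the creator tags the resource.
narrow_posts = {"creator": {"A", "B", "C"}}
narrow_counts = Counter(tag for tags in narrow_posts.values() for tag in tags)

# Retrieval vocabularies in the narrow case. The vocabulary of group 3
# is a hypothetical placeholder that shares no tag with the creator.
vocabularies = {
    "group1": {"A"},
    "group2": {"A", "B"},
    "group3": {"D"},
    "group4": {"C"},
}
creator_tags = narrow_posts["creator"]
# A group can locate the resource only if it shares a tag with the creator.
findable = {group: bool(vocab & creator_tags)
            for group, vocab in vocabularies.items()}
```

In the broad case the per-resource tag counts exceed one for tags A and C, which is the kind of frequency information collaborative tagging makes available; in the narrow case every count is exactly one, and whether a user can find the resource depends entirely on overlap with the creator's vocabulary.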
2.2.3 The Social Graph

Figure 2.4 illustrates the concept of a folksonomy for a single item, but this mini-scenario occurs for many different items on websites that support social tagging. With its myriad of connections between users, items, and tags, a folksonomy is most commonly represented as an undirected tripartite graph. Mika (2005) and Lambiotte and Ausloos (2006) were the first to do so. Figure 2.5 illustrates such a social graph
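
As a rough illustration of the tripartite representation, the sketch below turns a list of (user, item, tag) assignments into an undirected graph stored as adjacency sets. It assumes one simple construction in which every assignment contributes three undirected edges (user to item, user to tag, item to tag); this is our own simplification for illustration and not necessarily the exact formulation used by the cited authors. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy assignments: (user, item, tag).
assignments = [
    ("alice", "item1", "python"),
    ("bob", "item1", "python"),
    ("bob", "item2", "recipe"),
]


def build_tripartite_graph(assignments):
    """Return an undirected graph as {node: set of neighbouring nodes}.

    Nodes are (kind, name) pairs, with kind in {"user", "item", "tag"},
    so that a user and a tag with the same name stay distinct.
    """
    graph = defaultdict(set)
    for user, item, tag in assignments:
        u, i, t = ("user", user), ("item", item), ("tag", tag)
        for a, b in ((u, i), (u, t), (i, t)):
            graph[a].add(b)  # store both directions,
            graph[b].add(a)  # because the graph is undirected
    return dict(graph)


def is_tripartite(graph):
    """Check that no edge connects two nodes of the same kind."""
    return all(node[0] != neighbour[0]
               for node, neighbours in graph.items()
               for neighbour in neighbours)


graph = build_tripartite_graph(assignments)
```

With this layout, the neighbours of a tag node are the users who applied the tag and the items it was applied to, so moving along edges corresponds to browsing the folksonomy.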