CHAPTER 1
INTRODUCTION

For the past two decades, the World Wide Web has expanded at an enormous rate. The first generation of the World Wide Web (WWW) enabled users to have instantaneous access to a large diversity of knowledge items. The second generation of the WWW is usually denoted by Web 2.0. It signifies a fundamental change in the way people interact with and through the World Wide Web. Web 2.0 is also referred to as the participatory Web. It can be characterized as a paradigm that facilitates communication, interoperability, user-centered design, and information sharing and collaboration on the Web (O’Reilly, 2005; Sharma, 2008). Moreover, in the transition to Web 2.0 we see a paradigm shift from local and solitary to global and collaborative. This shift coincides with a shift from accessing and creating information to understanding information and understanding the people who deal with this information. Instead of creating, storing, managing, and accessing information on only one specific computer or browser, information management and access have been moving to many distributed places on the Web. Collaboratively created websites such as Wikipedia are edited and accessed by anyone, and users can document and share any aspect of their lives online using blogs, social networking sites, and video and photo sharing sites.

This thesis deals with recommender systems, social tagging, and social bookmarking. What are the relations between these three elements, and can we build recommender systems that profit from the presence of the other two elements? Assuming that we can, what are the threats from the outside or inside of this new part of the WWW? In the thesis we deal with spam as the outside threat, and duplicates as the inside threat. The aim of the thesis is to understand the symbiosis of recommender systems, social tagging, and social bookmarking, and to design mechanisms that successfully counter the threats from the outside and from the inside.

This chapter is organized as follows. We introduce social bookmarking in Section 1.1. It is followed by a description of the scope of the thesis in Section 1.2. The problem statement and five research questions are formulated in Section 1.3. Section 1.4 describes the research methodology. The structure of the thesis is provided in Section 1.5. Finally, Section 1.6 points to the origins of the material.

1.1 Social Bookmarking

Social bookmarking is a rather new phenomenon: instead of keeping a local copy of pointers to favorite URLs, users can store and access their bookmarks online through a Web interface. The underlying application then makes all stored information shareable among users. Closely related to social bookmarking websites are the so-called social reference managers, which follow the same principle, but with a focus on the online management and access of scientific articles and papers. Social bookmarking websites have seen a rapid growth in popularity and a high degree of activity by their users. For instance, Delicious¹ is one of the most popular social bookmarking services. It received an average of 140,000 posts per day in 2008 according to the independently sampled data collected by Philipp Keller².

In addition to the aforementioned functionality, most social ‘storage’ services also offer users the opportunity to describe by keywords the content they added to their personal profile. These keywords are commonly referred to as tags. They complement, e.g., the title and summary metadata commonly used to annotate content, and improve the access to and retrievability of a user’s own bookmarked Web pages. These tags are then made available to all users, many of whom have annotated many of the same Web pages with possibly overlapping tags. This results in a rich network of users, bookmarks, and tags, commonly referred to as a folksonomy. This social tagging phenomenon and the resulting folksonomies have become a staple of many Web 2.0 websites and services (Golder and Huberman, 2006).

The emerging folksonomy on a social bookmarking website can be used to enhance a variety of tasks, such as searching for specific content. It can also enable the active discovery of new content by allowing users to browse through the richly connected network.
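
As an illustration only, the network of users, bookmarks, and tags described above can be pictured as a set of (user, bookmark, tag) triples with a few lookup tables built on top of it. The sketch below is a minimal example under that assumption. The user names, URLs, tags, and the helper `build_indexes` are hypothetical and do not come from any of the services mentioned.

```python
from collections import defaultdict

# Hypothetical toy data: each post is a (user, bookmark, tag) triple.
posts = [
    ("alice", "http://example.org/a", "python"),
    ("alice", "http://example.org/b", "recsys"),
    ("bob", "http://example.org/a", "python"),
    ("bob", "http://example.org/c", "python"),
    ("carol", "http://example.org/b", "recsys"),
]


def build_indexes(triples):
    """Index the triples by tag and by bookmark to support simple browsing."""
    bookmarks_by_tag = defaultdict(set)
    users_by_bookmark = defaultdict(set)
    for user, bookmark, tag in triples:
        bookmarks_by_tag[tag].add(bookmark)
        users_by_bookmark[bookmark].add(user)
    return bookmarks_by_tag, users_by_bookmark


bookmarks_by_tag, users_by_bookmark = build_indexes(posts)

# All bookmarks that any user annotated with the tag "python".
python_bookmarks = bookmarks_by_tag["python"]

# All users who added one particular bookmark.
fans_of_b = users_by_bookmark["http://example.org/b"]
```

The two lookups correspond to browsing by tag and browsing by bookmark. A real system would add further indexes, for example tags per user.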
A user could select one of his³ tags to explore all bookmarks annotated with that tag by the other users in the network, or locate like-minded users by examining a list of all other users who added a particular bookmark, possibly resulting in serendipitously discovered content (Marlow et al., 2006).

Both browsing and searching the system, however, require active user participation to locate new and interesting content. As the system increases in popularity and more users as well as content enter the system, these access methods become less effective at finding all the interesting content present in the system. The information overload problem caused by this growing influx of users and content means that search and browsing, which require active participation, are not always the most practical or preferable ways of locating new and interesting content. Typically, users only have a limited amount of time to go through the search results. Assuming users know about the existence of the relevant content and know how to formulate the appropriate queries, they may arrive in time at the preferred places. But what happens when the search and browse process becomes less effective? And what if the user does not know about all relevant content available in the system?

¹ http://www.delicious.com/
² Available at http://deli.ckoma.net/stats; last visited January 2009.
³ In this thesis, we use ‘his’ and ‘he’ to refer to both genders.

Our interest, and the focus of this thesis, lies in using recommender systems to help the user with this information overload problem, and automatically find interesting content for the user. A recommender system is a type of personalized information filtering technology used to identify sets of items that are likely to be of interest to a certain user, using a variety of
information sources related to both the user and the content items (Resnick and Varian, 1997).

1.2 Scope of the Thesis

In this thesis, we investigate how recommender systems can be applied to the domain of social bookmarking. More specifically, we want to investigate the task of item recommendation. For this purpose, interesting and relevant items (bookmarks or scientific articles) are retrieved and recommended to the user. Recommendations can be based on a variety of information sources about the user and the items. It is a difficult task, as we are trying to predict which items out of a very large pool would be relevant given a user’s interests, as represented by the items which the user has added in the past. In our experiments we distinguish between two types of information sources. The first one is usage data contained in the folksonomy, which represents the past selections and transactions of all users, i.e., who added which items, and with what tags. The second information source is the metadata describing the bookmarks or articles on a social bookmarking website, such as title, description, authorship, tags, and temporal and publication-related metadata. We are among the first to investigate this content-based aspect of recommendation for social bookmarking websites. We compare and combine the content-based aspect with the more common usage-based approaches.

Because of the novelty of applying recommender systems to social bookmarking websites, there is not a large body of related work, results, and design principles to build on. We therefore take a system-based approach for the evaluation of our work. We try to simulate, as realistically as possible, the reaction of the user to different variants of the recommendation algorithms in a controlled laboratory setting. We focus on two specific domains: (1) recommending bookmarks of Web pages and (2) recommending bookmarked references to scientific articles.
It is important to remark, however, that a system-based evaluation can only provide us with a provisional estimate of how well our algorithms are doing. User satisfaction is influenced by more than just recommendation accuracy (Herlocker et al., 2004), and it would be essential to follow up our work with an evaluation with real users in realistic situations. However, this is not the focus of the thesis, nor will we focus on tasks such as tag recommendation or finding like-minded users. We focus strictly on recommending items.

1.3 Problem Statement and Research Questions

As stated above, the rich information contained in social bookmarking websites can be used to support a variety of tasks. We consider three important ones: browsing, search, and recommendation. From these three, we focus on (item) recommendation in this thesis. In this context we may identify two types of key characteristics of social bookmarking websites that can be used in the recommendation process. We remark that the information sources represented by these characteristics are not always simultaneously available in every recommendation scenario. The resulting recommendations are produced by (1) collaborative
filtering algorithms and (2) content-based filtering algorithms. We briefly discuss both types of algorithms and the associated characteristics below.

Collaborative filtering algorithms. Much of the research in recommender systems has focused on exploiting sets of usage patterns that represent user preferences and transactions. The class of algorithms that operates on this source of information is called collaborative filtering (CF). CF algorithms automate the process of “word-of-mouth” recommendation: items are recommended to a user based on how like-minded users rated those items (Goldberg et al., 1992; Shardanand and Maes, 1995). In the social bookmarking domain, we have an extra layer of usage data at our disposal in the folksonomy in the form of tags. This extra layer of collaboratively generated tags binds the users and items of a system together in yet another way, opening up many possibilities for new algorithms that can take advantage of this data.

Content-based filtering algorithms. Social bookmarking services, and especially social reference managers, are also characterized by the rich metadata describing the content added by their users. Recommendation on the basis of textual information is commonly referred to as content-based filtering (Goldberg et al., 1992); it matches the item metadata against a representation of the user’s interest to produce new recommendations. The metadata available on social bookmarking services describe many different aspects of the items posted to the website. They may comprise both personal information, such as reviews and descriptions, and general metadata that is the same for all users. While the availability of metadata is not unique to social bookmarking (movie recommenders, for instance, also have a rich set of metadata at their disposal; Paulson and Tzanavari, 2003), it might be an important information source for generating item recommendations.
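
To make the contrast between the two families concrete, the following is a minimal sketch and not the algorithms developed in this thesis. It assumes binary user profiles (sets of added items), a user-based neighbourhood scheme for CF, and plain term overlap for content-based filtering, with Jaccard similarity used in both cases as a simple stand-in. All data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: items each user has added, and item metadata text.
profiles = {
    "alice": {"a", "b"},
    "bob": {"a", "b", "c"},
    "carol": {"d"},
}
metadata = {
    "a": "python tutorial",
    "b": "python recommender systems",
    "c": "recommender systems survey",
    "d": "cooking recipes",
}


def jaccard(x, y):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def recommend_cf(user):
    """User-based CF: unseen items are scored by the similarity of the users who added them."""
    scores = Counter()
    for other, items in profiles.items():
        if other == user:
            continue
        sim = jaccard(profiles[user], items)
        for item in items - profiles[user]:
            scores[item] += sim
    return [item for item, score in scores.most_common() if score > 0]


def recommend_content(user):
    """Content-based: unseen items are scored by term overlap with the user's own items."""
    user_terms = set()
    for item in profiles[user]:
        user_terms |= set(metadata[item].split())
    scores = {
        item: jaccard(user_terms, set(text.split()))
        for item, text in metadata.items()
        if item not in profiles[user]
    }
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return [item for item in ranked if scores[item] > 0]
```

In this toy setting both approaches suggest item "c" to "alice", but for different reasons: CF because a like-minded user added it, content-based filtering because its metadata shares terms with her own items. A user such as "carol", who overlaps with nobody, receives no suggestions from either sketch.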

Having distinguished the two types of characteristics of social bookmarking websites, we are now able to formulate our problem statement (PS).

PS How can the characteristics of social bookmarking websites be exploited to produce the best possible item recommendations for users?

To address this problem statement, we formulate five research questions. The first two research questions belong together. They read as follows.

RQ 1 How can we use the information represented by the folksonomy to support and improve the recommendation performance?

RQ 2 How can we use the item metadata available in social bookmarking systems to provide accurate recommendations to users?

After answering the first two questions, we know how best to exploit the two types of information sources (the folksonomy and item metadata) to produce accurate recommendations. This leads us to our third research question.

RQ 3 Can we improve performance by combining the recommendations generated by different algorithms?

These are the three main research questions. As mentioned earlier, we evaluate our answers to these questions by simulating the user’s interaction with our proposed recommendation algorithms in a laboratory setting. However, such an idealized perspective does not take into account the dynamic growth issues caused by the increasing popularity of social bookmarking websites. Therefore, we focus on two of these growing pains. The first, spam, attacks social bookmarking websites from the outside. The second, duplicate content, attacks a social bookmarking website from the inside. They lead to our final two research questions.

RQ 4 How big a problem is spam for social bookmarking services?

RQ 5 How big a problem is the entry of duplicate content for social bookmarking services?

Wherever it is applicable and aids our investigation, we will break down these questions into separate and even more specific research questions.

1.4 Research Methodology

The research methodology followed in the thesis comprises five parts: (1) reviewing the literature, (2) analyzing the findings, (3) designing the recommendation algorithms, (4) evaluating the algorithms, and (5) designing protection mechanisms for two growing pains.

First, we conduct a literature review to identify the main techniques, characteristics, and issues in the fields of recommender systems, social tagging, and social bookmarking, and in the intersection of the three fields. In addition, Chapters 4 through 8 each contain short literature reviews specifically related to the work described in the respective chapters.

Second, we analyze the findings from the literature. We use these in the third part of our methodology to guide us in the development of recommendation algorithms specifically suited for item recommendation on social bookmarking websites.

Fourth, we evaluate our recommendation algorithms in a quantitative manner. The building blocks of our quantitative evaluation are described in more detail in Chapter 3. Our quantitative evaluation is based on a so-called backtesting approach to evaluation that is common in recommender systems literature (Breese et al., 1998; Herlocker et al., 2004; Baluja et al., 2008). In backtesting, we evaluate on a per-user basis. We withhold randomly selected items from each user profile, and generate recommendations by using the remaining data as training material. If a user’s withheld items are predicted at the top of the ranked list of recommendations, then the algorithm is considered to perform well for that user. The performance of a recommendation algorithm is averaged over the performance for all individual users. In our evaluation we employ cross-validation, which can provide a