Chapter 1. Introduction

reliable estimate of the true generalization error of our recommendation algorithms (Weiss and Kulikowski, 1991). Our experimental setup is described in more detail in Section 3.4.

The fifth and final part of our research methodology involves designing protection for social bookmarking websites and our proposed recommendation algorithms against two growing pains: spam and duplicate content. For both pains we analyze how extensive the problem is for one of our data sets. We then design algorithms to automatically detect this problematic content. Finally, we perform a robustness analysis of the recommendation algorithms we proposed in Part I against spam and duplicates.

1.5 Organization of the Thesis

The thesis consists of two main parts. The first part is ‘Recommending Bookmarks’, which ranges from Chapter 3 to Chapter 6, both inclusive. The second main part is ‘Growing Pains’, which covers Chapters 7 and 8. The two parts are preceded by two introductory chapters. Chapter 1 contains an introduction to the thesis as well as the formulation of a problem statement and five research questions. Moreover, the research methodology is given. Chapter 2 contains a literature review. Below we provide a brief overview of the contents of Parts I and II.

Part I

The core of our recommendation experiments is contained in Part I. It starts in Chapter 3 with an overview of the building blocks of our quantitative evaluation. We start by formally defining the recommendation task we are trying to solve: recommending interesting items to users based on their preferences. We then introduce our data sets and describe how they were collected. We discuss our experimental setup, from data pre-processing and filtering to our choice of evaluation metrics. Then, in Chapter 4 we investigate the first of two important characteristics of social bookmarking systems: the presence of the folksonomy. We propose and compare different options for using the folksonomy in item recommendation for social bookmarking (RQ 1). In Chapter 5, we investigate the usefulness of item metadata in the recommendation process, the other characteristic of social bookmarking websites, and examine how we can use this metadata to improve recommendation performance (RQ 2). Chapter 6 concludes Part I by examining different options for combining these two characteristics, to see if we can improve upon our best-performing algorithms from Chapters 4 and 5 by combining the output of the different algorithms into a new list of recommendations (RQ 3).

Part II

In Part II, we dive into the periphery of recommendation, and zoom in on two specific growing pains that social bookmarking services encounter as they become more popular: spam and duplicate content. In Chapter 7 we examine the problem of spam in social bookmarking websites and attempt to quantify its extent. We propose a spam detection algorithm based on a combination of language models and the
nearest-neighbor algorithm. We also investigate the influence spam can have on recommending items for social bookmarking websites in a case study on one of our data sets (RQ 4). In Chapter 8 we take a similar approach to the problem of duplicate content. We start by quantifying the problem and then construct a classifier that can automatically identify duplicate item pairs in one of our data sets. Finally, we investigate the influence of duplicates on recommendation performance in a case study on one of our data sets.

Part III concludes the thesis. We revisit our research questions and the answers we found. Then we answer the problem statement and formulate our conclusions. We also list future research directions, drawing on the work described in this thesis.

Additional information that would interrupt the narrative is placed in Appendices A, B, and C, and referred to in the text where applicable. We also list some mark-up conventions here. Tags feature prominently in our thesis; for clarity we print them in a sans-serif font, e.g., as information-retrieval or toread. We print metadata fields in a fixed-width font, e.g., as TITLE or ABSTRACT. We experiment with many different combinations of algorithms and similarity metrics, each resulting in a set of recommendations for a group of test users. We will refer to such output as a recommendation run, which is made up of a list of recommendations. Different variants of an algorithm are marked up as RUNNAME1 or RUNNAME2 where it helps to clarify the discussion.

1.6 Origins of the Material

This thesis is partly based on papers that have already been published or have been submitted. Early versions of the experiments with the CiteULike data set in Part I were described in Bogers and Van den Bosch (2008a), and later expanded to include more algorithms and more data sets in Bogers and Van den Bosch (2009b). In an earlier stage, the foundations of the work on content-based recommendation were laid in Bogers and Van den Bosch (2007). Finally, the work on spam detection described in Part II was published first in Bogers and Van den Bosch (2008b) and in expanded form in Bogers and Van den Bosch (2009a).
CHAPTER 2

RELATED WORK

The work presented in this thesis is novel in (1) its application of recommendation algorithms to social bookmarking websites, and (2) its incorporation of the information represented by the folksonomy and the metadata on those websites. This chapter serves as an introduction to general related work covering some subareas of our work. All related work specifically relevant to the work described in each of the following chapters will be discussed in those respective chapters.

We start this chapter in Section 2.1 by introducing recommender systems: first, a brief history of the field will be given, followed by the most popular algorithms and applications, as well as the most common problems that recommender systems suffer from. Then, we take a more detailed look at related work on recommending Web bookmarks and references to scientific articles. We briefly discuss the role of the user and context in recommendation. In Section 2.2 we closely consider the phenomenon of social tagging, which has become a big part of the Web 2.0 paradigm. We compare it to the traditional view on indexing by information specialists and we discuss two different types of social tagging. We establish that social bookmarking services are a class of social websites that lean heavily on social tagging for managing their content. Section 2.3 discusses the rise of social bookmarking services and their characteristics, as well as research into the different user tasks performed on social bookmarking websites.

2.1 Recommender Systems

The explosive growth of the World Wide Web in the 1990s resulted in a commensurate growth of the amount of information available online, outgrowing the capacity of individual users to process all this information. This prompted a strong interest in the specific research fields and technology that could help manage this information overload. The most characteristic fields are information retrieval and information filtering. As a research field, information retrieval (IR) originated in the 1950s and is concerned with automatically matching a user’s information need against a collection of documents. The 1990s saw a change
in focus from small document collections to the larger collections of realistic size needed to cope with the ever-growing amount of information on the Web. Information filtering (IF) systems aim to help users make optimal use of their limited reading time by filtering out unwanted information in order to expose only the relevant information from a large flow of information, such as news articles (Hanani et al., 2001). Typically, such systems construct a model of a user’s interests and match that against the incoming information objects. While IR and IF are considered separate research fields, they share many characteristics, such as a focus on analyzing textual content (Belkin and Croft, 1992).

A third type of technology designed to combat information overload is recommender systems, which have their origin in the field of information filtering (Hanani et al., 2001). A recommender system is used to identify sets of items that are likely to be of interest to a certain user, exploiting a variety of information sources related to both the user and the content items. In contrast to information filtering, recommender systems actively predict which items the user might be interested in and add them to the information flowing to the user, whereas information filtering technology is aimed at removing items from the information stream (Hanani et al., 2001). Over the past two decades many different recommendation algorithms have been proposed for many different domains. In Subsections 2.1.1 through 2.1.3, we discuss the three most common classes of algorithms: collaborative filtering, content-based filtering, and knowledge-based recommendation. We explain how the algorithms work, and discuss the most important advantages and problems of each algorithm. Then, in Subsection 2.1.4, we discuss related work on recommending Web pages and scientific articles. We conclude this section by briefly discussing the role of the user and context in the recommendation process in Subsection 2.1.5.

Finally, we would like to remark that recommender systems technology has also been applied to a variety of different domains, such as online stores (Linden et al., 2003), movies (Herlocker et al., 1999), music (Celma, 2008), Web pages (Joachims et al., 1997), e-mail (Goldberg et al., 1992), books (Mooney and Roy, 2000), news articles (Das et al., 2007), scientific articles (Budzik and Hammond, 1999), and even jokes (Goldberg et al., 2001). In October 2006, it also spawned a million-dollar competition in the form of the Netflix Prize,1 aimed at designing a recommender system that significantly outperforms Netflix’s own CineMatch system. We refer the reader to Montaner et al. (2003) for a more comprehensive overview of recommender systems and domains.

2.1.1 Collaborative Filtering

Arguably the most popular class of recommendation algorithms, collaborative filtering (CF), descends from work in the area of information filtering. CF algorithms try to automate the process of “word-of-mouth” recommendation: items are recommended to a user based on how like-minded users rated those items (Shardanand and Maes, 1995). The term ‘collaborative filtering’ was first used by Goldberg et al. (1992), to describe their Tapestry filtering system, which allowed users to annotate documents and e-mail messages. Other users could then request documents annotated by certain people, but identifying these people was left

1 http://www.netflixprize.com/