good items in order to screen out many bad ones. Our discussions with firms in the legal databases industry, however, led in the opposite direction. Lawyers searching for precedents feel it is very important not to overlook a single possible case. Indeed, they are willing to invest large amounts of time (and their client’s money) searching for that case. To use recommenders in their practice, they first need to be assured that the false negative rate can be made sufficiently low. As with annotation in context, coverage becomes particularly important in this task.

Recommend Sequence. We first noticed this task when using the personalized radio web site Launch (launch.yahoo.com), which streams music based on a variety of recommender algorithms. Launch has several interesting factors, including the desirability of recommending “already rated” items, though not too often. What intrigued us, though, is the challenge of moving from recommending one song at a time to recommending a sequence that is pleasing as a whole. This same task can apply to recommending research papers to learn about a field (read this introduction, then that survey, ...). While data mining research has explored product purchase timing and sequences, we are not aware of any recommender applications or research that directly address this task.

Just Browsing. Recommenders are usually evaluated based on how well they help the user make a consumption decision. In talking with users of our MovieLens system, of Amazon.com, and of several other sites, we discovered that many of them use the site even when they have no purchase imminent. They find it pleasant to browse. Whether one models this activity as learning or simply as entertainment, it seems that a substantial use of recommenders is simply using them without an ulterior motive. For those cases, the accuracy of algorithms may be less important than the interface, the ease of use, and the level and nature of information provided.

Find Credible Recommender. This is another task gleaned from discussions with users. It is not surprising that users do not automatically trust a recommender. Many of them “play around” for a while to see if the recommender matches their tastes well. We’ve heard many complaints from users who are looking up their favorite (or least favorite) movies on MovieLens—they don’t do this to learn about the movie, but to check up on us. Some users even go further. Especially on commercial sites, they try changing their profiles to see how the recommended items change. They explore the recommendations to try to find any hints of bias. A recommender optimized to produce “useful” recommendations (e.g., recommendations for items that the user does not already know about) may fail to appear trustworthy because it does not recommend movies the user is sure to enjoy but probably already knows about. We are not aware of any research on how to make a recommender appear credible, though there is more general research on making websites evoke trust [Bailey et al. 2001].

Most evaluations of recommender systems focus on the recommendations; however, if users don’t rate items, then collaborative filtering recommender systems can’t provide recommendations. Thus, evaluating if and why users would
contribute ratings may be important to communicate that a recommender system is likely to be successful. We will briefly introduce several different rating tasks.

Improve Profile. This is the rating task that most recommender systems have assumed. Users contribute ratings because they believe that they are improving their profile and thus improving the quality of the recommendations that they will receive.

Express Self. Some users may not care about the recommendations—what is important to them is that they be allowed to contribute their ratings. Many users simply want a forum for expressing their opinions. We conducted interviews with “power users” of MovieLens who had rated over 1000 movies (some over 2000 movies). What we learned was that these users were not rating to improve their recommendations. They were rating because it felt good. We particularly see this effect on sites like Amazon.com, where users can post reviews (ratings) of items sold by Amazon. For users with this task, issues may include the level of anonymity (which can be good or bad, depending on the user), the feeling of contribution, and the ease of making the contribution. While recommender algorithms themselves may not evoke self-expression, encouraging self-expression may provide more data, which can improve the quality of recommendations.

Help Others. Some users are happy to contribute ratings in recommender systems because they believe that the community benefits from their contribution. In many cases, they are also entering ratings in order to express themselves (see previous task). However, the two do not always go together.

Influence Others. An unfortunate fact that we and other implementers of web-based recommender systems have encountered is that there are users of recommender systems whose goal is to explicitly influence others into viewing or purchasing particular items. For example, advocates of particular movie genres (or movie studios) will frequently rate movies high on the MovieLens web site right before the movie is released to try and push others to go and see the movie. This task is particularly interesting, because we may want to evaluate how well the system prevents this kind of manipulation.

While we have briefly mentioned tasks involved in contributing ratings, we will not discuss them in depth in this paper, and rather focus on the tasks related to recommendation.

We must once again say that the list of tasks in this section is not comprehensive. Rather, we have used our experience in the field to filter out the task categories that (a) have been most significant in the previously published work, and (b) we feel are significant, but have not been considered sufficiently.

In the field of Human-Computer Interaction, it has been strongly argued that the evaluation process should begin with an understanding of the user tasks that the system will serve. When we evaluate recommender systems from the perspective of benefit to the user, we should also start by identifying the most important task for which the recommender will be used. In this section, we have provided descriptions of the most significant tasks that have been
identified. Evaluators should consider carefully which of the tasks described may be appropriate for their environment. Once the proper tasks have been identified, the evaluator must select a dataset to which evaluation methods can be applied, a process that will most likely be constrained by the user tasks identified.

3. SELECTING DATA SETS FOR EVALUATION

Several key decisions regarding data sets underlie successful evaluation of a recommender system algorithm. Can the evaluation be carried out offline on an existing data set, or does it require live user tests? If a data set is not currently available, can evaluation be performed on simulated data? What properties should the dataset have in order to best model the tasks for which the recommender is being evaluated? A few examples help clarify these decisions:

—When designing a recommender algorithm to recommend word processing commands (e.g., Linton et al. [1998]), one can expect users to have experienced 5–10% (or more) of the candidates. Accordingly, it would be unwise to select recommender algorithms based on evaluation results from movie or e-commerce datasets where ratings sparsity is much worse.

—When evaluating a recommender algorithm in the context of the Find Good Items task where novel items are desired, it may be inappropriate to use only offline evaluation. Since the recommender algorithm is generating recommendations for items that the user does not already know about, it is probable that the data set will not provide enough information to evaluate the quality of the items being recommended. If an item was truly unknown to the user, then it is probable that there is no rating for that user in the database. If we perform a live user evaluation, ratings can be gained on the spot for each item recommended.

—When evaluating a recommender in a new domain where there is significant research on the structure of user preferences, but no data sets, it may be appropriate to first evaluate algorithms against synthetic data sets to identify the promising ones for further study.

We will examine in the following subsections each of the decisions that we posed in the first paragraph of this section, and then discuss the past and current trends in research with respect to collaborative filtering data sets.

3.1 Live User Experiments vs. Offline Analyses

Evaluations can be completed using offline analysis, a variety of live user experimental methods, or a combination of the two. Much of the work in algorithm evaluation has focused on off-line analysis of predictive accuracy. In such an evaluation, the algorithm is used to predict certain withheld values from a dataset, and the results are analyzed using one or more of the metrics discussed in the following section. Such evaluations have the advantage that it is quick and economical to conduct large evaluations, often on several different datasets or algorithms at once. Once a dataset is available, conducting such an experiment simply requires running the algorithm on the appropriate subset of
that data. When the dataset includes timestamps, it is even possible to “replay” a series of ratings and recommendations offline. Each time a rating was made, the researcher first computes the prediction for that item based on all prior data; then, after evaluating the accuracy of that prediction, the actual rating is entered so the next item can be evaluated.
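To make the replay protocol concrete, the following is a minimal sketch of such an offline harness; it is not taken from any of the systems discussed in this article. The Rating record, the pluggable predict callback, and the use of mean absolute error as the summary statistic are illustrative assumptions, and any of the metrics discussed in the following section could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rating:
    user: str
    item: str
    value: float
    timestamp: int


def replay_evaluation(
    ratings: List[Rating],
    predict: Callable[[List[Rating], str, str], Optional[float]],
) -> float:
    """Replay a timestamped rating log: predict each rating from all strictly
    earlier ratings, score the prediction, then reveal the true rating so it
    can inform later predictions. Returns mean absolute error (MAE)."""
    history: List[Rating] = []
    absolute_errors: List[float] = []
    for rating in sorted(ratings, key=lambda r: r.timestamp):
        prediction = predict(history, rating.user, rating.item)
        if prediction is not None:  # an algorithm may decline to predict early on
            absolute_errors.append(abs(prediction - rating.value))
        history.append(rating)  # the actual rating is entered before the next step
    return sum(absolute_errors) / len(absolute_errors) if absolute_errors else float("nan")
```

A trivial baseline predictor, such as one that returns the active user's mean rating so far, can be passed as predict to sanity-check the harness before substituting a real collaborative filtering algorithm.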
Offline analyses have two important weaknesses. First, the natural sparsity of ratings data sets limits the set of items that can be evaluated. We cannot evaluate the appropriateness of a recommended item for a user if we do not have a rating from that user for that item in the dataset. Second, they are limited to objective evaluation of prediction results. No offline analysis can determine whether users will prefer a particular system, either because of its predictions or because of other, less objective criteria such as the aesthetics of the user interface.

An alternative approach is to conduct a live user experiment. Such experiments may be controlled (e.g., with random assignment of subjects to different conditions), or they may be field studies where a particular system is made available to a community of users that is then observed to ascertain the effects of the system. As we discuss later in Section 5.5, live user experiments can evaluate user performance, satisfaction, participation, and other measures.

3.2 Synthesized vs. Natural Data Sets

Another choice that researchers face is whether to use an existing dataset that may imperfectly match the properties of the target domain and task, or to instead synthesize a dataset specifically to match those properties. In our own early work designing recommender algorithms for Usenet News [Konstan et al. 1997; Miller et al. 1997], we experimented with a variety of synthesized datasets. We modeled news articles as having a fixed number of “properties” and users as having preferences for those properties. Our data set generator could cluster users together, spread them evenly, or present other distributions. While these simulated data sets gave us an easy way to test algorithms for obvious flaws, they in no way accurately modeled the nature of real users and real data. In their research on horting as an approach for collaborative filtering, Aggarwal et al. [1999] used a similar technique, noting however that such synthetic data is “unfair to other algorithms” because it fits their approach too well, and that this is a placeholder until they can deploy their trial.

Synthesized data sets may be required in some limited cases, but only as early steps while gathering data sets or constructing complete systems. Drawing comparative conclusions from synthetic datasets is risky, because the data may fit one of the algorithms better than the others.

On the other hand, there is new opportunity now to explore more advanced techniques for modeling user interest and generating synthetic data from those models, now that there exists data on which to evaluate the synthetically generated data and tune the models. Such research could also lead to the development of more accurate recommender algorithms with clearly defined theoretical properties.
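As an illustration of the property-based generator described at the start of this subsection, the sketch below is a minimal reconstruction of the idea rather than the actual generator used in the Usenet News experiments. The number of latent properties, the Gaussian cluster model, the 1–5 rating scale, and the observation density are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)


def synthesize_ratings(n_users=500, n_items=1000, n_properties=10,
                       n_clusters=5, spread=0.3, density=0.02, noise=0.5):
    """Generate a sparse synthetic user-item rating matrix.

    Each item gets a random vector over `n_properties`; each user gets a
    preference vector drawn around one of `n_clusters` cluster centers
    (a larger `spread` makes the population less clustered). A rating is
    the user-item dot product rescaled to a 1-5 star range plus noise,
    and only a `density` fraction of entries is kept as observed.
    """
    items = rng.normal(size=(n_items, n_properties))
    centers = rng.normal(size=(n_clusters, n_properties))
    membership = rng.integers(n_clusters, size=n_users)
    users = centers[membership] + spread * rng.normal(size=(n_users, n_properties))

    affinity = users @ items.T + noise * rng.normal(size=(n_users, n_items))
    lo, hi = affinity.min(), affinity.max()
    stars = 1 + 4 * (affinity - lo) / (hi - lo)          # rescale to a 1-5 scale

    observed = rng.random(size=stars.shape) < density    # sparse observation mask
    return np.where(observed, np.round(stars), np.nan)   # NaN marks "not rated"


ratings = synthesize_ratings()
```

Varying n_clusters and density exercises an algorithm under different clustering and sparsity conditions, which matches the limited role argued for above: catching obvious flaws rather than supporting comparative conclusions.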
3.3 Properties of Data Sets

The final question we address in this section on data sets is “what properties should the dataset have in order to best model the tasks for which the recommender is being evaluated?” We find it useful to divide data set properties into three categories: Domain features reflect the nature of the content being recommended, rather than any particular system. Inherent features reflect the nature of the specific recommender system from which data was drawn (and possibly from its data collection practices). Sample features reflect distribution properties of the data, and often can be manipulated by selecting the appropriate subset of a larger data set. We discuss each of these three categories here, identifying specific features within each category.

Domain Features of interest include
(a) the content topic being recommended/rated and the associated context in which rating/recommendation takes place;
(b) the user tasks supported by the recommender;
(c) the novelty need and the quality need;
(d) the cost/benefit ratio of false/true positives/negatives;
(e) the granularity of true user preferences.

Most commonly, recommender systems have been built for entertainment content domains (movies, music, etc.), though some testbeds exist for filtering document collections (Usenet news, for example). Within a particular topic, there may be many contexts. Movie recommenders may operate on the web, or may operate entirely within a video rental store or as part of a set-top box or digital video recorder.

In our experience, one of the most important generic domain features to consider lies in the tradeoff between desire for novelty and desire for high quality. In certain domains, the user goal is dominated by finding recommendations for things she doesn’t already know about. McNee et al. [2002] evaluated recommenders for research papers and found that users were generally happy with a set of recommendations if there was a single item in the set that appeared to be useful and that the user wasn’t already familiar with. In some ways, this matches the conventional wisdom about supermarket recommenders—it would be almost always correct, but useless, to recommend bananas, bread, milk, and eggs. The recommendations might be correct, but they don’t change the shopper’s behavior. Opposite the desire for novelty is the desire for high quality. Intuitively, this end of the tradeoff reflects the user’s desire to rely heavily upon the recommendation for a consumption decision, rather than simply as one decision-support factor among many. At the extreme, the availability of high-confidence recommendations could enable automatic purchase decisions such as personalized book- or music-of-the-month clubs. Evaluations of recommenders for this task must evaluate the success of high-confidence recommendations, and perhaps consider the opportunity costs of excessively low confidence.

Another important domain feature is the cost/benefit ratio faced by users in the domain from which items are being recommended. In the video recommender domain, the cost of false positives is low ($3 and two to three hours of