Evaluating Collaborative Filtering Recommender Systems

JONATHAN L. HERLOCKER, Oregon State University
JOSEPH A. KONSTAN, LOREN G. TERVEEN, and JOHN T. RIEDL, University of Minnesota

Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalency class were strongly correlated, while metrics from different equivalency classes were uncorrelated.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness)

General Terms: Experimentation, Measurement, Performance

Additional Key Words and Phrases: Collaborative filtering, recommender systems, metrics, evaluation

This research was supported by the National Science Foundation (NSF) under grants DGE 95-54517, IIS 96-13960, IIS 97-34442, IIS 99-78717, IIS 01-02229, and IIS 01-33994, and by Net Perceptions, Inc.

Authors' addresses: J. L. Herlocker, School of Electrical Engineering and Computer Science, Oregon State University, 102 Dearborn Hall, Corvallis, OR 97331; email: herlock@cs.orst.edu; J. A. Konstan, L. G. Terveen, and J. T. Riedl, Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, MN 55455; email: {konstan, terveen, riedl}@cs.umn.edu.

1. INTRODUCTION

Recommender systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [Resnick and Varian 1997]. One of
the most successful technologies for recommender systems, called collaborative filtering, has been developed and improved over the past decade to the point where a wide variety of algorithms exist for generating recommendations. Each algorithmic approach has adherents who claim it to be superior for some purpose. Clearly identifying the best algorithm for a given purpose has proven challenging, in part because researchers disagree on which attributes should be measured, and on which metrics should be used for each attribute. Researchers who survey the literature will find over a dozen quantitative metrics and additional qualitative evaluation techniques.

Evaluating recommender systems and their algorithms is inherently difficult for several reasons. First, different algorithms may be better or worse on different data sets. Many collaborative filtering algorithms have been designed specifically for data sets where there are many more users than items (e.g., the MovieLens data set has 65,000 users and 5,000 movies). Such algorithms may be entirely inappropriate in a domain where there are many more items than users (e.g., a research paper recommender with thousands of users but tens or hundreds of thousands of articles to recommend). Similar differences exist for ratings density, ratings scale, and other properties of data sets.
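To make these data set properties concrete, here is a minimal sketch (ours, not the authors'; the counts are hypothetical and only loosely inspired by the MovieLens figures quoted above) of the kind of quick check an evaluator might run before selecting a data set: the user-to-item ratio and the ratings density, that is, the fraction of the user-item matrix that actually contains ratings.

```python
# Illustrative sketch: basic data set properties an evaluator might inspect.
# The numbers below are hypothetical and are not taken from any actual data set.

def dataset_properties(num_users, num_items, num_ratings):
    """Return the user/item ratio and the ratings density (fraction of the
    user-item matrix that is filled in)."""
    user_item_ratio = num_users / num_items
    density = num_ratings / (num_users * num_items)
    return user_item_ratio, density

if __name__ == "__main__":
    ratio, density = dataset_properties(num_users=65_000,
                                        num_items=5_000,
                                        num_ratings=5_000_000)  # hypothetical rating count
    print(f"users per item: {ratio:.1f}")      # 13.0 for these numbers
    print(f"ratings density: {density:.4%}")   # about 1.54% for these numbers
```

An algorithm designed for the first regime (many more users than items) may behave quite differently when the ratio is inverted or the matrix is far sparser, which is why these properties should be matched to the target domain when choosing evaluation data.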
The second reason that evaluation is difficult is that the goals for which an evaluation is performed may differ. Much early evaluation work focused specifically on the "accuracy" of collaborative filtering algorithms in "predicting" withheld ratings. Even early researchers recognized, however, that when recommenders are used to support decisions, it can be more valuable to measure how often the system leads its users to wrong choices. Shardanand and Maes [1995] measured "reversals"—large errors between the predicted and actual rating; we have used the signal-processing measure of the Receiver Operating Characteristic curve [Swets 1963] to measure a recommender's potential as a filter [Konstan et al. 1997]. Other work has speculated that there are properties different from accuracy that have a larger effect on user satisfaction and performance. A range of research and systems have looked at measures including the degree to which the recommendations cover the entire set of items [Mobasher et al. 2001], the degree to which recommendations made are nonobvious [McNee et al. 2002], and the ability of recommenders to explain their recommendations to users [Sinha and Swearingen 2002]. A few researchers have argued that these issues are all details, and that the bottom-line measure of recommender system success should be user satisfaction. Commercial systems measure user satisfaction by the number of products purchased (and not returned!), while noncommercial systems may just ask users how satisfied they are.

Finally, there is a significant challenge in deciding what combination of measures to use in comparative evaluation. We have noticed a trend recently—many researchers find that their newest algorithms yield a mean absolute error of 0.73 (on a five-point rating scale) on movie rating datasets. Though the new algorithms often appear to do better than the older algorithms they are compared to, we find that when each algorithm is tuned to its optimum, they all produce similar measures of quality. We—and others—have speculated that we may be reaching some "magic barrier" where natural variability may prevent us from getting much more accurate. In support of this, Hill et al. [1995] have shown that users provide inconsistent ratings when asked to rate the same movie at different times. They suggest that an algorithm cannot be more accurate than the variance in a user's ratings for the same item. Even when accuracy differences are measurable, they are usually tiny. On a five-point rating scale, are users sensitive to a change in mean absolute error of 0.01?
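For readers who want the arithmetic behind these accuracy figures, the following sketch (our illustration with made-up ratings on a five-point scale, not the authors' code) computes mean absolute error and counts Shardanand-and-Maes-style "reversals"; the two-point reversal threshold is an assumption made for illustration, not a value taken from their work.

```python
# Minimal sketch: mean absolute error and a simple "reversal" count on
# hypothetical predicted/actual rating pairs (1-5 scale).

def mean_absolute_error(predicted, actual):
    """MAE = average of |prediction - true rating| over all test ratings."""
    errors = [abs(p, ) if False else abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

def reversal_count(predicted, actual, threshold=2.0):
    """Count large errors, in the spirit of Shardanand and Maes' reversals.
    The threshold is an illustrative assumption."""
    return sum(1 for p, a in zip(predicted, actual) if abs(p - a) >= threshold)

if __name__ == "__main__":
    predicted = [4.2, 3.1, 2.5, 4.8, 1.9, 3.6]   # hypothetical system output
    actual    = [4.0, 4.0, 1.0, 5.0, 4.0, 3.0]   # hypothetical withheld ratings
    print(f"MAE: {mean_absolute_error(predicted, actual):.2f}")        # 0.92 here
    print(f"reversals (error >= 2): {reversal_count(predicted, actual)}")  # 1 here
```

On this scale, a difference of 0.01 in MAE corresponds to a hundredth of a rating point averaged over the whole test set, which is the intuition behind the question of whether users would even notice such a change.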
These observations suggest that algorithmic improvements in collaborative filtering systems may come from different directions than just continued improvements in mean absolute error. Perhaps the best algorithms should be measured in accordance with how well they can communicate their reasoning to users, or with how little data they can yield accurate recommendations. If this is true, new metrics will be needed to evaluate these new algorithms.

This article presents six specific contributions towards evaluation of recommender systems.

(1) We introduce a set of recommender tasks that categorize the user goals for a particular recommender system.
(2) We discuss the selection of appropriate datasets for evaluation. We explore when evaluation can be completed off-line using existing datasets and when it requires on-line experimentation. We briefly discuss synthetic data sets and more extensively review the properties of datasets that should be considered in selecting them for evaluation.
(3) We survey evaluation metrics that have been used to evaluate recommender systems in the past, conceptually analyzing their strengths and weaknesses.
(4) We report on experimental results comparing the outcomes of a set of different accuracy evaluation metrics on one data set. We show that the metrics collapse roughly into three equivalence classes.
(5) By evaluating a wide set of metrics on a dataset, we show that for some datasets, while many different metrics are strongly correlated, there are classes of metrics that are uncorrelated (a sketch illustrating this kind of metric-correlation analysis follows this list).
(6) We review a wide range of nonaccuracy metrics, including measures of the degree to which recommendations cover the set of items, the novelty and serendipity of recommendations, and user satisfaction and behavior in the recommender system.
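As a rough illustration of the methodology behind contributions (4) and (5), the sketch below (hypothetical recommenders, ratings, and metric choices; not the authors' actual experiment) scores three fictitious algorithms with three metrics and then correlates the metric values across the algorithms. Metrics whose values rise and fall together would fall into the same equivalence class; uncorrelated metrics measure genuinely different things.

```python
# Hypothetical illustration of a metric-correlation analysis. All data are made up.
from math import sqrt

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred, true):
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def hit_rate(pred, true, like=4.0):
    """Fraction of truly liked items (true rating >= like) also predicted as liked."""
    liked = [(p, t) for p, t in zip(pred, true) if t >= like]
    return sum(1 for p, _ in liked if p >= like) / len(liked)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Fictitious predictions from three "algorithms" on the same withheld ratings.
true_ratings = [5, 3, 4, 2, 5, 1, 4, 3]
predictions = {
    "algo_A": [4.5, 3.2, 3.8, 2.5, 4.9, 1.4, 4.1, 2.9],
    "algo_B": [4.0, 2.8, 4.4, 1.8, 4.2, 2.0, 3.6, 3.3],
    "algo_C": [3.9, 3.5, 3.1, 2.9, 4.1, 2.5, 3.4, 3.1],
}

maes  = [mae(p, true_ratings) for p in predictions.values()]
rmses = [rmse(p, true_ratings) for p in predictions.values()]
hits  = [hit_rate(p, true_ratings) for p in predictions.values()]

print("MAE vs RMSE correlation:    ", round(pearson(maes, rmses), 2))
print("MAE vs hit-rate correlation:", round(pearson(maes, hits), 2))
```

The study reported in Section 4 covers a much wider set of metrics and far more data than this toy example; the sketch only shows the shape of the analysis.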
Throughout our discussion, we separate out our review of what has been done before in the literature from the introduction of new tasks and methods. We expect that the primary audience of this article will be collaborative filtering researchers who are looking to evaluate new algorithms against previous research and collaborative filtering practitioners who are evaluating algorithms before deploying them in recommender systems.

There are certain aspects of recommender systems that we have specifically left out of the scope of this paper. In particular, we have decided to avoid the large area of marketing-inspired evaluation. There is extensive work on evaluating marketing campaigns based on such measures as offer acceptance and sales lift [Rogers 2001]. While recommenders are widely used in this area, we cannot add much to existing coverage of this topic. We also do not address general usability evaluation of the interfaces. That topic is well covered in the research and practitioner literature (e.g., Helander [1988] and Nielsen [1994]). We have also chosen not to discuss the computational performance of recommender algorithms. Such performance is certainly important, and in the future we expect there to be work on the quality of time-limited and memory-limited recommendations. This area is just emerging, however (see for example Miller et al.'s recent work on recommendation on handheld devices [Miller et al. 2003]), and there is not yet enough research to survey and synthesize. Finally, we do not address the emerging question of the robustness and transparency of recommender algorithms. We recognize that recommender system robustness to manipulation by attacks (and transparency that discloses manipulation by system operators) is important, but substantially more work needs to occur in this area before there will be accepted metrics for evaluating such robustness and transparency.

The remainder of the article is arranged as follows:

—Section 2. We identify the key user tasks from which evaluation methods have been determined and suggest new tasks that have not been evaluated extensively.
—Section 3. A discussion regarding the factors that can affect selection of a data set on which to perform evaluation.
—Section 4. An investigation of metrics that have been used in evaluating the accuracy of collaborative filtering predictions and recommendations. Accuracy has been by far the most commonly published evaluation method for collaborative filtering systems. This section also includes the results from an empirical study of the correlations between metrics.
—Section 5. A discussion of metrics that evaluate dimensions other than accuracy. In addition to covering the dimensions and methods that have been used in the literature, we introduce new dimensions on which we believe evaluation should be done.
—Section 6. Final conclusions, including a list of areas where we feel future work is particularly warranted.

Sections 2–5 are ordered to discuss the steps of evaluation in roughly the order that we would expect an evaluator to take. Thus, Section 2 describes the selection of appropriate user tasks, Section 3 discusses the selection of a dataset, and Sections 4 and 5 discuss the alternative metrics that may be applied to the dataset chosen. We begin with the discussion of user tasks—the user task sets the entire context for evaluation.

2. USER TASKS FOR RECOMMENDER SYSTEMS

To properly evaluate a recommender system, it is important to understand the goals and tasks for which it is being used. In this article, we focus on end-user goals and tasks (as opposed to goals of marketers and other system stakeholders). We derive these tasks from the research literature and from deployed systems. For each task, we discuss its implications for evaluation. While the tasks we've identified are important ones, based on our experience in recommender systems research and from our review of published research, we recognize that
the list is necessarily incomplete. As researchers and developers move into new recommendation domains, we expect they will find it useful to supplement this list and/or modify these tasks with domain-specific ones. Our goal is primarily to identify domain-independent task descriptions to help distinguish among different evaluation measures.

We have identified two user tasks that have been discussed at length within the collaborative filtering literature:

Annotation in Context. The original recommendation scenario was filtering through structured discussion postings to decide which ones were worth reading. Tapestry [Goldberg et al. 1992] and GroupLens [Resnick et al. 1994] both applied this to already structured message databases. This task required retaining the order and context of messages, and accordingly used predictions to annotate messages in context. In some cases the "worst" messages were filtered out. This same scenario, which uses a recommender in an existing context, has also been used by web recommenders that overlay prediction information on top of existing links [Wexelblat and Maes 1999]. Users use the displayed predictions to decide which messages to read (or which links to follow), and therefore the most important factor to evaluate is how successfully the predictions help users distinguish between desired and undesired content. A major factor is whether the recommender can generate predictions for the items that the user is viewing.

Find Good Items. Soon after Tapestry and GroupLens, several systems were developed with a more direct focus on actual recommendation. Ringo [Shardanand and Maes 1995] and the Bellcore Video Recommender [Hill et al. 1995] both provided interfaces that would suggest specific items to their users, providing users with a ranked list of the recommended items, along with predictions for how much the users would like them. This is the core recommendation task and it recurs in a wide variety of research and commercial systems. In many commercial systems, the "best bet" recommendations are shown, but the predicted rating values are not.

While these two tasks can be identified quite generally across many different domains, there are likely to be many specializations of the above tasks within each domain. We introduce some of the characteristics of domains that influence those specializations in Section 3.3.

While Annotation in Context and Find Good Items are overwhelmingly the most commonly evaluated tasks in the literature, there are other important generic tasks that are not well described in the research literature. Below we describe several other user tasks that we have encountered in our interviews with users and our discussions with recommender system designers. We mention these tasks because we believe that they should be evaluated, but because they have not been addressed in the recommender systems literature, we do not discuss them further.

Find All Good Items. Most recommender tasks focus on finding some good items. This is not surprising; the problem that led to recommender systems was one of overload, and most users seem willing to live with overlooking some