performance of different metrics when applied to results from one class of algorithm in one domain. The empirical results demonstrate that some conceptual differences among accuracy evaluation metrics can be more significant than others.

4.1 Evaluation of Previously Used Metrics

Recommender system accuracy has been evaluated in the research literature since 1994 [Resnick et al. 1994]. Many of the published evaluations of recommender systems used different metrics. We will examine some of the most popular metrics used in those publications, identifying the strengths and the weaknesses of the metrics. We broadly classify recommendation accuracy metrics into three classes: predictive accuracy metrics, classification accuracy metrics, and rank accuracy metrics.

4.1.1 Predictive Accuracy Metrics. Predictive accuracy metrics measure how close the recommender system's predicted ratings are to the true user ratings. Predictive accuracy metrics are particularly important for evaluating tasks in which the predicted rating will be displayed to the user, such as Annotation in Context. For example, the MovieLens movie recommender [Dahlen et al. 1998] predicts the number of stars that a user will give each movie and displays that prediction to the user. Predictive accuracy metrics will evaluate how close MovieLens' predictions are to the user's true number of stars given to each movie. Even if a recommender system was able to correctly rank a user's movie recommendations, the system could fail if the predicted ratings it displays to the user are incorrect.¹ Because the predicted rating values create an ordering across the items, predictive accuracy can also be used to measure the ability of a recommender system to rank items with respect to user preference. On the other hand, evaluators who wish to measure predictive accuracy are necessarily limited to a metric that computes the difference between the predicted rating and the true rating, such as mean absolute error.

¹This is a primary reason that many implementations of recommender systems in a commercial setting only display a recommended-items list and do not display predicted values.

Mean Absolute Error and Related Metrics. Mean absolute error (often referred to as MAE) measures the average absolute deviation between a predicted rating and the user's true rating. Mean absolute error (Eq. (1)) has been used to evaluate recommender systems in several cases [Breese et al. 1998, Herlocker et al. 1999, Shardanand and Maes 1995].

$|E| = \frac{\sum_{i=1}^{N} |p_i - r_i|}{N}$    (1)

Mean absolute error may be less appropriate for tasks such as Find Good Items where a ranked result is returned to the user, who then only views items at the top of the ranking. For these tasks, users may only care about errors in items that are ranked high, or that should be ranked high. It may be unimportant how accurate predictions are for items that the system correctly knows the user will have no interest in. Mean absolute error may also be less appropriate when the granularity of true preference (a domain feature) is small, since errors will only affect the task if they result in erroneously classifying a good item as a bad one or vice versa; for example, if 3.5 stars is the cut-off between good and bad, then a one-star error that predicts a 4 as 5 (or a 3 as 2) makes no difference to the user.

Beyond measuring the accuracy of the predictions at every rank, there are two other advantages to mean absolute error. First, the mechanics of the computation are simple and easy to understand. Second, mean absolute error has well-studied statistical properties that provide for testing the significance of a difference between the mean absolute errors of two systems.

Three measures related to mean absolute error are mean squared error, root mean squared error, and normalized mean absolute error. The first two variations square the error before summing it, placing more emphasis on large errors. For example, an error of one point increases the sum of error by one, but an error of two points increases the sum by four. The third related measure, normalized mean absolute error [Goldberg et al. 2001], is mean absolute error normalized with respect to the range of rating values, in theory allowing comparison between prediction runs on different datasets (although the utility of this has not yet been investigated).

In addition to mean absolute error across all predicted ratings, Shardanand and Maes [1995] measured mean absolute error separately over items to which users gave extreme ratings. They partitioned their items into two groups based on user rating (a scale of 1 to 7): items rated below three or greater than five were considered extremes. The intuition was that users would be much more aware of a recommender system's performance on items that they felt strongly about. From Shardanand and Maes' results, the mean absolute error of the extremes provides a different ranking of algorithms than the normal mean absolute error. Measuring the mean absolute error of the extremes can be valuable. However, unless users are concerned only with how their extremes are predicted, it should not be used in isolation.
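To make Eq. (1) and its relatives concrete, the following is a minimal sketch of how these error measures can be computed over paired lists of predicted and true ratings. It is an illustration only: the function names, the plain-Python data layout, and the example ratings are our own assumptions rather than part of any system discussed above.

```python
from math import sqrt

def mae(predicted, actual):
    """Mean absolute error (Eq. (1)): average absolute deviation between
    predicted ratings p_i and true ratings r_i over N rated items."""
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(actual)

def mse(predicted, actual):
    """Mean squared error: squaring the errors before averaging places
    more emphasis on large errors."""
    return sum((p - r) ** 2 for p, r in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error: the square root of MSE, expressed back
    on the rating scale."""
    return sqrt(mse(predicted, actual))

def nmae(predicted, actual, rating_min, rating_max):
    """Normalized mean absolute error [Goldberg et al. 2001]: MAE divided
    by the rating range, intended to allow comparison across datasets
    that use different rating scales."""
    return mae(predicted, actual) / (rating_max - rating_min)

# Hypothetical predictions and true ratings on a 1-5 star scale.
predictions  = [4.0, 2.5, 5.0, 3.0]
true_ratings = [3.0, 2.0, 5.0, 1.0]

print(mae(predictions, true_ratings))              # 0.875
print(mse(predictions, true_ratings))              # 1.3125
print(rmse(predictions, true_ratings))             # ~1.146
print(nmae(predictions, true_ratings, 1.0, 5.0))   # 0.21875
```

Note how the two-point error in the last pair contributes 2 to the absolute-error sum but 4 to the squared-error sum, which is why the squared-error variants penalize large errors more heavily.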
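The extremes-oriented evaluation of Shardanand and Maes [1995] described above can be sketched in the same style. The snippet below is a hypothetical illustration, assuming a 1-to-7 rating scale and treating items whose true rating is below three or greater than five as extremes, as in the text; the function name and parameterized thresholds are our own, not their original implementation.

```python
def extremes_mae(predicted, actual, low=3.0, high=5.0):
    """Mean absolute error restricted to items the user rated at the
    extremes of the scale (true rating below `low` or above `high`),
    in the spirit of Shardanand and Maes [1995] on a 1-7 scale."""
    pairs = [(p, r) for p, r in zip(predicted, actual) if r < low or r > high]
    if not pairs:
        return None  # no extreme ratings available to evaluate
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

# Hypothetical ratings on a 1-7 scale: only the items with true ratings
# 1, 2, 6, and 7 count toward the extremes error.
predictions  = [2.0, 4.0, 5.5, 3.0, 6.5, 4.0]
true_ratings = [1.0, 4.0, 6.0, 2.0, 7.0, 5.0]

print(extremes_mae(predictions, true_ratings))  # (1 + 0.5 + 1 + 0.5) / 4 = 0.75
```

As the text notes, such an extremes-only error is best reported alongside, rather than instead of, the overall mean absolute error.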
4.1.2 Classification Accuracy Metrics. Classification metrics measure the frequency with which a recommender system makes correct or incorrect decisions about whether an item is good. Classification metrics are thus appropriate for tasks such as Find Good Items when users have true binary preferences.

When applied to nonsynthesized data in offline experiments, classification accuracy metrics may be challenged by data sparsity. The problem occurs when the collaborative filtering system being evaluated is generating a list of top recommended items. When the quality of the list is evaluated, recommendations may be encountered that have not been rated. How those items are treated in the evaluation can lead to certain biases.

One approach to evaluation using sparse data sets is to ignore recommendations for items for which there are no ratings. The recommendation list is first processed to remove all unrated items. The recommendation task has been altered to "predict the top recommended items that have been rated." In tasks where the user only observes the top few recommendations, this could lead to inaccurate evaluations of recommendation systems with respect to the user's