your evening), the cost of false negatives is almost zero, and the benefit of recommendations is huge (an enormous quantity of movies have been released over the years, and browsing in the video store can be quite stressful—particularly for families). This analysis explains to a large extent why video recommenders have been so successful. Other domains with similar domain features, such as books of fiction, are likely to have datasets similar to the video domain, and results demonstrated on video data are likely to translate reasonably well to those other domains (although books of fiction are likely to have different sample features—see below). See Konstan et al. [1997] for a slightly more detailed discussion of cost/benefit tradeoff analysis in collaborative filtering recommender systems.

Another subtle but important domain feature is the granularity of true user preferences. How many different levels of true user preference exist? With binary preferences, users only care to distinguish between good and bad items ("I don't necessarily need the best movie, only a movie I will enjoy"). In such a case, distinguishing among good items is not important, nor is distinguishing among bad items. Note that the granularity of user preference may be different from the range and granularity of the ratings (which is an inherent feature of data sets). Users may rank movies on a 1–10 scale, but then only really care if recommendations are good (I had a good time watching the movie) or bad (I was bored out of my mind!).

Overall, it would probably be a mistake to evaluate an algorithm on data with significantly different domain features. In particular, it is very important that the tasks your algorithm is designed to support are similar to the tasks supported by the system from which the data was collected. If the user tasks are mismatched, then there are likely to be many other feature mismatches. For example, the MovieLens system supported primarily the Find Good Items user task. As a result, the user was always shown the "best bets" and thus there are many more ratings for good items than bad items (the user had to explicitly request to rate a poor item in most cases). So MovieLens data is less likely to have many ratings for less popular items. It would probably be inappropriate to use this data to evaluate a new algorithm whose goal was to support Annotation In Context. Of course, if an algorithm is being proposed for general use, it is best to select data sets that span a variety of topics and contexts.

Inherent features include several features about ratings:

(a) whether ratings are explicit, implicit, or both;
(b) the scale on which items are rated;
(c) the dimensions of rating; and
(d) the presence or absence of a timestamp on ratings.

Explicit ratings are entered by a user directly (i.e., "Please rate this on a scale of 1–5"), while implicit ratings are inferred from other user behavior. For example, a music recommender may use implicit data such as play lists or music listened to, or it may use explicit scores for songs or artists, or a combination of both.
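As a purely illustrative sketch (not part of the original study; all identifiers and field names are hypothetical), one way to think about inherent features (a)–(d) is as attributes of a single rating record:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Rating:
    """One rating record, capturing inherent features (a)-(d) discussed above."""
    user_id: str
    item_id: str
    value: float                       # position on the rating scale (b)
    scale: Tuple[float, float]         # (min, max) of the scale, e.g. (1, 5)
    explicit: bool                     # (a) True if entered directly, False if inferred
    dimension: str = "overall"         # (c) e.g. "overall", "story", "acting"
    timestamp: Optional[int] = None    # (d) seconds since epoch, if recorded

# An explicit 1-5 rating with a timestamp...
r1 = Rating("u42", "movie7", value=4, scale=(1, 5), explicit=True, timestamp=1_041_379_200)
# ...and an implicit signal inferred from listening behavior (no natural scale).
r2 = Rating("u42", "song13", value=1, scale=(1, 1), explicit=False, dimension="listened")
```

Real data sets rarely expose all four features at once; the point of the sketch is only that each feature changes what an evaluation can legitimately measure.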
The ratings scale is the range and granularity of ratings. The simplest scale is unary: liked items are marked, and all others are unknown. Unary is common in commerce applications, where all that is known is whether the user purchased an item or not. We call the data unary instead of binary because a lack of purchase of item X does not necessarily mean that the user would not like X. Binary ratings include a separate designation for disliked items. Systems that operate on explicit ratings often support 5-point, 7-point, or 100-point scales.

While most recommenders have had only a single rating dimension (described by Miller et al. [1997] as "what predictions should we have displayed for this item?"), both research and commercial systems are exploring systems where users can enter several ratings for a single item. Zagat's restaurant guides, for example, traditionally use food, service, and décor as three independent dimensions. Movie recommenders may separate story, acting, and special effects. Data sets with multiple dimensions are still difficult to find, but we expect several to become available in the future. Timestamps are a property of the data collection, and are particularly important in areas where user tastes are expected to change or where user reactions to items likely depend on their history of interaction with other items.

Other inherent features concern the data collection practices:

(e) whether the recommendations displayed to the user were recorded; and
(f) the availability of user demographic information or item content information.

Unfortunately, few datasets recorded the recommendations that were displayed, making it difficult to retrospectively isolate, for example, ratings that could not have been biased by previously displayed predictions. Some logs may keep time-stamped queries, which could be used to reconstruct recommendations if the algorithm is known and fully deterministic. The availability of demographic data varies with the specific system, and with the specific data collected. The EachMovie and MovieLens datasets both collected limited demographics. Researchers speculate, however, that a large percentage of the demographic answers may be false (based on user suspicion of "marketing questions"). We would expect greater reliability for demographic data that users believe actually serves a constructive purpose in the recommender (either for recommendation or for related purposes). A film recommender that uses zip code to narrow the theater search, such as Miller et al.'s [2003] MovieLens Unplugged, seems more likely to provide meaningful data.

Finally, we consider:

(g) the biases involved in data collection.

Most data sets have biases based on the mechanism by which users have the opportunity to rate items. For example, Jester [Goldberg et al. 2001] asked all users to rate the same initial jokes, creating a set of dense ratings for those jokes which would not otherwise occur. MovieLens has experimented with different methods to select items to have the first-time user rate before using the recommender system [Rashid et al. 2002], and in the process demonstrated that each method leads to a different bias in initial ratings.
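The distinction between unary, binary, and fine-grained explicit scales, and the earlier point about the granularity of true preference, can be made concrete with a small hedged sketch. The data and the threshold below are invented for illustration and do not come from any of the datasets cited in this article:

```python
# Hypothetical data: purchase events (unary signal) and explicit 1-10 ratings.
purchases = {("u1", "itemA"), ("u1", "itemC"), ("u2", "itemB")}
explicit_ratings = {("u1", "itemA"): 9, ("u1", "itemD"): 3, ("u2", "itemB"): 6}

def unary(user, item):
    """Unary data: 1 if the item was purchased; otherwise simply unknown (None).
    Absence of a purchase is not evidence of dislike."""
    return 1 if (user, item) in purchases else None

def to_binary_preference(rating_1_to_10, threshold=7):
    """Collapse a fine-grained explicit rating to the coarser good/bad
    granularity of true preference (the threshold is an arbitrary choice)."""
    return "good" if rating_1_to_10 >= threshold else "bad"

print(unary("u1", "itemA"))                                      # 1
print(unary("u2", "itemA"))                                      # None (unknown, not dislike)
print(to_binary_preference(explicit_ratings[("u1", "itemA")]))   # good
```

The sketch shows why evaluating on unary purchase data differs from evaluating on explicit scales: in the unary case there are no observed negatives to score against.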
Sample features include many of the statistical properties that are commonly considered in evaluating a data set:

(a) the density of the ratings set overall, sometimes measured as the average percentage of items that have been rated per user; since many datasets have uneven popularity distributions, density may be artificially manipulated by including or excluding items;
(b) the number or density of ratings from the users for whom recommendations are being made, which represents the experience of the user in the system at the time of recommendation; ratings from users with significant experience can be withheld to simulate the condition when they were new users; and
(c) the general size and distribution properties of the data set—some data sets have more items than users, though most data sets have many more users than items.

Each of these sample features can have a substantial effect on the success of different algorithms, and can reflect specific policies of the recommender. Density (both individual and overall) reflects both the overall size of the recommender's item space and the degree to which users have explored it. One policy decision that significantly affects density is the level of rating required to participate in the community. Systems that either require an extensive level of start-up rating or require recurring ratings to maintain membership or status levels will generally have greater density than low-commitment recommenders in the same domain. Density also interacts with the type of rating—implicit ratings are likely to lead to greater density, since less effort is needed by the user. Finally, systems that allow automated software "agents" to participate may have a significantly higher density than other systems, even if the underlying item space is similar (see, e.g., Good et al. [1999]). Because software agents are not limited in attention, they can rate much more extensively than humans.

Two particular distribution properties are known to be highly important. The relationship between the numbers of users and numbers of items can determine whether it is easier to build correlations among users or among items—this choice can lead to different relative performance among algorithms. The ratings distribution of items and users also may affect both algorithm and evaluation metric choice. Systems where there is an exponential popularity curve (some items have exponentially more ratings than others) may be able to find agreement among people or items in the dense subregion and use that agreement to recommend in the sparse space. (Jester, mentioned above, does this directly by creating a highly dense region of jokes rated by all users.) Systems with a more even ratings distribution may be more challenged to cope with sparsity unless they incorporate dimensionality reduction techniques.

To complete the discussion of domain features, inherent features, and sample features, it is important to note that there are significant interactions between these categories of features. For example, the type of task supported by a recommender system (a domain feature) will significantly affect the distribution of ratings collected (a sample feature). However, each of these features represents a dimension which may be useful in explaining differences in evaluation results.
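To make sample features (a)–(c) above concrete, the following minimal sketch (hypothetical toy data, assuming the ratings are available as simple (user, item, rating) triples) computes overall density, per-user rating counts, and the item popularity distribution whose shape indicates whether a data set has an exponential popularity curve:

```python
from collections import Counter

# Hypothetical triples; a real data set would be loaded from disk.
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
           ("u2", "i3", 2), ("u3", "i1", 5)]

users = {u for u, _, _ in ratings}
items = {i for _, i, _ in ratings}

# (a) Overall density: fraction of the user x item matrix that is filled.
density = len(ratings) / (len(users) * len(items))

# (b) Experience of each user: how many ratings they have contributed.
ratings_per_user = Counter(u for u, _, _ in ratings)

# (c) Size and distribution: user/item counts and item popularity
# (a steep drop-off here would suggest an exponential popularity curve).
item_popularity = Counter(i for _, i, _ in ratings).most_common()

print(f"density = {density:.2f}")   # 0.56 for this toy example
print(dict(ratings_per_user))       # {'u1': 2, 'u2': 2, 'u3': 1}
print(item_popularity)              # [('i1', 3), ('i2', 1), ('i3', 1)]
```

Reporting these statistics alongside evaluation results makes it easier to judge whether findings on one data set should be expected to transfer to another.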
Evaluation of a recommender algorithm on a data set with features that conflict with the end goal of the recommender algorithm could still be useful. By explicitly identifying the features that conflict, we can reason about whether those conflicts will unreasonably bias the evaluation results.

3.4 Past and Current Trends in Datasets

The most widely used common dataset was the EachMovie Dataset (http://research.compaq.com/SRC/eachmovie/). This extensive dataset has over 2.8 million ratings from over 70,000 users, and it includes information such as timestamps and basic demographic data for some of the users. In addition to seeding our MovieLens system (http://www.movielens.org), the EachMovie Dataset was used in dozens of machine learning and algorithmic research projects to study new and potentially better ways to predict user ratings. Examples include Canny's [2002] factor analysis algorithm, Domingos and Richardson's [2003] algorithm for computing network value, and Pennock et al.'s [2000] work on recommending through personality diagnosis algorithms.

Extracts (100,000 ratings and 1 million ratings) of the MovieLens dataset have also been released for research use; these extracts have been used by several researchers, including Schein et al. [2001] in their investigation of cold-start recommendations, Sarwar et al. [2001] in their evaluation of item-based algorithms, Reddy et al. [2002] in their community extraction research, and Mui et al. [2001] in their work on "collaborative sanctioning."

More recently, several researchers have been using the Jester dataset, which was collected from the Jester joke recommendation website [Goldberg et al. 2001]. Statistically, the Jester dataset has different characteristics from the MovieLens and EachMovie data. First of all, there is a set of training items (jokes) that are rated by every single user, providing complete data on that subset of items. Second, in the Jester user interface, the user clicks on an unlabeled scale bar to rate a joke, so the ratings are much less discrete and may suffer from different kinds of biases, since it is hard for the user to intentionally create a ranking among their rated items.

The majority of publications related to collaborative filtering recommender algorithms have used one of the three data sets described above. A few other data sets have been used, but most of them are not publicly available for verification. The lack of variety in publicly available collaborative filtering data sets (particularly with significant numbers of ratings) remains one of the most significant challenges in the field. Most researchers do not have the resources to build production-quality systems that are capable of collecting enough data to validate research hypotheses, and thus are often forced to constrain their research to hypotheses that can be explored using the few existing datasets.

With the maturation of collaborative filtering recommender technology, more live systems have been built that incorporate recommender algorithms. As a result, we have recently seen an increased number of studies that have used live systems. Herlocker's explanation experiments [Herlocker et al. 2000] explored the use of 23 different graphical displays to "explain" why each recommendation was given. Schafer's MetaLens [Schafer et al. 2002] was built to incorporate
MovieLens and other systems into a new interface; his evaluation focused entirely on the interface and user experience. Other recent work has combined different evaluations. Our work on "value of information" [Rashid et al. 2002] leads users through different sign-up processes, and then evaluates both the quality of resulting predictions and the subjective user experience.

In the near future, we expect to see a lot more results from live experiments, as recommender algorithms make their way into more production systems. We also hope that new datasets will be released with data from new domains, causing new explosions in collaborative filtering recommender algorithm research similar to what happened with the release of the EachMovie data.

4. ACCURACY METRICS

Establishing the user tasks to be supported by a system, and selecting a data set on which performance can be measured, together enable empirical experimentation: scientifically repeatable evaluations of recommender system utility. A majority of the published empirical evaluations of recommender systems to date have focused on the evaluation of a recommender system's accuracy. We assume that if a user could examine all items available, they could place those items in an ordering of preference. An accuracy metric empirically measures how closely a recommender system's predicted ranking of items for a user matches the user's true ranking of preference. Accuracy metrics may also measure how well a system can predict an exact rating value for a specific item.

Researchers who want to quantitatively compare the accuracy of different recommender systems must first select one or more metrics. In selecting a metric, researchers face a range of questions. Will a given metric measure the effectiveness of a system with respect to the user tasks for which it was designed? Are results with the chosen metric comparable to other published research work in the field? Are the assumptions that a metric is based on true? Will a metric be sensitive enough to detect real differences that exist? How large a difference does there have to be in the value of a metric for a statistically significant difference to exist? Complete answers to these questions have not yet been substantially addressed in the published literature.

The challenge of selecting an appropriate metric is compounded by the large diversity of published metrics that have been used to quantitatively evaluate the accuracy of recommender systems. This lack of standardization is damaging to the progress of knowledge related to collaborative filtering recommender systems. With no standardized metrics within the field, researchers have continued to introduce new metrics when they evaluate their systems. With a large diversity of evaluation metrics in use, it becomes difficult to compare results from one publication to the results in another publication. As a result, it becomes hard to integrate these diverse publications into a coherent body of knowledge regarding the quality of recommender system algorithms.

To address these challenges, we examine the advantages and disadvantages of past metrics with respect to the user tasks and data set features that have been introduced in Sections 2 and 3. We follow up the conceptual discussion of advantages and disadvantages with empirical results comparing the