Electronic Commerce Research, 5: 51–74 (2005)
© 2005 Springer Science + Business Media, Inc. Manufactured in the Netherlands.

CROC: A New Evaluation Criterion for Recommender Systems∗

ANDREW I. SCHEIN, ALEXANDRIN POPESCUL and LYLE H. UNGAR    {ais;popescul;ungar}@cis.upenn.edu
University of Pennsylvania, Department of Computer and Information Science, Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104-6389, USA

DAVID M. PENNOCK∗∗    david.pennock@overture.com
Overture Services, Inc., 74 N. Pasadena Ave, 3rd floor, Pasadena, CA 91103, USA

∗ Portions of this work appeared earlier: [Schein et al., 32], © 2002 ACM, http://doi.acm.org/10.1145/564376.564421
∗∗ This work conducted while at NEC Laboratories America, Princeton, NJ.

Abstract

Evaluation of a recommender system algorithm is a challenging task due to the many possible scenarios in which such systems may be deployed. We have designed a new performance plot called the CROC curve with an associated statistic: the area under the curve. Our CROC curve supplements the widely used ROC curve in recommender system evaluation by discovering performance characteristics that standard ROC evaluation often ignores. Empirical studies on two domains and including several recommender system algorithms demonstrate that combining ROC and CROC curves in evaluation can lead to a more informed characterization of performance than using either curve alone.

Keywords: recommender systems, collaborative filtering, performance evaluation

1. Introduction

Recommender systems (RS) suggest items of interest to users based on their explicit and implicit preferences, the preferences of other users, and user and item attributes. Methods for conveying recommendations to users are virtually unlimited. Recommendations can be sent through email solicitation (push mechanism), through explicit user request (pull mechanism) or as a suggestion in a web page the user is viewing (push mechanism). In many applications the number of recommendations made to each user must be limited to a handful in order to prevent information fatigue among the user base. However, traditional evaluation metrics for recommender systems, including ROC curves and mean absolute error (MAE) statistics, do not take this phenomenon into account, and may reward recommender systems that skew the number of recommendations to the users who rate most frequently or most highly.
In this paper we will explore a new performance plot, called the CROC curve, which evaluates system performance in situations where each user receives the same number of recommendations. While traditional ROC curves uncover the raw performance of an RS algorithm, our CROC curve measures the ability of the RS to provide reasonable recommendations across a wide user base; thus we find significant advantages in using both CROC and ROC metrics together. ROC curves are formed by creating one large ranked list of recommendations (p_i, m_j), each pair indicating that user i likes/rates/purchases item j. Optimizing performance on an ROC curve can mean dramatically skewing recommendations to the most active users or the users who rate more highly on average. In contrast, the CROC curve creates one ranked list for each user, interleaving recommendations evenly among users. In this way recommendations are constrained in the CROC evaluation so that each user receives the same number of recommendations, though different users will receive a separate set of recommendations according to their predicted preferences.

We demonstrate our two-metric evaluation strategy by recommending web pages and movies: predicting both explicit ratings on a 1–5 scale and implicit preference information such as web site visits. In order to ascertain the value of the CROC curve, we evaluate three machine learning algorithms alongside baseline popularity-ranked recommendations. We evaluate in settings where we must recommend items nobody has rated before (i.e., recommending from a cold start) as well as in the hot-start setting. In all of these situations we pinpoint advantages and disadvantages of the different recommender system algorithms tested, leading to some surprising conclusions about RS performance. Our insights into algorithm performance demonstrate the advantage of combining ROC and CROC metrics in evaluation, compared to current practices that ignore the consequences of recommendation constraints in performance summarization.

2. Background and related work

To date, most comparisons among algorithms have been empirical or qualitative in nature [Herlocker et al., 13; Sarwar et al., 27], though some worst-case performance bounds have been derived [Freund et al., 8; Nakamura and Abe, 20], some general principles advocated [Freund et al., 8], and some fundamental limitations explicated [Pennock et al., 21]. Techniques suggested for evaluating recommender system performance include mean absolute error, receiver operator characteristic (ROC) curves, ranked list techniques [Breese et al., 3; Herlocker et al., 13; Schein et al., 31] and variants of precision/recall statistics [Sarwar et al., 27]. Our CROC curve, defined and developed in this paper, can be viewed as an ROC curve created through constraints on the recommendations. Sarwar et al. [27] define analogous constraints on precision/recall statistics. Our contribution adds to the work of Sarwar et al. by demonstrating why traditional metrics such as ROC curves and precision/recall must be adjusted for the recommender system domain; we show that ROC curves and CROC curves frequently make conflicting statements about which recommender systems are best. Breese et al. [3] propose another method to combine performance information about individual users. Their R metric averages performance over the entire user base using an exponential decay in the user's attention as the user peruses further and further down a list of recommendations.
Unlike the R rank metric, the ROC and CROC methods employed in our evaluation have no parameters that must be set prior to evaluation, such as a decay rate.

Recommender system algorithms can be classified into collaborative filtering, content filtering, or methods that combine both forms of information. Pure collaborative filtering methods [Breese et al., 3; Hill et al., 14; Konstan et al., 17; Resnick and Varian, 25; Shardanand and Maes, 34] base their recommendations on community preferences (e.g., user ratings and purchase histories), ignoring user and item attributes (e.g., demographics and product descriptions). On the other hand, pure content-based filtering or information filtering methods [Mooney and Roy, 19; Salton and McGill, 26] typically match query words or other user data with item attribute information, ignoring data from other users. Several hybrid algorithms combine both techniques [Basu et al., 1; Claypool et al., 4; Condliff et al., 6; Good et al., 10; Popescul et al., 23; Schein et al., 32]. Though "content" usually refers to descriptive words associated with an item, we use the term more generally to refer to any form of item attribute information including, for example, the list of actors in a movie.

Early recommender systems were pure collaborative filters that computed pairwise similarities among users and recommended items according to a similarity-weighted average [Resnick et al., 24; Shardanand and Maes, 34]. Breese et al. [3] refer to this class of algorithms as memory-based algorithms. Subsequent authors employed a variety of techniques for collaborative filtering, including hard-clustering users into classes [Breese et al., 3], simultaneously hard-clustering users and items [Ungar and Foster, 36], soft-clustering users and items [Hofmann and Puzicha, 16; Popescul et al., 23], singular value decomposition [Sarwar et al., 28], inferring item-item similarities [Sarwar et al., 29], probabilistic modeling [Breese et al., 3; Condliff et al., 6; Heckerman et al., 12; Pennock et al., 22; Popescul et al., 23; Schein et al., 31, 32], machine learning [Basu et al., 1; Billsus and Pazzani, 2; Nakamura and Abe, 20], and list-ranking [Cohen et al., 5; Freund et al., 8; Pennock et al., 21]. More recently, authors have turned toward designing hybrid recommender systems that combine both collaborative and content information in various ways [Basu et al., 1; Claypool et al., 4; Condliff et al., 6; Good et al., 10; Popescul et al., 23; Schein et al., 31, 32].

One difficult, though common, problem for a recommender system is the cold-start problem, where recommendations are required for items that no one (in our data set) has yet rated. Cold-start recommending is particularly hard since pure collaborative filtering approaches fail: new items have no rating history, yet rating history is the sole training information of a pure collaborative filtering system. In this work we will benchmark recommender systems in hot-start settings in addition to the new-item cold-start setting. Benchmarking in the cold-start setting requires slight changes in the experimental design, but exposes convenient theoretical properties of the CROC metric.

In the sections that follow we define the ROC and CROC curves for recommender system evaluation, and then describe the algorithms used in the empirical evaluation and the experimental setup. Results and discussion follow.
3. Evaluation metrics

In this section we define constrained and unconstrained modes of allocating recommendations and explore their consequences for the evaluation of algorithms. Formally, we define a recommendation as a pair (p, m) interpreted to mean "recommend item m to person p". We can imagine creating a ranked list out of all possible recommendations (a |P||M|-sized list) and pulling off the top n to actually recommend. There are no constraints introduced as of yet; theoretically a recommender system may choose to recommend all items to a particular user before recommending items to any other user. In fact, we will find in practice that certain algorithms will recommend more often to some users than to others if unconstrained. Alternatively, we may wish to know the performance of our system under circumstances where we recommend the same number of items to each user. We refer to these two models of recommender system usage as unconstrained and constrained recommending, respectively, and develop evaluation metrics for each case. In the constrained mode of recommending, we create |P| separate ranked lists of recommendations: one list per user. To make n total constrained recommendations we pull n/|P| recommendations off the top of each of the |P| lists. The following sections describe metrics for both unconstrained and constrained recommending.
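To make the two allocation modes concrete, the sketch below contrasts them on a small matrix of predicted scores. The user names, item names, score values and helper functions are hypothetical illustrations for this discussion, not part of the systems evaluated in the paper.

```python
from collections import defaultdict

# Hypothetical predicted scores pred(p, m): higher means "recommend sooner".
# Users, items and score values are made up purely for illustration.
scores = {
    ("alice", "m1"): 0.9, ("alice", "m2"): 0.8, ("alice", "m3"): 0.1,
    ("bob",   "m1"): 0.7, ("bob",   "m2"): 0.2, ("bob",   "m3"): 0.6,
    ("carol", "m1"): 0.3, ("carol", "m2"): 0.4, ("carol", "m3"): 0.5,
}

def unconstrained_top_n(scores, n):
    """Unconstrained mode: one global ranked list of (p, m) pairs; take the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

def constrained_top_n(scores, n):
    """Constrained mode: one ranked list per user; n/|P| recommendations each."""
    per_user = defaultdict(list)
    for (p, m), s in scores.items():
        per_user[p].append((s, m))
    k = n // len(per_user)                     # n/|P| recommendations per user
    recs = []
    for p, items in per_user.items():
        for s, m in sorted(items, reverse=True)[:k]:
            recs.append((p, m))
    return recs

print(unconstrained_top_n(scores, 3))  # may give several recommendations to one user
print(constrained_top_n(scores, 3))    # exactly one recommendation per user here
```

Note that the unconstrained selection may hand several recommendations to a single user, which is exactly the behavior the constrained evaluation rules out.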
3.1. Unconstrained evaluation: the ROC curve

For unconstrained recommending, we employ the receiver operator characteristic (ROC) curve [Swets, 35] advocated for recommender system evaluation by Herlocker et al. [13]. ROC curves are suited for tracking performance in binary classification tasks while varying a classification threshold. RS applications are cast as binary classification when we classify a person/movie pair as like/does-not-like (rating prediction) or purchased/did-not-purchase (implicit rating prediction). Instead of varying a threshold, we vary the number of top (p, m) tuples in a ranked list that we use as recommendations. ROC curves plot the false-alarm rate on the x-axis against the hit rate on the y-axis, where

hit rate = tp / (tp + fn),    false-alarm rate = fp / (fp + tn).

The number of true positives, denoted tp, is the number of positive examples we correctly identify as such. The number of false positives, denoted fp, is the number of negative examples that we misclassify as positive. The definitions for true negatives (tn) and false negatives (fn) are analogous. ROC curves have the convenient property that random recommendation is characterized by an expected plot of a forty-five degree line from the lower-left to the upper-right corner of the graph, and perfect performance is characterized by a horizontal line one unit above the origin. A useful statistic for summarizing the curve is the area under the curve: perfect performance is characterized by an area of 1.0 and random performance by an expected area of 0.5.

ROC curves are constructed in the following manner:

1. Order the predictions pred(p_i, m_j) in a list by magnitude, imposing an ordering (p, m)_k.
2. Pick k′, calculate the hit and false-alarm rates caused by predicting the top k′ pairs (p, m)_k in the list, and plot the point.

By selecting different values of k′ (e.g., incrementing k′ by a fixed amount) we draw a curve on the graph.
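A minimal sketch of this construction is given below, assuming the predictions and the binary test-set outcomes are stored in dictionaries keyed by (person, item) pairs and that the test set contains both positive and negative pairs; the function name and the num_points parameter are illustrative choices, not prescribed by the paper.

```python
def roc_points(pred, actual, num_points=20):
    """Unconstrained (standard ROC) evaluation.

    pred:   dict mapping (person, item) -> predicted score
    actual: dict mapping (person, item) -> 1 if liked/purchased in the test set, else 0
    Returns (false_alarm_rate, hit_rate) points as the cutoff k' grows.
    """
    ranked = sorted(pred, key=pred.get, reverse=True)      # one global ranked list
    total_pos = sum(actual[pair] for pair in ranked)
    total_neg = len(ranked) - total_pos

    points, tp, fp = [], 0, 0
    step = max(1, len(ranked) // num_points)
    for k, pair in enumerate(ranked, start=1):
        if actual[pair]:
            tp += 1        # recommended a pair the user actually liked/purchased
        else:
            fp += 1        # recommended a pair the user did not like/purchase
        if k % step == 0 or k == len(ranked):
            # fn = total_pos - tp and tn = total_neg - fp, hence:
            points.append((fp / total_neg, tp / total_pos))
    return points
```

The area under the resulting curve can then be approximated from these points with, for example, the trapezoidal rule.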
3.2. Constrained evaluation: the CROC curve

For the constrained mode of recommending we introduce the CROC, or Customer ROC, curve. The curve plots hit rates and false-alarm rates as defined in the standard ROC curve setting; however, a constraint is imposed so that each user gets the same number of recommendations. This is achieved by creating one separate ranked list for each user in the data set and plotting points along the curve by recommending the top k items on each list. Note that the |P| ranked lists can have independent orderings of items, so users are potentially recommended separate sets of items.

In most forms of recommender system evaluation, users have different numbers of observations in the testing set; it is common practice to remove training set data from consideration in testing. In this case the |P| lists have different lengths. There are exceptions to this rule, for instance when performing certain types of cold-start evaluation where none of the test set events occur in training. Also, there may be alternative scenarios where a data set contains information about repeated purchases of an item, and therefore it should be permitted to recommend all person/item pairs in testing. However, the evaluations in this paper assume that we do not want to recommend an item if we know that a user has previously seen, purchased or rated that item.

Let n(p) be the number of items available to user p in the test set observations. According to the CROC curve evaluation, the recommendation problem is essentially to guess which k of the n(p) items each user p will like/purchase. When n(p) varies by p (i.e., the |P| lists have different lengths), we make at most n(p) recommendations for user p. The algorithm for generating the CROC curve is:

1. For each person p_i, order the predictions pred(p_i, m_j) in a list by magnitude, imposing an ordering (m)_k.
2. Pick k′, calculate the hit and false-alarm rates caused by recommending the top min(k′, n(p)) predicted movies to each person p from their own list, and plot the hit rate against the false-alarm rate.

By varying k′ we generate the different points along the curve.
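Under the same assumptions as the ROC sketch above (binary outcomes in an actual dictionary covering the same person/item pairs as pred), a constrained CROC construction might look like the following; slicing each user's own ranked list with items[:k] recommends at most min(k′, n(p)) items per user.

```python
from collections import defaultdict

def croc_points(pred, actual, num_points=20):
    """Constrained (CROC) evaluation: at each cutoff k', every user receives the
    top min(k', n(p)) items from his or her own ranked list."""
    per_user = defaultdict(list)
    for (person, item), score in pred.items():
        per_user[person].append((score, item))
    for person in per_user:
        per_user[person].sort(reverse=True)                # each user's ranked list

    total_pos = sum(actual.values())
    total_neg = len(actual) - total_pos
    max_len = max(len(items) for items in per_user.values())

    points = []
    step = max(1, max_len // num_points)
    cutoffs = list(range(step, max_len, step)) + [max_len]
    for k in cutoffs:
        tp = fp = 0
        for person, items in per_user.items():
            for score, item in items[:k]:                  # at most min(k, n(p)) items
                if actual[(person, item)]:
                    tp += 1
                else:
                    fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points
```

Because every user contributes only the top of his or her own list at each cutoff, no single user can dominate the recommendation set, in contrast to the unconstrained ROC construction.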
There are some important differences between the CROC curve and the ROC curve. For example, the perfect recommender in a CROC curve is not necessarily a horizontal line one unit above the x-axis. To see why, imagine evaluating an omniscient (i.e., perfect)