recommender on a data set with three people, each with six observations: person a likes only four out of six movies, person b likes only two out of six movies, and person c likes all six movies. When we recommend four movies to each person, we end up with two false positives from person b, lowering the area of the curve. However, for any particular data set, we can plot the curve and calculate the area of the omniscient recommender in order to facilitate comparison.

Random performance in the CROC curve is another important behavior to understand. A natural question to ask is whether a random recommender gives an expected plot consisting of a forty-five degree line, as in the ROC curve. When each user has the same number of test set observations this property holds for the CROC plot, as we will prove. An experiment where k items are sampled without replacement from a population of size N with s successes in the population follows the hypergeometric distribution [Feller, 7]. Consequently, the expected number of successes is sk/N. Assume for the moment that each user has the same number of observations (items) available for recommendation in a test set: n(p) = N. Further, we denote the number of items each user has rated highly (success events) in the entire test set pool as s(p). In what follows, |P| is the total number of users we make recommendations for, and S is the total number of successes, summed over all users.

Theorem 1 (CROC random recommendation). The expected CROC curve of a random recommender on a test set where each user has the same number of observations is a forty-five degree line.

Proof: Making k recommendations to user p leads to an expected number of hits ks(p)/N. Summing over all users we obtain the expected number of hits

    \frac{k}{N} \sum_{p} s(p) = \frac{kS}{N},    (1)

leading to an expected hit rate of k/N after dividing by the total number of successes S. Making k recommendations to user p leads to an expected number of false alarms k(N − s(p))/N. Summing over all users we obtain

    \frac{k}{N} \sum_{p} [N - s(p)] = \frac{k(|P|N - S)}{N},    (2)

leading to an expected false-alarm rate of k/N after dividing by the total number of failures |P|N − S. The expected CROC point coordinates after k random recommendations are therefore (k/N, k/N), leading to a forty-five degree line. □

For instance, in the three-person data set above, N = 6, |P| = 3 and S = 4 + 2 + 6 = 12; making k = 4 random recommendations gives 4 · 12/6 = 8 expected hits out of 12 successes and 4 · (18 − 12)/6 = 4 expected false alarms out of 18 − 12 = 6 failures, so both rates equal 4/6 = k/N.

The theorem applies to cold-start implicit rating and cold-start explicit rating prediction, as defined in Section 5. Section 6.2 provides an opportunity to confirm that the forty-five degree line described in the theorem holds up well in practice. Our other evaluations do not invoke this theorem because different users have different test set observation counts. When the number of observations varies across users, we can currently offer no theoretical guarantee on the random recommender. With a little thought it becomes clear that in circumstances where having fewer observations is correlated with having a greater number of positive outcomes, the expected curve will lie above the forty-five degree line. The lack of a theoretical guarantee has not been an impediment to other performance measures such as mean absolute error or precision/recall curves. In cases where we cannot invoke the theorem, we can simulate random recommendation and plot the CROC curve to provide this baseline.
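Such a simulated baseline is straightforward to produce. The following is a minimal sketch (Python with NumPy; the synthetic labels and all names are illustrative, not the evaluation code used in this paper) that draws k random recommendations per user and accumulates the CROC coordinates. When every user has the same number of test items, the resulting points fall near the forty-five degree line of Theorem 1; averaging over repeated runs smooths the curve toward its expectation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic test set: one row of binary relevance labels per user.
    # Every user has the same number of test items N, as in Theorem 1.
    n_users, N = 200, 50
    labels = rng.random((n_users, N)) < rng.random((n_users, 1))  # s(p) varies by user

    S = labels.sum()                # total successes over all users
    F = labels.size - S             # total failures over all users
    hit_rates, fa_rates = [], []
    for k in range(N + 1):
        hits = 0
        for p in range(n_users):
            recs = rng.choice(N, size=k, replace=False)  # k random recommendations
            hits += labels[p, recs].sum()
        fas = k * n_users - hits    # recommendations that are not successes
        hit_rates.append(hits / S)  # CROC y-coordinate after k recommendations
        fa_rates.append(fas / F)    # CROC x-coordinate after k recommendations

    # Plotting hit_rates against fa_rates traces the simulated CROC baseline.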
3.3. Interpreting ROC and CROC curves

On both ROC and CROC curves, the portion of the curve of special interest in evaluation is the left-hand side. The left-hand side corresponds to making k recommendations where k is small in comparison to the total number of recommendations we could make. In most applications we have the opportunity to recommend only a few items before a user tires of looking at the list of suggestions, so performance on the left-hand side of the curve is critical. In our present experiments we plot the entire curve and calculate the area under the entire curve. Alternatively, we might have truncated the curve at a low false-alarm rate such as 0.3 and calculated the area under this portion of the curve to lend greater emphasis to the region of interest.

4. Recommender system algorithms tested

In this paper, we apply and evaluate three probabilistic methods for describing the relationship between people and items: a logistic principal component analysis, an aspect model, and a naïve Bayes model. We will also employ popularity-ranked recommendation methods that effectively recommend the most popular items or recommend only to users that seem to like everything. Popularity rank methods are interesting because they are the simplest to implement and, as we will see in the evaluations, sometimes outperform other algorithms. For consistency's sake, the models are denoted in the form of a movie recommendation task, with the symbol m standing for a particular movie. However, in our evaluation we will recommend web pages as well.
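To illustrate how simple a popularity rank method can be, here is a minimal sketch (Python; the toy observations and names are hypothetical, not the paper's implementation) of the item-popularity variant: rank items by training-set frequency and recommend the most popular items a user has not yet seen.

    from collections import Counter

    # Hypothetical training observations: (person, item) tuples.
    train = [("p1", "m1"), ("p1", "m2"), ("p2", "m1"),
             ("p3", "m1"), ("p3", "m2"), ("p3", "m3")]

    popularity = Counter(item for _, item in train)   # global item counts
    seen = {}
    for person, item in train:
        seen.setdefault(person, set()).add(item)

    def recommend(person, k):
        """Return the k most popular items this person has not seen."""
        candidates = [item for item, _ in popularity.most_common()
                      if item not in seen.get(person, set())]
        return candidates[:k]

    print(recommend("p2", 2))   # -> ['m2', 'm3']

The complementary user-side variant scores users by how many items they appear to like and concentrates recommendations on those users.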
4.1. Logistic principal component analysis

Logistic principal component analysis (LPCA) [Schein et al., 33] is a dimensionality reduction technique for binary data that is derived from a generalization of standard principal component analysis in the same way that logistic regression is derived from the generalized linear model framework. In the movie recommendation domain we use |P|, |M| and |L| to denote the number of people, movies and latent dimensions, respectively. Given a Person × Movie matrix X of binary occurrence data indicating whether person p sees movie m, we hypothesize a latent space of dimension |L|, where |L| ≪ |M|. We reduce the original data X to a set of coordinates (U, a |P| × |L| matrix) in a lower dimensional axis (V, an |L| × |M| matrix). We calculate the coordinates and axis (U and V) by maximizing the likelihood

    \mathcal{L} = \sum_{p=1}^{|P|} \sum_{m=1}^{|M|} \left[ X_{pm} \log \sigma(\Theta_{pm}) + (1 - X_{pm}) \log \sigma(-\Theta_{pm}) \right],    (3)

where Θ factors into U and V as described in Table 1, and the logistic function σ(θ) is defined as

    \sigma(\theta) = \frac{1}{1 + \exp(-\theta)}.    (4)

Table 1. Notation for the logistic PCA model.

    |P|    number of observations (people)
    |M|    dimensionality of data (movies)
    |L|    dimensionality of latent space
    U      coefficients (|P| × |L| matrix)
    V      basis vectors (|L| × |M| matrix)
    Δ      bias vector (1 × |M| vector)

    Θ_pm = (UV)_pm + Δ_m

In our benchmarking experiments we estimate the parameters through the alternating least squares (ALS) algorithm described in Schein et al. [33], but fit the dimensions in a stagewise fashion (e.g., determine dimension k − 1, fix it, and then determine dimension k). We set the number of dimensions by performing validation on a portion of the training data, using a separate test set of observations only at final evaluation.
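As a concrete illustration of equations (3) and (4), the sketch below (Python with NumPy; random parameters stand in for ALS-fitted ones, and all names are illustrative) evaluates the LPCA log-likelihood of a binary matrix and ranks one person's unseen movies by σ(Θ_pm).

    import numpy as np

    rng = np.random.default_rng(0)
    P, M, L = 100, 40, 5                          # |P| people, |M| movies, |L| latent dims

    X = (rng.random((P, M)) < 0.2).astype(float)  # binary occurrence matrix
    U = rng.normal(size=(P, L))                   # coordinates (ALS would fit these)
    V = rng.normal(size=(L, M))                   # basis vectors
    delta = rng.normal(size=(1, M))               # per-movie bias vector

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    Theta = U @ V + delta                         # Theta_pm = (UV)_pm + Delta_m

    # Equation (3): Bernoulli log-likelihood of X under the model.
    loglik = np.sum(X * np.log(sigmoid(Theta)) + (1 - X) * np.log(sigmoid(-Theta)))

    # Recommendation: rank the movies person p has not seen by sigma(Theta_pm).
    p = 0
    unseen = np.flatnonzero(X[p] == 0)
    top_five = unseen[np.argsort(-sigmoid(Theta[p, unseen]))][:5]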
4.2. The two-way aspect model

The aspect model, a latent class variable framework designed for contingency table smoothing [Hofmann, 15], appears natural for the task of predicting an association between p and m. Figure 1(a) shows a graphical model description of the aspect model for a person/movie contingency table, and Table 2 explains the notation used in the graphical model as well as in other descriptions of the aspect model applied to the movie recommendation task.

Figure 1. Graphical model of (a) person/movie aspect model and (b) person/actor aspect model. These graphs can be interpreted precisely as Bayesian belief networks.

Table 2. Notation used in our aspect model descriptions.

    Random variable    Object    Interpretation
    P                  p         person
    M                  m         movie
    A                  a         actor
    Z                  z         latent class

4.2.1. Pure collaborative filtering model

The aspect model of Figure 1(a) encodes a probability distribution over each person/movie pair. In other RS domains we replace the movie symbol M with the item of interest. Observations consist of tuples (p, m) recording that person p has seen/rated movie m. We store observations in a count matrix or contingency table with rows ranging over people and columns ranging over movies (or vice versa). In some domains the data may include multiple identical observations (e.g., Lyle saw Memento twice). With each observation we increment by one the count of the appropriate contingency table cell (or matrix entry). A naïve probability estimate for each cell is simply the observed frequency of events in that cell. However, notice that under this method of assigning probabilities, an empty cell implies that there is zero probability of the corresponding person seeing the corresponding movie, clearly an unrealistic inference.

An aspect model hypothesizes the existence of a hidden or latent variable z (e.g., an affinity for a particular style of movies) that motivates person p to watch movie m. According to the generative model semantics, person p chooses a latent class z, which in turn determines the movie m watched. The choice of movie m is assumed independent of p given knowledge of z. Since z is hidden, we sum over the possible choices to define the distribution over (p, m):

    P(p, m) = \sum_{z} P(p) P(z \mid p) P(m \mid z).    (5)

The parameters P(z|p) and P(m|z) correspond to the processes of p stochastically choosing a latent class z, and z stochastically choosing m. The P(p, m) values can be thought of as smoothed estimates of the probability distribution of the contingency table. The latent variables perform the smoothing in a manner that maximizes the model likelihood (by keeping estimates of P(p, m) close to the empirical distribution). The model also creates smoothed estimates of the values P(p) and P(m), both taking their interpretations from contingency table analysis. The parameters are calculated using the tempered EM algorithm as described in [Hofmann, 15]. We choose the number of latent classes using performance on a partition of the training data as the criterion. Our own source code for fitting the two-way aspect model is available online [Schein et al., 30].

4.2.2. Adding content information

The recommender system described so far is a pure collaborative filtering algorithm developed by Hofmann and Puzicha [16]. In our benchmarking, we will employ the pure collaborative filtering aspect model to the prediction of