A new collaborative filtering metric that improves the behavior of recommender systems

J. Bobadilla *, F. Serradilla, J. Bernal
Universidad Politécnica de Madrid, Computer Science, Crta. de Valencia, Km 7, 28031 Madrid, Spain

Article history: Received 21 July 2009; received in revised form 18 March 2010; accepted 19 March 2010; available online 23 March 2010.

Keywords: Collaborative filtering; Recommender systems; Metric; Jaccard; Mean squared differences; Similarity

Abstract

Recommender systems are typically provided as Web 2.0 services and are part of the range of applications that give support to large-scale social networks, enabling on-line recommendations to be made based on the use of networked databases. The operating core of recommender systems is based on the collaborative filtering stage, which, in current user to user recommender processes, usually uses the Pearson correlation metric. In this paper, we present a new metric which combines the numerical information of the votes with information that is independent of those values, based on the proportions of the common and uncommon votes between each pair of users. Likewise, we define the reasoning and experiments on which the design of the metric is based, and the restriction of applying it to recommender systems where the possible range of votes is not greater than 5. In order to demonstrate the superior nature of the proposed metric, we provide the comparative results of a set of experiments based on the MovieLens, FilmAffinity and NetFlix databases. In addition to the traditional levels of accuracy, results are also provided on the metrics' coverage, the percentage of hits obtained and the precision/recall.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Recommender systems (RS) cover an important field within collaborative services that are developed in the Web 2.0 environment [21,19,26] and enable user-generated opinions to be exploited in a sophisticated and powerful way. Recommender systems can be considered as social networking tools that provide dynamic and collaborative communication, interaction and knowledge.

Recommender systems cover a wide variety of applications [20,4,10,28,5,14,27], although those related to movie recommendations are the most well-known and most widely used in the research field [23,3,25]. The recommender systems stage that normally has the greatest influence on the quality of the results obtained is the collaborative filtering (CF) phase [1,17,6,34,33]. Collaborative filtering is based on making predictions about a user's preferences or tastes based on the preferences of a group of users that are considered similar to this user.

A substantial part of the research in the area of collaborative filtering centers on how to determine which users are similar to the given one; in order to tackle this task, there are fundamentally 3 approaches: memory-based methods, model-based methods and hybrid approaches.

Memory-based methods [22,37,35,40] use similarity metrics and act directly on the ratings matrix that contains the ratings of all users who have expressed their preferences on the collaborative service; these metrics mathematically express a distance between two users based on their respective ratings. Model-based methods [1] use the ratings matrix to create a model from which the sets of similar users will be established. Among the most widely used models we have: Bayesian classifiers [8], neural networks [18] and fuzzy systems [39].
Generally, commercial recommender systems use memory-based methods, whilst model-based methods are usually associated with research recommender systems.

Regardless of the method used in the collaborative filtering stage, the technical objective generally pursued is to minimize the prediction errors by making the accuracy [12,11,29] of the recommender systems as high as possible; nevertheless, there are other objectives that need to be taken into account [38]: avoiding overspecialization phenomena, finding good items, the credibility of recommendations, precision and recall measures, etc.

To date, various publications have been written which tackle the way recommender systems are evaluated. Among the most significant we have [17], which reviews the key decisions in evaluating collaborative filtering recommender systems: the user tasks, the type of analysis and datasets being used, the ways in which prediction quality is measured, and the user-based evaluation of the system as a whole. Hernández and Gaudioso [9] is a current study which proposes a recommendation filtering process based on the distinction between interactive and non-interactive subsystems.
General publications and reviews also exist which include the most commonly accepted metrics, aggregation approaches and evaluation measures: mean absolute error, coverage, precision, recall and derivatives of these (mean squared error, normalized mean absolute error, ROC and fallout). Goldberg et al. [13] focuses on the aspects not related to the evaluation; Breese et al. [6] compare the predictive accuracy of various methods in a set of representative problem domains. Candillier et al. [7] and Schafer et al. [36] review the main collaborative filtering methods proposed in the literature.

The rest of the paper is structured as follows. In Section 2 we provide the basis for the principles on which the design of the new metric will be based: we present graphs which show the way in which the users vote, we carry out experiments which support the decisions made, we establish the best way of selecting numerical and non-numerical information from the votes and, finally, we establish the hypothesis on which the paper and its proposed metric are based. In Section 3 we establish the mathematical formulation of the metric. In Sections 4 and 5, respectively, we list the experiments that will be carried out and we present and discuss the results obtained. Section 6 presents the most relevant conclusions of the publication.

2. Approach and design of the new similarity metric

2.1. Introduction

Collaborative filtering methods work on a table of U users who can rate I items. The prediction of a non-rated item i for a user u is computed as an aggregate of the ratings of the K most similar users (the k-neighborhoods) for the same item i, where $K_u$ denotes the set of k-neighborhoods of u and $r_{n,i}$ denotes the value of user n's rating of item i ($\bullet$ if there is no rating value).

Once the set of K users (neighborhoods) similar to the active user u has been calculated, in order to obtain the prediction of item i for user u, one of the following aggregation approaches is often used: the average (2), the weighted sum (3) and the adjusted weighted aggregation (deviation-from-mean) (4). We will use the auxiliary set $G_{u,i}$ in order to define Eqs. (2)–(5):
$$G_{u,i} = \{\, n \in K_u \mid \exists\, r_{n,i} \neq \bullet \,\} \qquad (1)$$

$$p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \neq \emptyset \qquad (2)$$

$$p_{u,i} = \mu_{u,i} \sum_{n \in G_{u,i}} \operatorname{sim}(u,n)\, r_{n,i} \iff G_{u,i} \neq \emptyset \qquad (3)$$

$$p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in G_{u,i}} \operatorname{sim}(u,n)\left( r_{n,i} - \bar{r}_n \right) \iff G_{u,i} \neq \emptyset \qquad (4)$$

where $\mu$ serves as a normalizing factor, usually computed as

$$\mu_{u,i} = 1 \Big/ \sum_{n \in G_{u,i}} \operatorname{sim}(u,n) \iff G_{u,i} \neq \emptyset \qquad (5)$$

The most popular similarity metrics are Pearson correlation (6), cosine (7), constrained Pearson's correlation (8) and Spearman rank correlation (9):

$$\operatorname{sim}(x,y) = \frac{\sum_i \left( r_{x,i} - \bar{r}_x \right)\left( r_{y,i} - \bar{r}_y \right)}{\sqrt{\sum_i \left( r_{x,i} - \bar{r}_x \right)^2 \sum_i \left( r_{y,i} - \bar{r}_y \right)^2}} \qquad (6)$$

$$\operatorname{sim}(x,y) = \frac{\sum_i r_{x,i}\, r_{y,i}}{\sqrt{\sum_i r_{x,i}^2}\, \sqrt{\sum_i r_{y,i}^2}} \qquad (7)$$

$$\operatorname{sim}(x,y) = \frac{\sum_i \left( r_{x,i} - r_{med} \right)\left( r_{y,i} - r_{med} \right)}{\sqrt{\sum_i \left( r_{x,i} - r_{med} \right)^2}\, \sqrt{\sum_i \left( r_{y,i} - r_{med} \right)^2}}, \quad r_{med} \text{ being the median value of the rating scale} \qquad (8)$$

$$\operatorname{sim}(x,y) = \frac{\sum_i \left( rank_{x,i} - \overline{rank}_x \right)\left( rank_{y,i} - \overline{rank}_y \right)}{\sqrt{\sum_i \left( rank_{x,i} - \overline{rank}_x \right)^2 \sum_i \left( rank_{y,i} - \overline{rank}_y \right)^2}} \qquad (9)$$
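To make the notation concrete, the following Python sketch implements the Pearson similarity of Eq. (6) and the deviation-from-mean aggregation of Eqs. (4) and (5) on a small dictionary-based ratings matrix. The data, the names and the zero-denominator guards are our own illustrative assumptions, not the authors' code.

```python
import math

# Toy ratings matrix: {user: {item: vote}}; a missing key plays the role of
# the paper's "not rated" bullet value. Illustrative data, not a real dataset.
ratings = {
    "u1": {"i1": 4, "i2": 5, "i3": 3},
    "u2": {"i1": 3, "i2": 4, "i3": 4, "i4": 2},
    "u3": {"i2": 5, "i3": 3, "i4": 1},
}

def mean_rating(u):
    votes = ratings[u].values()
    return sum(votes) / len(votes)

def sim_pearson(x, y):
    # Eq. (6), computed over the items co-rated by x and y
    common = ratings[x].keys() & ratings[y].keys()
    if not common:
        return 0.0
    mx, my = mean_rating(x), mean_rating(y)
    num = sum((ratings[x][i] - mx) * (ratings[y][i] - my) for i in common)
    den = math.sqrt(sum((ratings[x][i] - mx) ** 2 for i in common)
                    * sum((ratings[y][i] - my) ** 2 for i in common))
    return num / den if den else 0.0

def predict(u, i, neighbours):
    # Eqs. (4) and (5): deviation-from-mean aggregation over G_{u,i},
    # the neighbours of u that have actually rated item i
    g = [n for n in neighbours if i in ratings[n]]
    if not g:
        return None  # item not covered by any neighbour: no prediction
    norm = sum(sim_pearson(u, n) for n in g)  # Eq. (5); abs() per term is a common variant
    if norm == 0:
        return mean_rating(u)
    dev = sum(sim_pearson(u, n) * (ratings[n][i] - mean_rating(n)) for n in g)
    return mean_rating(u) + dev / norm

print(predict("u1", "i4", neighbours=["u2", "u3"]))
```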
Although Pearson correlation is the most commonly used metric in the process of memory-based CF (user to user), this choice is not always backed by the nature and distribution of the data in the RS. Formally, in order to be able to apply this metric with guarantees, the following assumptions must be met:

- Linear relationship between x and y.
- Continuous random variables.
- Both variables must be normally distributed.

These conditions are not normally met in real RS, and Pearson correlation presents some significant cases of erroneous operation that should not be ignored in RS.

Despite its deficiencies, Pearson correlation gives the best prediction and recommendation results in CF-based RS [15,16,31,7,35]; furthermore, it is the most commonly used, and therefore any alternative metric proposed must improve on its results.

On accepting that Pearson correlation is the metric whose results must be improved, but not necessarily the most appropriate one to take as a base, it is advisable to focus on the information that is obtained in the different research processes and which can sometimes be overlooked when pursuing objectives other than improving the accuracy of RS (the cold-start problem, trust and novelty, sparsity, etc.).

The simplest information that gives us an idea of the nature of the RS is what the users usually vote: do they always tend to vote for the same values? Do they always tend to vote for the same items? Is there much difference between the votes of some users and others?

Fig. 1 shows the distribution of the votes cast in the MovieLens 1M and NetFlix RS (where votes lie in the interval [1..5]). We can see how, on average, users focus their votes on the higher levels of the interval while avoiding the extremes, particularly the lower ones. The distribution of the votes is not balanced, and particularly negative or particularly positive votes are avoided.

Fig. 2 shows the arithmetic average and the standard deviation of the votes cast in the MovieLens 1M and NetFlix databases. Graphs (A) and (B) of Fig. 2 show the number of items whose arithmetic average equals the value specified on the x axis; we can see that there are hardly any items rated, on average, below 2 or above 4, and most of the cases lie between the values 3 and 4. Graphs (C) and (D) of Fig. 2 show the number of items whose standard deviation equals the value specified on the x axis; we can see that for most items the users' votes show, on average, a maximum spread of 1.2 votes.

According to the figures analyzed, we find that traditional metrics must often achieve results by operating on a set of discrete ratings with very little variation (the majority of votes between 3 and 4) and with the obligation of improving simpler and quicker estimations, such as always predicting the arithmetic average of the votes of each item (for which we know there is seldom a standard deviation higher than 1.2).
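The per-item statistics behind Fig. 2 reduce to an elementary computation; as a minimal sketch (made-up votes, not the MovieLens or NetFlix data):

```python
import statistics

# item -> votes cast for it by the users (made-up values)
votes_per_item = {
    "i1": [3, 4, 4, 5],
    "i2": [2, 3, 3, 4],
    "i3": [4, 4, 5, 5, 3],
}

for item, votes in votes_per_item.items():
    avg = statistics.mean(votes)    # x axis of graphs (A) and (B)
    sd = statistics.pstdev(votes)   # x axis of graphs (C) and (D)
    print(f"{item}: average = {avg:.2f}, standard deviation = {sd:.2f}")
```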
2.2. Basic experimentation

After ascertaining the low diversity of the votes cast by the users, it seems reasonable to consider that the votes mainly tend to represent a positive or non-positive rating of the items, and to a lesser extent a range of these ratings; for instance, in a RS with possible votes situated in the interval [1..5], a 4 will generally represent a positive rating, which in some cases will be reinforced with the rating 5. Similarly, a 3 will represent a non-positive rating, which in some cases will be reinforced with the rating 2 or 1.

In order to test this hypothesis, we have designed an experiment on the MovieLens 1M database: we transformed all 4 and 5 votes into P votes (positive) and all 1, 2 and 3 votes into N votes (non-positive), in such a way that we aim to measure the impact made on the recommendations by doing without the detailed information provided by the numerical values of the votes.

In the experiment we compare the precision/recall obtained in the regular way (using the numerical values of the votes) with that obtained using only the discretized values P and N. For this purpose, we establish the relevance threshold at value 4 (θ = 4), assimilating "relevant" with "positive"; we use Pearson correlation, the deviation-from-mean aggregation approach, 20% of test users, 20% of test items, a number of recommendations from 2 to 20, and K = 150. The experiment has been repeated for values between K = 100 and K = 200, obtaining equivalent results.

Fig. 3 displays the results, which show that the "positive/non-positive" discretization not only does not worsen the precision/recall measurements, but rather improves them both, particularly the precision when the number of recommendations (N) is high. The numerical key to this improvement lies in the improved capacity of the discrete calculation to determine whether an item is recommended (based on the number of k-neighborhoods with value P for that item), as opposed to the calculation with numerical values (a prediction obtained by applying the selected aggregation approach to the numerical values of the votes and then thresholding it).

Fig. 1. Distribution of votes in RS: (A) MovieLens 1M and (B) NetFlix.

Fig. 2. Arithmetic average and standard deviation on the MovieLens 1M and NetFlix ratings of the items. (A) MovieLens arithmetic average, (B) NetFlix arithmetic average, (C) MovieLens standard deviation, (D) NetFlix standard deviation.

Fig. 3. Precision/recall obtained by transforming all 4 and 5 votes into P votes (positive) and all 1, 2 and 3 votes into N votes (non-positive), compared to the results obtained using the numerical values. 20% of test users, 20% of test items, K = 150, Pearson correlation, θ = 4.
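A compact sketch may clarify the discretized recommendation rule and the precision/recall measures. The data, the top-2 list size and the fixed set of relevant items below are hypothetical; the code only mirrors the general idea of counting P votes among the neighbours, not the exact experimental pipeline.

```python
THETA = 4  # relevance threshold used in the experiment (votes >= 4 are "positive")

def discretize(vote):
    # 4 and 5 become P (positive); 1, 2 and 3 become N (non-positive)
    return "P" if vote >= THETA else "N"

def precision_recall(recommended, relevant):
    # Standard definitions over a top-N recommendation list
    hits = len(set(recommended) & relevant)
    return (hits / len(recommended) if recommended else 0.0,
            hits / len(relevant) if relevant else 0.0)

# Votes cast by the active user's k-neighbours on candidate items (toy data)
neighbour_votes = {
    "i1": [5, 4, 4],
    "i2": [3, 2, 4],
    "i3": [4, 5, 5, 4],
    "i4": [2, 3],
}

# Discrete rule: rank items by how many neighbours voted P for them
p_counts = {item: sum(discretize(v) == "P" for v in votes)
            for item, votes in neighbour_votes.items()}
top_n = sorted(p_counts, key=p_counts.get, reverse=True)[:2]

print(top_n, precision_recall(top_n, relevant={"i1", "i3"}))
```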
2.3. Non-numerical information: Jaccard

As the numerical values of the votes appear to lose relevance in the recommendation process (though not in the prediction process), we are obliged to search for similarity information between users by looking beyond the specific values of their votes. In this sense, it is reasonable to focus our attention on two related aspects:

(1) Not to grant too much credibility to the similarity of two users based only on the similitude of a very limited set of common items: traditional metrics provide their measure of similarity without taking into account whether it has been obtained from few or many items voted by both users; this way, it is not improbable to find the highest similarity measurements associated with pairs of users with very few commonly voted items.

(2) Users who have voted for a large number of items should be compared, in so far as is possible, with other users with whom they have a large number of items voted in common; this way, for example, for users with around one thousand votes cast, a similarity of 0.78 calculated with some 150 common items is more convincing than a similarity of 0.82 calculated with 60 common items. In short, the proportion between the common votes and the total votes should be taken very much into consideration.

The most direct way to quantify the above-mentioned aspects is by using the Jaccard metric [24], which calculates the proportion between the number of items that two users have voted in common and the number of different items that both users have voted for in total, i.e. the intersection divided by the union of the items voted. In order to design our new metric correctly, we must discover the impact that Jaccard can have as a factor of similarity between users.
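Since Jaccard depends only on which items each user has rated, not on the vote values themselves, it can be sketched in a few lines (the item sets below are hypothetical):

```python
def jaccard(items_x, items_y):
    # |intersection| / |union| of the sets of items rated by each user
    union = items_x | items_y
    return len(items_x & items_y) / len(union) if union else 0.0

# Hypothetical sets of rated items for two users
items_u1 = {"i1", "i2", "i3"}
items_u2 = {"i2", "i3", "i4", "i6"}

print(jaccard(items_u1, items_u2))  # 2 common items / 5 distinct items = 0.4
```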
As Jaccard does not operate with the numerical values of the votes, it seems improbable that, on its own, it will be able to have a positive impact on the MAE of the RS; however, it is more plausible that the coverage can improve, by selecting users with more common votes and therefore, in general, with more votes cast.

On the other hand, it is reasonable to suspect that users who have voted for a sufficient proportion of items in common display common tastes: we know that negative ratings are not very widely used (Figs. 1 and 2), and therefore we can theorize that part of the common absences of votes between 2 users can mean common non-positive ratings that they preferred not to cast; likewise, we know that most of the votes cast have a positive rating (Figs. 1 and 2), and therefore many votes in common could denote a large number of positive ratings in common.

In order to try to test this theory, we have designed the following experiment: for every possible pair of users of MovieLens 1M we have calculated the Jaccard value, the MAE and the coverage obtained by establishing the first user of the pair as the active user and the second one as their only neighborhood. On the x axis we represent the possible Jaccard values, in the interval [0..1], where 0 indicates that there are no items in common and 1 indicates that all the items voted are in common.

In Fig. 4, graph 4A shows the number of pairs of users (y axis) which display the Jaccard values indicated on the x axis. As is to be expected, most of the cases present a very low overlap between the items voted by each pair of users. Graph 4B shows a direct relationship between the increase in the value of Jaccard and the accuracy obtained in the interval [0..0.4], in which the great majority of the cases are grouped together (graph 4A). Graph 4C shows a direct relationship between the Jaccard value and an improvement in the coverage.

The reasoning given and the results obtained justify the incorporation of the Jaccard metric as an integral part of a new metric for the similarity between users, which aims to improve the results provided by traditional metrics.

2.4. Numerical information: mean squared differences

By unifying the concepts and results set out in Section 2, we find the following:

(1) The similarity between two users (the core of the CF RS) is being based on numerical metrics of statistical origin (such as Pearson correlation) which should be applied to continuous variables, but which in fact are applied to discrete variables that, for the purposes of recommendation, only contain 2 values of use (positive/non-positive).

(2) We are not making use of non-numerical information of the votes which could be valuable in order to complement the numerical information and provide a metric which satisfactorily groups these two sources of information.

The most commonly used metrics (constrained Pearson correlation, Spearman rank correlation, cosine, Pearson correlation, etc.) display, to a greater or lesser extent, the deficiencies set out in reference to Pearson correlation; however, mean squared differences (MSD), based on the geometrical principles of the Euclidean distance, provides different characteristics which could be suitably complemented with Jaccard. This metric was tested in [35]; that study highlights the good accuracy results obtained using MSD, but at the cost of coverage values that make it unviable in general RS applications. The following paragraph of the conclusions of [35] sums up its possibilities: "the MSD metric offers very interesting results due to its different behavior compared to the other two studied (cosine and Pearson correlation) and its good results in all aspects except in the coverage, which is undoubtedly its weak point".

2.5. Hypothesis

The hypothesis on which this paper is based is that a suitable combination of Jaccard and MSD could complement Jaccard with the numerical values of the votes, and could mitigate the deficiencies in the coverage entailed in the use of MSD, in such a way that their joint use would enable the improvement of the results of traditional metrics in general and of Pearson correlation in particular, which will be used as the metric of reference whose results must be improved.

Although, a priori, the choice of the similarity measure MSD seems to be the most suitable, it is advisable to test Jaccard combined not only with MSD, but also with the most common metrics: Pearson correlation (PC) (6), cosine (COS) (7), constrained Pearson's correlation (CPC) (8) and Spearman rank correlation (SRC) (9).

In Fig. 5, graph 5A shows the evolution of the MAE using the metrics PC, Jaccard·PC, Jaccard·COS, Jaccard·CPC, Jaccard·SRC and Jaccard·(1 − MSD) applied to the MovieLens 1M database. As we expected, the best results are obtained by MSD. Graph 5B shows the coverage obtained by applying these same metrics; in this case, COS and MSD give the best results up to K = 300 and SRC gives the best results beyond this value of K. In short, the similarity measure MSD is confirmed as the best option in order to test the hypothesis of this paper.
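As an illustration of how such combinations can be formed, the following sketch computes the Jaccard·(1 − MSD) candidate. The [1..5] to [0..1] vote scaling and the toy ratings are our assumptions about the normalization, not the authors' exact formulation (which is formalized in Section 3).

```python
def scale(vote, lo=1, hi=5):
    # Map a vote on the [1..5] scale to [0..1] (assumed normalization)
    return (vote - lo) / (hi - lo)

def msd(rx, ry):
    # Mean squared difference over co-rated items, on normalized votes
    common = rx.keys() & ry.keys()
    if not common:
        return None
    return sum((scale(rx[i]) - scale(ry[i])) ** 2 for i in common) / len(common)

def jaccard(rx, ry):
    union = rx.keys() | ry.keys()
    return len(rx.keys() & ry.keys()) / len(union) if union else 0.0

def sim_jaccard_msd(rx, ry):
    # Candidate combination Jaccard * (1 - MSD), the best performer in Fig. 5
    m = msd(rx, ry)
    return None if m is None else jaccard(rx, ry) * (1.0 - m)

rx = {"i1": 4, "i2": 5, "i3": 3}
ry = {"i2": 4, "i3": 3, "i4": 5}
print(sim_jaccard_msd(rx, ry))  # 0.5 * (1 - 0.03125) = 0.484375
```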
3. Formalization of the new metric

The metric presented in this paper takes as its reference to be improved the most widely-used metric in user to user memory-based CF: Pearson correlation. However, the operating principles that rule this metric will not be taken as a base; rather, we will use the mean squared differences (MSD) metric, which is much less commonly used due to its low capacity to produce new recommendations.
MSD offers both a great advantage and a great disadvantage at the same time. The advantage is that it generates very good general results: a low average error, a high percentage of correct predictions and a low percentage of incorrect predictions. The disadvantage is that it has an intrinsic tendency to choose, as users similar to a given user, those users who have rated a very small number of items [35]. E.g. if we have 7 items that can be rated from 1 to 5 and three users u1, u2, u3 with the following ratings: u1: (•, •, 4, 5, •, •, •), u2: (3, 4, 5, 5, 1, 4, •), u3: (3, 5, 4, 5, •, 3, •) (• means a not rated item), the MSD metric will indicate that (u1, u3) have total similarity (MSD = 0), (u1, u2) have a similarity of 0.5 and (u2, u3) have a lower similarity (0.6). This situation is not convincing: intuitively we realize that u2 and u3 are very similar, whilst u1 is only similar to u2 and u3 in 2 ratings, and therefore it is not logical to choose it as the most similar to them; what is worse, if it is chosen it will not provide us with possibilities to recommend new items.

The strategy followed to design the new metric is to considerably raise MSD's capacity to generate predictions, without losing along the way its good behavior as regards the accuracy and quality of the results. The metric designed is based on two factors:

- The similarity between two users calculated as the mean of the squared differences (MSD): the smaller these differences, the greater the similarity between the 2 users. This part of the metric enables very good accuracy results to be obtained.

- The number of items which both one user and the other have rated, with regard to the total number of items rated between the two users. E.g. given users u1: (3, 2, 4, •, •, •) and u2: (•, 4, 4, 3, •, 1), a common rating has been made in two items out of a joint rating of five items. This factor enables us to greatly improve the metric's capacity to make predictions.

An important design aspect is the decision not to use any parameter whose value would have to be set arbitrarily, i.e. the result provided by the metric should be obtained by taking only the values of the ratings provided by the users of the RS.

By working on the 2 factors with standardized values [0..1], the metric obtained is as follows. Given the lists of ratings of 2 generic users x and y, $r_x = \left( r_x^1, r_x^2, r_x^3, \ldots, r_x^I \right)$ and $r_y = \left( r_y^1, r_y^2, r_y^3, \ldots, r_y^I \right)$, where I is the number of items of our RS and where one of the possible values of each

Fig. 4. Measurements related to the Jaccard metric on MovieLens. (A) Number of pairs of users that display the Jaccard values represented on the x axis. (B) Averaged MAE obtained in the pairs of users with the Jaccard values represented on the x axis. (C) Averaged coverages obtained in the pairs of users with the Jaccard values represented on the x axis.

Fig. 5. MAE and coverage obtained with Pearson correlation and by combining Jaccard with Pearson correlation, cosine, constrained Pearson's correlation, Spearman rank correlation and mean squared differences. (A) MAE, (B) coverage. MovieLens 1M, 20% of test users, 20% of test items, k ∈ [2..1500], step 25.
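As a toy check of the two worked examples above (our own code; None stands for the • symbol), the following sketch reproduces the quoted raw-scale MSD values of 0, 0.5 and 0.6 for (u1, u3), (u1, u2) and (u2, u3), and the 2-out-of-5 joint-rating proportion of the two-factor example:

```python
def msd_raw(rx, ry):
    # Mean squared difference over co-rated items, on the raw [1..5] votes
    common = [i for i in range(len(rx)) if rx[i] is not None and ry[i] is not None]
    return sum((rx[i] - ry[i]) ** 2 for i in common) / len(common)

def jaccard(rx, ry):
    voted = lambda r: {i for i, v in enumerate(r) if v is not None}
    return len(voted(rx) & voted(ry)) / len(voted(rx) | voted(ry))

u1 = (None, None, 4, 5, None, None, None)
u2 = (3, 4, 5, 5, 1, 4, None)
u3 = (3, 5, 4, 5, None, 3, None)
print(msd_raw(u1, u3), msd_raw(u1, u2), msd_raw(u2, u3))  # 0.0 0.5 0.6

v1 = (3, 2, 4, None, None, None)
v2 = (None, 4, 4, 3, None, 1)
print(jaccard(v1, v2))  # 2 common items / 5 jointly rated items = 0.4
```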