A new recommender system to combine content-based and collaborative filtering systems

Received: 22nd February, 2001

Byung-Do Kim
is Assistant Professor of Marketing at the School of Business Administration, Seoul National University, Korea. He was previously on the faculty of Carnegie Mellon University, Pittsburgh, USA. His current research interests include various econometric and statistical modelling issues on consumer choice behaviour, e-commerce, reward programmes and database marketing. His previous research has appeared in Journal of Business & Economic Statistics, Journal of Interactive Marketing, Journal of Marketing Research, Journal of Retailing, Marketing Letters and Marketing Science, among others.

Sun-Ok Kim
is a doctoral candidate at the School of Business Administration, Seoul National University, Korea. She received her BBA from Yonsei University, Korea and received her MBA from Seoul National University, Korea. Her current research interests include recommender systems, consumer choice modelling, database marketing and retailing.

Abstract
The enormous number of choices often creates confusion for consumers, so they often like to get the opinion of other people in order to make better buying decisions. Many e-commerce sites are implementing recommender systems to help their customers find the most valuable products and services. There are two fundamentally different approaches, the content-based and collaborative filtering techniques, to recommend products to customers based on their historical preferences. A new recommendation algorithm to combine these two systems is proposed in this paper. Applying the model to film rating data, the model is shown to perform better than the previous recommendation models in terms of predictive accuracy. How the model can be applied to personalise Internet shopping based on customers’ transaction history is also discussed.

Byung-Do Kim
Seoul National University, School of Business Administration, 56-1 Shinlim-dong, Kwanak-ku, Seoul, 151-742, Korea. Tel: 82-2-880-8258; Fax: 82-2-878-3154; e-mail: bxk@plaza.snu.ac.kr

Journal of Database Marketing Vol. 8, 3, 244–252 Henry Stewart Publications 1350-2328 (2001)

INTRODUCTION
Consumers use the evaluation or opinion of other people as an important information source.1 People like to get recommendations when they perceive a risk in making a purchase decision or when they want to simplify their buying decision. For instance, when a consumer buys a camcorder, the consumer may ask their friends who have knowledge or experience of camcorders, or they may ask a salesperson to help them buy the best camcorder.

Recommendation becomes even more important in the Internet-based shopping environment where consumers do not make physical contact with products and face higher cognitive risk. In addition, e-commerce sites offer a very large number of alternatives since they do not have any physical constraint on inventory or shelf space. Hence, consumers may be confused by the number of choices. If the consumer is not familiar with the Internet, the problem becomes even more serious. In order to solve these
problems, several e-commerce sites are employing recommender systems to help their customers make their purchase decisions more efficiently.2

A recommender system is an electronic agent that helps customers to find the most valuable products/services based on their historical preferences or tastes.3,4 In fact, as the importance of e-commerce increases, the recommender system becomes an essential tool in implementing personalised marketing. A well-designed recommender system analyses the inferred or stated preference of each customer and automatically suggests a set of products/services. This paper focuses on recommender systems which suggest products/services based on customers’ stated preferences or previous purchase histories, even though there are several other types.5 Within this class there are two fundamentally different approaches, the content-based and collaborative filtering techniques.

The content-based recommender system suggests products to consumers by analysing the content of items that they liked in the past.6 Features and attributes of products can be contents of items. Its underlying assumption is that the content of an item is what determines the user’s preference.7 Content-based systems have been widely used in various applications. For example, search engines such as Yahoo and Alta Vista recommend relevant documents from user-supplied keywords.8 Amazon.com recommends new books and/or albums based on customers’ favourite authors or musicians.

The content-based approach is an effective recommendation tool, especially for new items. It has several limitations, however.9,10 First, it often provides bad recommendations since it only considers the pre-specified contents for products/services. If two items have the same contents, it will predict them to have identical ratings. Secondly, the content-based system tends to restrict the scope of the recommendation to items similar to those the consumer has already rated.11 Finally, there is no way to provide recommendations for new customers because it knows nothing about their preferences.12

In contrast, the collaborative filtering technique recommends items that similar consumers have liked. Consumers in the collaborative filtering system share their evaluations and opinions regarding each product so that other consumers can better decide which items to choose.13 It automates the process of word-of-mouth communication among consumers. Collaborative filtering overcomes the limitations of the content-based systems by enabling consumers to share their opinions and experiences about products. It has been successfully applied to many e-commerce sites (eg books, music CDs, films, wines, etc.). It also has limitations, though. First, collaborative filtering does not work very well when the number of evaluators/users is small relative to the volume of information in the system. That is, it is difficult to find similar users in predicting ratings for some unpopular products. Secondly, it has the early rater problem that occurs when a new product/item appears in the database. Collaborative filtering cannot provide predictive ratings for a new product until other consumers have evaluated it.

The main purpose of the paper is to develop a hybrid model that combines the content-based and collaborative filtering systems. Generalising from the previous models, the new model can be flexibly applied across various contexts and overcome the weaknesses of the content-based and collaborative filtering techniques. Applying the model to film rating data, the new model is shown to perform better than previous
recommendation models in terms of predictive accuracy.

The rest of the paper is organised as follows. In the next section the content-based and collaborative techniques are described more formally, and a hybrid model is developed to combine them. Why the new model is theoretically better than the existing models is also discussed. In the following section the new model is applied and shown to perform better than the existing recommender systems in terms of two statistical criteria. The marketing implications of the model and its extension to e-commerce sites are then explored. Finally, the limitations of the model are discussed along with future research directions and the authors’ conclusions.

DEVELOPING A NEW RECOMMENDER SYSTEM
Recognising that the content-based and collaborative filtering systems each have their advantages and disadvantages in recommending products, researchers have attempted to develop a hybrid model to combine the two approaches.14–18 Claiming that their models take advantage of the collaborative filtering approach without losing the benefit of the content-based approach, they have shown that their models perform better than the individual approach.

Consistent with this research trend, the authors have developed a hybrid recommender system to combine the content-based and collaborative filtering systems. The point of departure of their model is extraction of the content component of products/items by employing a regression and then application of collaborative filtering to the consumer’s preference unexplained by this (content-based) regression. Marketing researchers have traditionally used multiattribute approaches (eg preference regression) to explain consumers’ preferences for products by a set of their attributes. These models, however, often lead to poor predictions about customer preferences because of missing information such as undiscovered attributes or important attribute interactions, sensory or experiential attributes and word-of-mouth effects.19 The collaborative filtering component of the new model can be used to capture this missing information.

Before describing the model in greater detail, it is helpful to look at the input data to understand the task more clearly. The typical input data for a recommender system is represented in the form of (evaluation) ratings on each product/item. As shown in Table 1, it is an n × m user-item matrix with each cell representing a user/consumer’s rating on a specific item/product. The main task is to predict the preference (or rating) for missing cells based on other observed evaluations. For example, Amy has rated Films 1, 2, 4 and M. Then what is Amy’s predicted rating for Film 3? Similarly, the missing ratings for other customers are predicted. Once all the predicted film ratings have been obtained, film recommendations can be provided for each customer (eg suggest three highly-rated films for each customer).

The algorithm of the model consists of six major steps. First, a set of content components characterising all products/items needs to be determined. For example, consider a film recommendation site such as www.moviecritic.com. Here, site visitors can get film recommendations once they register and evaluate a minimum of 12 films. Key features (or contents) determining a visitor’s preference for a film may be the genre of the film (eg comedy, drama, action), the director, the producer, the main actors/actresses and so on.
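As a rough illustration of the input data and of the first step, the sketch below stores the named rows and columns of Table 1 as a user-item matrix with missing cells and attaches a small set of genre dummies to each film. The genre assignments, the variable names and the helper `missing_items` are hypothetical and serve only this illustration; the elided rows and columns of Table 1 are left out.

```python
import numpy as np

users = ["Amy", "Joseph", "Michael", "Jim", "Laura"]
films = ["Film 1", "Film 2", "Film 3", "Film 4", "Film M"]

# Ratings for the named users and films of Table 1.
# np.nan marks a cell the consumer has not rated.
ratings = np.array([
    [5,      2,      np.nan, 4,      1],       # Amy
    [1,      np.nan, 1,      2,      np.nan],  # Joseph
    [np.nan, 4,      3,      np.nan, 5],       # Michael
    [3,      1,      np.nan, 1,      2],       # Jim
    [5,      3,      4,      np.nan, 1],       # Laura
])

# Hypothetical content components: one 0/1 dummy per genre and film.
# These genre assignments are invented for the example only.
genres = ["comedy", "drama", "action"]
features = np.array([
    [1, 0, 0],  # Film 1
    [0, 1, 0],  # Film 2
    [1, 0, 1],  # Film 3
    [0, 0, 1],  # Film 4
    [0, 1, 1],  # Film M
])


def missing_items(user):
    """Return the films for which the given user's rating must be predicted."""
    row = ratings[users.index(user)]
    return [films[j] for j in np.flatnonzero(np.isnan(row))]
```

For Amy, `missing_items("Amy")` returns the single film whose rating has to be predicted, Film 3.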
Table 1: Input data for recommendation system

           Film 1  Film 2  Film 3  Film 4  ...  Film M
Amy           5       2       .       4    ...     1
Joseph        1       .       1       2    ...     .
Michael       .       4       3       .    ...     5
....        ....    ....    ....    ....   ...   ....
Jim           3       1       .       1    ...     2
Laura         5       3       4       .    ...     1

Secondly, the following regression model is applied for each customer once the key features have been identified:

$$R_{ij} = \beta_{0i} + \beta_{1i} X_{1ij} + \dots + \beta_{Ki} X_{Kij} + \varepsilon_{ij} \qquad (1)$$

where $R_{ij}$ is the preference (or rating) of consumer $i$ for product $j$ and $X_{1ij}$ is the value of the first feature for product $j$ evaluated by consumer $i$. Note that $K$ features of the products are identified in this regression.

The parameters to be estimated, or $\beta$s in equation (1), measure how important each feature is in determining the preference of the consumer. Note that equation (1) is applied for each customer’s observed ratings. Once the parameters have been estimated, consumer $i$’s preference for products not yet evaluated can be predicted. For example, the regression is applied to Amy’s observed film preferences in Table 1. Upon estimation, Amy’s rating for Film 3 can be predicted with the estimated parameters and features of Film 3.

The procedure explained so far is no different from the content-based recommender system. That is, preferences of other consumers have not yet been used to predict consumer $i$’s preference. As noted in the previous section, however, it is possible for a consumer to rate two films with identical features differently because there may be other factors influencing her preference. The rest of the algorithm is required to explain these discrepancies.

Thirdly, based on the estimated regressions, the fitted preferences/ratings ($\hat{R}_{ij}$) are computed for all consumers and all products/items. Note that here the predicted ratings for both observed and unobserved (or missing) products are computed.

The fourth step is to create a data matrix of prediction errors. The prediction errors are defined as the difference between the actual preference and the predicted preference. That is, $\varepsilon_{ij} = R_{ij} - \hat{R}_{ij}$. In the regression context, the errors are the residuals of the regression model, or the preferences unexplained by the regression model in equation (1). Note that prediction errors cannot be calculated for products for which there are no actual ratings. Hence, consisting of a series of prediction errors with a set of missing values, the resulting data matrix of prediction errors looks similar to the input data matrix in Table 1.

Fifthly, the collaborative filtering technique is applied to the data matrix created in the previous step. The neighbourhood-based algorithm is employed among various collaborative filtering techniques.20 Here the goal is to calibrate the values for missing cells. In the neighbourhood-based method, it can be calculated as:

$$e_{t,j} = \bar{\varepsilon}_t + \tau \sum_{i=1}^{n} w_{t,i} (\varepsilon_{i,j} - \bar{\varepsilon}_i) \qquad (2)$$

where $e_{t,j}$ is the predicted value/rating of consumer $t$ on product $j$.
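Equations (1) and (2) can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than the authors’ implementation: the regression is fitted by ordinary least squares with NumPy, the similarity weight is taken to be the Pearson correlation between two consumers’ prediction errors on commonly rated items (one of several possible measures), the normalising factor is computed as the reciprocal of the sum of absolute weights, and the barred error terms are read as a consumer’s mean prediction error over rated items. All function names are hypothetical.

```python
import numpy as np


def fit_and_residuals(X, R):
    """Steps 2 to 4: per-consumer regression, fitted ratings, prediction errors.

    X : (m, K) array of item features (content components).
    R : (n, m) array of ratings, np.nan where consumer i has not rated item j.
    Returns (R_hat, E): fitted ratings for every cell, and prediction errors
    that are np.nan wherever the actual rating is missing.
    """
    n, m = R.shape
    design = np.column_stack([np.ones(m), X])  # intercept plus K features
    R_hat = np.empty((n, m))
    for i in range(n):
        seen = ~np.isnan(R[i])
        # Least squares on consumer i's observed ratings only (equation 1).
        beta, *_ = np.linalg.lstsq(design[seen], R[i, seen], rcond=None)
        R_hat[i] = design @ beta  # fitted values for rated and unrated items
    return R_hat, R - R_hat


def pearson(a, b):
    """Pearson correlation over the items for which both rows have errors."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    x = a[both] - a[both].mean()
    y = b[both] - b[both].mean()
    denom = np.sqrt((x @ x) * (y @ y))
    return float(x @ y / denom) if denom > 0 else 0.0


def predict_residual(E, t, j):
    """Equation (2): neighbourhood-based estimate of the missing error e[t, j]."""
    mean_t = float(np.nanmean(E[t]))
    # Consumers other than t who have a prediction error for item j.
    raters = [i for i in range(E.shape[0]) if i != t and not np.isnan(E[i, j])]
    w = np.array([pearson(E[t], E[i]) for i in raters])
    if np.abs(w).sum() == 0:  # no usable neighbours: fall back to the mean
        return mean_t
    tau = 1.0 / np.abs(w).sum()  # absolute weights then sum to one
    dev = np.array([E[i, j] - np.nanmean(E[i]) for i in raters])
    return mean_t + tau * float(w @ dev)
```

The matrix `E` returned by `fit_and_residuals` has missing values exactly where the rating matrix does, which is the property the fourth step relies on.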
In equation (2), $n$ is the number of consumers in the collaborative filtering database who have evaluated product $j$. The weight $w_{t,i}$ is the similarity between consumer $i$ and the (target) consumer $t$. $\tau$ is a normalising factor such that the absolute values of the weights sum to one.

Back to the (film) rating example given in Table 1, suppose that Amy’s rating on Film 3 is predicted. In the neighbourhood-based method, it is given by the weighted average of Joseph, Michael, Laura and others’ ratings on Film 3. In addition, the weights ($w_{t,i}$) are determined by how similar Amy is to other evaluators in terms of film ratings. There are many ways to specify this similarity measure, including the Pearson correlation coefficient, the constrained Pearson correlation, the Spearman rank correlation coefficient and the vector similarity.21 There are many other important issues in implementing collaborative filtering but they will not be described in this paper; interested readers should see Sarwar et al.22

The final step is to sum the output from the third and fifth steps. That is, the content-based approach in step 3 provides $\hat{R}_{ij}$ while the collaborative filtering in step 5 produces $e_{ij}$. The predicted preference of product $j$ for consumer $i$ is the sum of these two numbers. Now the algorithm can be summarised:

Step 1: determine a set of content components characterising all products/services
Step 2: fit the (contents) regression for each consumer
Step 3: calculate the fitted preferences for all consumers and all products
Step 4: create a data matrix of prediction errors
Step 5: apply the collaborative filtering technique to the data matrix
Step 6: sum the output from Step 3 and Step 5.

DATA AND ESTIMATION RESULTS
In this section the model is applied to actual film rating data, the EachMovie database, supplied by DEC systems. The database was collected for 18 months to September 1997. It includes 2,811,983 ratings for 1,628 different films from over 70,000 users. It also has some information on users (eg age, sex and zip-code) and films (eg name, genre, release date). Users were instructed to evaluate films on a six-point scale from 1 to 0 (1, 0.8, 0.6, 0.4, 0.2, 0). A higher value indicates a stronger preference for the item.

Fifty users were randomly selected from the database, each with more than 120 film ratings, to validate the model. The 50 users selected have a total of 9,026 ratings on 1,103 film items. For each user, 5 per cent of the ratings were withheld as the validation sample. Sarwar et al.23 adopted the same sampling method and this model is compared with their filter-bot hybrid model.

Four other competing models are applied to the film rating data. First, a baseline model is employed to benchmark the performance of other personalised recommender systems. It predicts the rating for each film by the mean rating across users.

Secondly, the content-based recommender system is fitted where the genres of the films are used as the contents of the film/item. A dummy variable is created for each of the ten genre variables including comedy, drama, action, art/foreign, classic, animation, family, romance, horror and thriller. A film can be simultaneously classified into more than one of these genres. Actual film ratings are regressed on the ten genre dummies in the estimation sample for