Availableonlineatwww.sciencedirectcom ° Science Direct Expert Sy stems with Applications ELSEVIER Expert Systems with Applications 35(2008)1386-1399 www.elsevier.com/locate/eswa Fuzzy-genetic approach to recommender systems based on a novel hybrid user model Mohammad Yahya H. Al-Shamri, Kamal K. Bharadwaj School of Computer and Systems Sciences, Jawaharlal Nehru Unicersity, New Delhi 110 067, India Abstract The main strengths of collaborative filtering(CF), the most successful and widely used filtering technique for recommender systems e its cross-genre or outside the box'recommendation ability and that it is completely independent of any machine-readable represen tation of the items being recommended. However, CF suffers from sparsity, scalability, and loss of neighbor transitivity. CF techniques are either memory-based or model-based. While the former is more accurate, its scalability compared to model-based is poor. An impor tant contribution of this paper is a hybrid fuzzy-genetic approach to recommender systems that retains the accuracy of memory-based CF and the scalability of model-based CF. Using hybrid features, a novel user model is built that helped in achieving significant reduc- tion in system complexity, sparsity, and made the neighbor transitivity relationship hold. The user model is employed to find a set of like linded users within which a memory-based search is carried out. This set is much smaller than the entire set, thus improving systems scalability. Besides our proposed approaches are scalable and compact in size, computational results reveal that they outperform the classical approach. e 2007 Elsevier Ltd. All rights reserved Keywords: Recommender systems; Collaborative filtering: Web personalization; User model; Fuzzy sets 1. Introduction down the information overload, and efficiently guide the user in a personalized manner to interesting items within The popular use of web as a global information system a very large space of possible options(Burke, 2002). Typi- has flooded us with a tremendous amount of data and cally rs recommend information (URLS, netnews articles ), information. The end users are overloaded with a huge entertainment(books, movies, restaurants), or individuals amount of information from myriad resources. This explo- (experts). Amazon. com (Linden, Smith, York, 2003) sive growth of data has generated an urgent need for pow- and Movie Lens. org(Miller, Albert, Lam, Konstan, erful automated web personalization tools that can Riedl, 2003)are two well-known examples of Rs on the intelligently assist us in transforming the vast amount of web data into useful information and knowledge. In other Recommender systems employ four information filter words, these tools ensure that the right information is ing techniques, demographic filtering(DMF)(Krulwich, delivered to the right people at the right time( Adomavicius 1997), content-based filtering(CBF)(Lang, 1995), collabo- Tuzhilin, 2005; Eirinaki vazirgiannis, 2003). Web rec- rative filtering(Resnick, Iakovou, Sushak, Bergstrom, ommender systems(RS), the most successful example of Riedl, 1994; Shardanand Maes, 1995), and hybrid filter- Web personalization tools, tailor information access, trim ing techniques(Balabanovic Shoham, 1997; Pazzani, 1999: Shahabi, Banaei-Kashani, Chen, McLeod, 2001) DMF categorizes the user based on the user personal attr Corresponding author. Mobile: \9\W9810196636: fax: +91(11) butes and makes recommendations based on demographic 26717528 mohamad. alshamriagmaiLcom(M Y.H. Al-Shamri), classes while the CBF suggests items similar to the ones the (KK. Bharadwan user preferred in the past 0957-4174/S- see front matter 2007 Elsevier Ltd. All rights reserved doi:10.1016/eswa2007.08.016
Fuzzy-genetic approach to recommender systems based on a novel hybrid user model Mohammad Yahya H. Al-Shamri , Kamal K. Bharadwaj School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi 110 067, India Abstract The main strengths of collaborative filtering (CF), the most successful and widely used filtering technique for recommender systems, are its cross-genre or ‘outside the box’ recommendation ability and that it is completely independent of any machine-readable representation of the items being recommended. However, CF suffers from sparsity, scalability, and loss of neighbor transitivity. CF techniques are either memory-based or model-based. While the former is more accurate, its scalability compared to model-based is poor. An important contribution of this paper is a hybrid fuzzy-genetic approach to recommender systems that retains the accuracy of memory-based CF and the scalability of model-based CF. Using hybrid features, a novel user model is built that helped in achieving significant reduction in system complexity, sparsity, and made the neighbor transitivity relationship hold. The user model is employed to find a set of likeminded users within which a memory-based search is carried out. This set is much smaller than the entire set, thus improving system’s scalability. Besides our proposed approaches are scalable and compact in size, computational results reveal that they outperform the classical approach. 2007 Elsevier Ltd. All rights reserved. Keywords: Recommender systems; Collaborative filtering; Web personalization; User model; Fuzzy sets 1. Introduction The popular use of web as a global information system has flooded us with a tremendous amount of data and information. The end users are overloaded with a huge amount of information from myriad resources. This explosive growth of data has generated an urgent need for powerful automated web personalization tools that can intelligently assist us in transforming the vast amount of data into useful information and knowledge. In other words, these tools ensure that the right information is delivered to the right people at the right time (Adomavicius & Tuzhilin, 2005; Eirinaki & Vazirgiannis, 2003). Web recommender systems (RS), the most successful example of Web personalization tools, tailor information access, trim down the information overload, and efficiently guide the user in a personalized manner to interesting items within a very large space of possible options (Burke, 2002). Typically RS recommend information (URLs, netnews articles), entertainment (books, movies, restaurants), or individuals (experts). Amazon.com (Linden, Smith, & York, 2003) and MovieLens.org (Miller, Albert, Lam, Konstan, & Riedl, 2003) are two well-known examples of RS on the web. Recommender systems employ four information filtering techniques, demographic filtering (DMF) (Krulwich, 1997), content-based filtering (CBF) (Lang, 1995), collaborative filtering (Resnick, Iakovou, Sushak, Bergstrom, & Riedl, 1994; Shardanand & Maes, 1995), and hybrid filtering techniques (Balabanovic & Shoham, 1997; Pazzani, 1999; Shahabi, Banaei-Kashani, Chen, & McLeod, 2001). DMF categorizes the user based on the user personal attributes and makes recommendations based on demographic classes while the CBF suggests items similar to the ones the user preferred in the past. 0957-4174/$ - see front matter 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.08.016 * Corresponding author. Mobile: +91 (11) 9810196636; fax: +91 (11) 26717528. E-mail addresses: mohamad.alshamri@gmail.com (M.Y.H. Al-Shamri), kbharadwaj@gmail.com (K.K. Bharadwaj). www.elsevier.com/locate/eswa Available online at www.sciencedirect.com Expert Systems with Applications 35 (2008) 1386–1399 Expert Systems with Applications
M.Y.H. Al-Shamri, K.K. Bharadwaj/ Expert Systems with Applications 35(2008)1386-1399 The most familiar and most widely implemented filter- as fuzzy. But at the item level it is difficult to fuzzify the ing is the collaborative filtering. The initial CF was intro- profile because it would require prohibitively large space duced by Tapestry(Goldberg, Nichols, Oki, Terry, and long processing time 1992), and was automated by GroupLens(Resnick et al Our work in this paper is an attempt towards introduc- 1994)and Ringo( Shardanand Maes, 1995). Typically, ing hybridization at four different levels, namely feature- CF explores similar users(neighbors), recognizes common- level, model-level, CF algorithm-level, and approach-level alities between the user and his neighbors on the basis of Firstly, we proposed hybrid features that exploit both user their ratings, and then accordingly generates new recom- ratings for high rated items and some content description mendations based on inter-user comparisons(Adomavicius of the items. At the model-level we built a user model from Tuzhilin, 2005; Eirinaki Vazirgiannis, 2003). Further, the set of hybrid features and DMf profile. Hybridization CF and dMf have the unique capacity to identify cross- between model-based and memory-based algorithms of CF genre niches and can entice users to jump outside of the is done at the CF algorithm-level. The user model is used to familiar outside the box'( Burke, 2002) find a set of like-minded users within which a memory- Breese, Heckerman, and Kadie (1998)classified CF based search is carried out. This set is much smaller in size algorithms as either memory-based(Resnick et al., 1994; than the original set, thus making the technique scalable Shardanand Maes, 1995), the entire data is used for rec- Finally, we developed a hybrid fuzzy-genetic Rs by ommendations, or model-based (Shahabi et al., 2001)in employing Ga to evolve appropriate weights for each fea- which a model is derived offline from the data to be used ture of the user model and proposing a novel fuzzy distance for online recommendations. While the former is more metric to match users accurate, its scalability compared to model-based is poor. The contributions of this paper are three-fold: Practically, CF faces two fundamental challenges, namely accuracy and scalability. Although memory-based algo- A novel user model is built that enables hybrid filtering, rithms are simple, provide high accuracy recommend reduces the system complexity and the computational tions, and admit easy addition of new data, but they are time by roughly a factor of six. computationally expensive as the size of the input dataset A novel fuzzy distance function is proposed for users increases. Eventually, the user will leave the Web site before the processing completes. On the other hand, apply- .A hybrid fuzzy-genetic approach to recommender sys ing the model-based algorithm alone on such sparse data tems is developed though reduces the online processing cost, often comes at the cost of recommendation accuracy. One common threat The rest of this paper is organized as follows: some in current RS research is the need to combine recommenda- background on the rs is given in the next section. In Sec tion techniques to achieve peak performance because each tion 3, the proposed user model is presented, while the technique has its own pros and cons fuzzy and hybrid fuzzy-genetic approaches are given in Sec Essentially, Rs keep a profile for each user Without any tion 4. The experimental results of the proposed information about the user, the rs are not able to assist approaches and classical approach are discussed in Section the user. The user profile contains raw information about 5. Finally, in the last section we conclude our work with a the user. The terms user profile and user model are often review of our contributions along with some future used as synonyms. However, recent researchers(Froschl, research directions 2005: Koch, 2000) differentiate between them according to the level of sophistication. The user profile is a collection 2. Background of raw personal information represents preferences, back ground, personal details, and interactions with the system. 2. 1. Recommender systems Depending on the user profile, a user Thus, the user profile is used to retrieve the needed infor Recommender systems have gained an increasing mation to build up a model for the user. Koch(2000) importance since the early work on CF in the mid-1990s describes the user model as the representation of the sys- when researchers started focusing on RS that explicitly tems beliefs about the user and the user profile as a simple rely on the ratings structure(Adomavicius Tuzhilin, 2005). Normally, explicit ratings from users are binary rat Some efforts have been made towards introducing fuzz- ings (like/dislike)or follow a specified numerical scale less in RS. Nasraoui and Petenes(2003)used fuzzy indicating the degree of preference (e.g, 1-bad to pproximate reasoning to develop a general framework MAX-excellent, where MAX is the maximum possible for the recommendation process while Suryavanshi, Shiri, rating for a given system). Usually, one of the following and Mudur(2005)used relational fuzzy subtractive cluster- information filtering techniques is employed to generate ing Shahabi et al.(2001)proposed a Yoda Rs that softly recommendations: classifying active user based on typical patterns of users and then generating soft recommendations for him. The Demographic filtering (DMF): The user will be recom- user profile includes many features that can be described mended items similar to the ones other people with same
The most familiar and most widely implemented filtering is the collaborative filtering. The initial CF was introduced by Tapestry (Goldberg, Nichols, Oki, & Terry, 1992), and was automated by GroupLens (Resnick et al., 1994) and Ringo (Shardanand & Maes, 1995). Typically, CF explores similar users (neighbors), recognizes commonalities between the user and his neighbors on the basis of their ratings, and then accordingly generates new recommendations based on inter-user comparisons (Adomavicius & Tuzhilin, 2005; Eirinaki & Vazirgiannis, 2003). Further, CF and DMF have the unique capacity to identify crossgenre niches and can entice users to jump outside of the familiar ‘outside the box’ (Burke, 2002). Breese, Heckerman, and Kadie (1998) classified CF algorithms as either memory-based (Resnick et al., 1994; Shardanand & Maes, 1995), the entire data is used for recommendations, or model-based (Shahabi et al., 2001) in which a model is derived offline from the data to be used for online recommendations. While the former is more accurate, its scalability compared to model-based is poor. Practically, CF faces two fundamental challenges, namely accuracy and scalability. Although memory-based algorithms are simple, provide high accuracy recommendations, and admit easy addition of new data, but they are computationally expensive as the size of the input dataset increases. Eventually, the user will leave the Web site before the processing completes. On the other hand, applying the model-based algorithm alone on such sparse data, though reduces the online processing cost, often comes at the cost of recommendation accuracy. One common threat in current RS research is the need to combine recommendation techniques to achieve peak performance because each technique has its own pros and cons. Essentially, RS keep a profile for each user. Without any information about the user, the RS are not able to assist the user. The user profile contains raw information about the user. The terms user profile and user model are often used as synonyms. However, recent researchers (Froschl, 2005; Koch, 2000) differentiate between them according to the level of sophistication. The user profile is a collection of raw personal information represents preferences, background, personal details, and interactions with the system. Depending on the user profile, a user can be modeled. Thus, the user profile is used to retrieve the needed information to build up a model for the user. Koch (2000) describes the user model as the representation of the system’s beliefs about the user and the user profile as a simple user model. Some efforts have been made towards introducing fuzziness in RS. Nasraoui and Petenes (2003) used fuzzy approximate reasoning to develop a general framework for the recommendation process while Suryavanshi, Shiri, and Mudur (2005) used relational fuzzy subtractive clustering. Shahabi et al. (2001) proposed a Yoda RS that softly classifying active user based on typical patterns of users and then generating soft recommendations for him. The user profile includes many features that can be described as fuzzy. But at the item level it is difficult to fuzzify the profile because it would require prohibitively large space and long processing time. Our work in this paper is an attempt towards introducing hybridization at four different levels, namely featurelevel, model-level, CF algorithm-level, and approach-level. Firstly, we proposed hybrid features that exploit both user ratings for high rated items and some content descriptions of the items. At the model-level we built a user model from the set of hybrid features and DMF profile. Hybridization between model-based and memory-based algorithms of CF is done at the CF algorithm-level. The user model is used to find a set of like-minded users within which a memorybased search is carried out. This set is much smaller in size than the original set, thus making the technique scalable. Finally, we developed a hybrid fuzzy-genetic RS by employing GA to evolve appropriate weights for each feature of the user model and proposing a novel fuzzy distance metric to match users. The contributions of this paper are three-fold: • A novel user model is built that enables hybrid filtering, reduces the system complexity and the computational time by roughly a factor of six. • A novel fuzzy distance function is proposed for users matching. • A hybrid fuzzy-genetic approach to recommender systems is developed. The rest of this paper is organized as follows: some background on the RS is given in the next section. In Section 3, the proposed user model is presented, while the fuzzy and hybrid fuzzy-genetic approaches are given in Section 4. The experimental results of the proposed approaches and classical approach are discussed in Section 5. Finally, in the last section we conclude our work with a review of our contributions along with some future research directions. 2. Background 2.1. Recommender systems Recommender systems have gained an increasing importance since the early work on CF in the mid-1990s when researchers started focusing on RS that explicitly rely on the ratings structure (Adomavicius & Tuzhilin, 2005). Normally, explicit ratings from users are binary ratings (like/dislike) or follow a specified numerical scale indicating the degree of preference (e.g., 1 – bad to MAX – excellent, where MAX is the maximum possible rating for a given system). Usually, one of the following information filtering techniques is employed to generate recommendations: • Demographic filtering (DMF): The user will be recommended items similar to the ones other people with same M.Y.H. Al-Shamri, K.K. Bharadwaj / Expert Systems with Applications 35 (2008) 1386–1399 1387
M. Y.H. Al-Shamri, KK Bharadwaj Expert Systems with Applications 35(2008)1386-135 demographic preferred. Lifestyle Finder(Krulwich, hybridization methods are described by Adomavicius and 1997)uses demographic groups from marketing research Tuzhilin(2005) and Burke(2002) to suggest a range of products and services Basically to build a rs using any filtering technique, a Content-based filtering(CBF: The user will be recom- model for each user is required which must reflect the user mended items similar to the ones he preferred in the preferences and taste. The key success of any filtering tech- past. Example of such systems is News Weeder(Lang, nique comes from the way it learns user models. The CBF 1995) learns the model from the features describing the items the Collaborative filtering (CF): The user will be recom- user has rated in the past, whereas CF learns the model mended items people with similar tastes and preferences from ratings of the items themselves( Breese et al., 1998; liked in the past. GroupLens(Resnick et al., 1994), Resnick et al., 1994; Vozalis Margaritis, 2003 ) In real MovieLens(Miller et al., 2003), and Ringo( Shardanand life users put some priorities for each feature and these pri- Maes, 1995) are some examples of such systems orities need learning also Ujin and Bentley (2004)used the Hybrid filtering: These techniques combine more than evolutionary search to learn these priorities tailoring the one filtering technique to enhance the performance like recommendation process to the preferences of each individ- Fab(balabanovic Shoham, 1997)and Amazon. com ual user. (Linden et al., 2003). Formally, in CF recommenders, we have a set of users U=u1, u2,.. um rating a set of items S=s1, 52, .. Sn) The main shortcomings of CBF are content limitations, such as books, movies, or CDs. The spaces S and U are only a very shallow analysis of certain kinds of content large and can be very large in some applications. Each use can be supplied, and over-specialization; the user is W; i= 1, 2,., m has rated a subset of items S; Specifically, restricted to see items similar to those already rated by the rating of user ue for item s; j=1, 2,..., n is denoted by him only(Balabanovic Shoham, 1997). These systems rci. All the available ratings are collected in a mxn us are not suitable for dynamic and very large environments, item matrix denoted by R. Four phases are required to where items are millions and inserted in the system fre- form the recommendation task in CF recommenders quently(Adomavicius Tuzhilin, 2005) The great power of CF relative to CBF is its cross-genre (a) Data collection or outside the box'recommendation ability Moreover CF (b) User model formation is completely independent of any machine-readable repre (c) Neighborhood set selection sentation of the items being recommended(Adomavicius (d) Making recommendations tuzhilin, 2005; Burke, 2002). However, CF suffers some weaknesses: new user problem(cold start problem), spar sity, scalability, and loss of neighbor transitivity(Breese 2. 1.1. Data collection et al., 1998; Vozalis Margaritis, 2003 ). Usually each user Generally, three types of data can be collected from rates only a very limited percentage of items, when com- users, demographical data through the registration process, pared to the available total. This leads to sparse user-item explicit ratings for a subset of the available items, and matrices, the set of users cross the set of items, therefore implicit data from the user's online behavior. In order to weak recommendations could be produced because the suc- conduct our experiments we used the original MovieLens cessful neighbors cannot be found dataset(http://www.movielens.umn.edu).Thedatasetcon a scalability problem. Further, relying directly on individ- very good, and 5-excellent numerical scale. Each user has ual items ratings result in a loss of neighbor transitivity. rated at least 20 movies. Simple demographical data such Assume that we have three users ui, uj, and uk. Users ui as age, gender, occupation and zip code are included for and u; correlate highly, also u and uk correlate highly. all users, which are collected when a new user registers Because of transitivity there is a possibility that users u; on the system. The movie title, release date, video release and uk correlate highly too. Such a transitive relationship date, and genre data are given for each movie. The genre is not captured in pure CF, unless users u; and uk have feature specifies if the movie is an action, adventure, ani- rated many common items(Vozalis margaritis, 2003). mation, childrens, comedy, crime, documentary, drama, To remedy the aforementioned weaknesses of CBF and fantasy, film-noir, horror, musical, mystery, romance CF, trust-aware RS(Massa Avesani, 2004 )-recommen- sci-fi, thriller, war, or western. a single movie can belong dations are based only on ratings given by users trusted to more than one genre directly or indirectly by the active user-and hybrid filter ing recommenders( Balabanovic Shoham, 1997: Linden 2.1.2. User model formation et al., 2003; Pazzani, 1999: Shahabi et al., 2001) are pro- The difference between a user profile and a user model posed. A recent comprehensive survey of the state of the lies in the different levels of sophistication. The user profile art in Rs, various limitations of the current generation, is simply a collection of personal information about the the various ways to extend their capabilities, and various user, which can be described as a simple user model
demographic preferred. Lifestyle Finder (Krulwich, 1997) uses demographic groups from marketing research to suggest a range of products and services. • Content-based filtering (CBF): The user will be recommended items similar to the ones he preferred in the past. Example of such systems is NewsWeeder (Lang, 1995). • Collaborative filtering (CF): The user will be recommended items people with similar tastes and preferences liked in the past. GroupLens (Resnick et al., 1994), MovieLens (Miller et al., 2003), and Ringo (Shardanand & Maes, 1995) are some examples of such systems. • Hybrid filtering: These techniques combine more than one filtering technique to enhance the performance like Fab (Balabanovic & Shoham, 1997) and Amazon.com (Linden et al., 2003). The main shortcomings of CBF are content limitations; only a very shallow analysis of certain kinds of content can be supplied, and over-specialization; the user is restricted to see items similar to those already rated by him only (Balabanovic & Shoham, 1997). These systems are not suitable for dynamic and very large environments, where items are millions and inserted in the system frequently (Adomavicius & Tuzhilin, 2005). The great power of CF relative to CBF is its cross-genre or ‘outside the box’ recommendation ability. Moreover CF is completely independent of any machine-readable representation of the items being recommended (Adomavicius & Tuzhilin, 2005; Burke, 2002). However, CF suffers some weaknesses: new user problem (cold start problem), sparsity, scalability, and loss of neighbor transitivity (Breese et al., 1998; Vozalis & Margaritis, 2003). Usually each user rates only a very limited percentage of items, when compared to the available total. This leads to sparse user-item matrices, the set of users cross the set of items, therefore weak recommendations could be produced because the successful neighbors cannot be found. On the other hand, the computational cost of RS grows fast with both the number of users and items giving rise to a scalability problem. Further, relying directly on individual items ratings result in a loss of neighbor transitivity. Assume that we have three users ui, uj, and uk. Users ui and uj correlate highly, also uj and uk correlate highly. Because of transitivity there is a possibility that users ui and uk correlate highly too. Such a transitive relationship is not captured in pure CF, unless users ui and uk have rated many common items (Vozalis & Margaritis, 2003). To remedy the aforementioned weaknesses of CBF and CF, trust-aware RS (Massa & Avesani, 2004) – recommendations are based only on ratings given by users trusted directly or indirectly by the active user – and hybrid filtering recommenders (Balabanovic & Shoham, 1997; Linden et al., 2003; Pazzani, 1999; Shahabi et al., 2001) are proposed. A recent comprehensive survey of the state of the art in RS, various limitations of the current generation, the various ways to extend their capabilities, and various hybridization methods are described by Adomavicius and Tuzhilin (2005) and Burke (2002). Basically to build a RS using any filtering technique, a model for each user is required which must reflect the user preferences and taste. The key success of any filtering technique comes from the way it learns user models. The CBF learns the model from the features describing the items the user has rated in the past, whereas CF learns the model from ratings of the items themselves (Breese et al., 1998; Resnick et al., 1994; Vozalis & Margaritis, 2003). In real life, users put some priorities for each feature and these priorities need learning also. Ujjin and Bentley (2004) used the evolutionary search to learn these priorities tailoring the recommendation process to the preferences of each individual user. Formally, in CF recommenders, we have a set of users U = {u1,u2,...,um} rating a set of items S = {s1,s2,...,sn}, such as books, movies, or CDs. The spaces S and U are large and can be very large in some applications. Each user ui, i = 1, 2,...,m has rated a subset of items Si. Specifically, the rating of user uc for item sj, j = 1, 2,...,n is denoted by rc,j. All the available ratings are collected in a m · n useritem matrix denoted by R. Four phases are required to perform the recommendation task in CF recommenders: (a) Data collection (b) User model formation (c) Neighborhood set selection (d) Making recommendations 2.1.1. Data collection Generally, three types of data can be collected from users, demographical data through the registration process, explicit ratings for a subset of the available items, and implicit data from the user’s online behavior. In order to conduct our experiments we used the original MovieLens dataset (http://www.movielens.umn.edu). The dataset consists of 100,000 ratings, assigned by 943 users on 1682 movies. All ratings follow the 1 – bad, 2 – average, 3 – good, 4 – very good, and 5 – excellent numerical scale. Each user has rated at least 20 movies. Simple demographical data such as age, gender, occupation and zip code are included for all users, which are collected when a new user registers on the system. The movie title, release date, video release date, and genre data are given for each movie. The genre feature specifies if the movie is an action, adventure, animation, children’s, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, or western. A single movie can belong to more than one genre. 2.1.2. User model formation The difference between a user profile and a user model lies in the different levels of sophistication. The user profile is simply a collection of personal information about the user, which can be described as a simple user model. 1388 M.Y.H. Al-Shamri, K.K. Bharadwaj / Expert Systems with Applications 35 (2008) 1386–1399
M. Y.H. Al-Shamri, KK Bharadwaj/ Expert Systems with Applications 35 (2008)1386-1399 Depending on the content and the amount of information tance function gives another way to compute similarity, about the user, which is stored in the user profile, a user can which takes into account multiple features(Uiin Bent- be modeled(Froschl, 2005: Koch, 2000). To be precise, a ley, 2004) user model is a representation of the knowledge and per- sonal characteristics, which the system believes that a user possesses. The profiling information can be elicited from demo- graphical data(e. g, users age, gender, occupation, etc. ) Here xi is the jth feature for the common item Si, N is the user preferences about features of the items(e.g, movie number of features, and ==Sxyl, the cardinality of sxy. title, genre, director, year of release, leading actors, etc. ) Note that a vector of features represents each user there and user ratings on experienced items(e.g, previously seen fore it is written bold in formula(2) ovies)(Massa Avesani, 2004). To build a user model two questions have to be taken into account, what infor- 2.1.4. Making recommendations mation has to be represented in the model and how this In this phase, RS assign a predicted rating to all the information is effectively represented? The first question items seen by the neighborhood set and not by the active is general for all applications while the second is applica- user. The predicted rating, pra.j, indicates the expected tion-dependent(Koch, 2000). An effective, tailored recom- interestingness of the item s; to the user wa, is usually com- mendation depends on the underlying user model, which is puted as an aggregate of the ratings of user's(ua) neighbor in turn used for similarity computations. A representative hood set for the same item user model would appropriately reflect user's tastes, prefer ences and need Most previous work on user model construction relies where C denotes the set of neighbors who have rated item nly on explicit ratings(Breese et al., 1998: Goldberg S;. The most widely used aggregation function is the et al., 1992; Resnick et al., 1994; Shardanand Maes, weighted sum(Adomavicius Tuzhilin, 2005: Breese 1995). However, in real life, the way in which two people et al., 1998; Vozalis Margaritis, 2003 ), which is called are said to be similar is not based solely on whether they also Resnick's prediction formula(Resnick et al., 1994) have close opinions on a specific subject, e.g., movie rat ings, but also on other factors, such as their background pra, =ma+k>d(a,c)x(c-me) (4) and personal details. Therefore, issues such as age, gender, and preferences of movie genres should also be taken into The multiplier k serves as a normalizing factor and is usu- account (Ujin Bentley, 2004) ally selected as k=1/>eld(a, c) and me is the average 2. 1.3. Neighborhood set selection Once user models have been established, the system can match the active user to the available database and a set of neighbors for him needs to be formed and ranked accord ing to a suitable distance function. The size of the neigh- 3. A novel hybrid user model borhood set could be fixed by selecting the top K users or could be variable by selecting the users whose similarity The construction of user models is a key task- the sys value is above a certain threshold(Breese et al., 1998: Voz- tems success will depend to a large extent on the models to alis Margaritis, 2003). Various functions have been used represent the users' actual interests. a pure CF user profile to compute the distance, d(a, c), between users ua and ue in( Goldberg et al., 1992; Resnick et al., 1994)consists of a CF recommenders. The most popular function for mem- vector of items with their ratings continuously augmented ory-based CF is the Pearson correlation coefficient as the user interacts with the system over time. This huge (Resnick et al, 1994), where the distance between two users amount of data needs a very large space and a long pro- is based only on the ratings both users have declared. The cessing time. At query time, searching the entire database Pearson correlation coefficient is given by to find the best set of neighbors is computationally very (rxs -,)(rvs-m, expensive. Eventually, the user will leave the Web site corr(x, y) (1) before the processing completes On the other hand, the actual user preferences cannot always be captured by explicit ratings, some content where Sxy is the set of items rated by both users ux and uy. descriptions of items are required. This problem gets solved Let us call the rs, which uses formula (1) for similarity by hybrid filtering. However, most current hybridization computations as Pearson RS(PRS) methods(Burke, 2002)construct two separate profiles Clearly, formula(1)is not appropriate if other features and implement the online process for each filtering tech are included in the model because it considers only the nique separately. Finally, some merger to give the final common items for both users. The modified Euclidean dis- result is used. What if we construct a model according to
Depending on the content and the amount of information about the user, which is stored in the user profile, a user can be modeled (Froschl, 2005; Koch, 2000). To be precise, a user model is a representation of the knowledge and personal characteristics, which the system believes that a user possesses. The profiling information can be elicited from demographical data (e.g., user’s age, gender, occupation, etc.), user preferences about features of the items (e.g., movie title, genre, director, year of release, leading actors, etc.), and user ratings on experienced items (e.g., previously seen movies) (Massa & Avesani, 2004). To build a user model two questions have to be taken into account, what information has to be represented in the model and how this information is effectively represented? The first question is general for all applications while the second is application-dependent (Koch, 2000). An effective, tailored recommendation depends on the underlying user model, which is in turn used for similarity computations. A representative user model would appropriately reflect user’s tastes, preferences, and needs. Most previous work on user model construction relies only on explicit ratings (Breese et al., 1998; Goldberg et al., 1992; Resnick et al., 1994; Shardanand & Maes, 1995). However, in real life, the way in which two people are said to be similar is not based solely on whether they have close opinions on a specific subject, e.g., movie ratings, but also on other factors, such as their background and personal details. Therefore, issues such as age, gender, and preferences of movie genres should also be taken into account (Ujjin & Bentley, 2004). 2.1.3. Neighborhood set selection Once user models have been established, the system can match the active user to the available database and a set of neighbors for him needs to be formed and ranked according to a suitable distance function. The size of the neighborhood set could be fixed by selecting the top K users or could be variable by selecting the users whose similarity value is above a certain threshold (Breese et al., 1998; Vozalis & Margaritis, 2003). Various functions have been used to compute the distance, d(a, c), between users ua and uc in CF recommenders. The most popular function for memory-based CF is the Pearson correlation coefficient (Resnick et al., 1994), where the distance between two users is based only on the ratings both users have declared. The Pearson correlation coefficient is given by corrðx; yÞ ¼ P s2Sxy ðrx;s mxÞðry;s my Þ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P s2Sxy ðrx;s mxÞ 2P s2Sxy ðry;s my Þ 2 q ; ð1Þ where Sxy is the set of items rated by both users ux and uy. Let us call the RS, which uses formula (1) for similarity computations as Pearson RS (PRS). Clearly, formula (1) is not appropriate if other features are included in the model because it considers only the common items for both users. The modified Euclidean distance function gives another way to compute similarity, which takes into account multiple features (Ujjin & Bentley, 2004) dðx; yÞ ¼ 1 z Xz i¼1 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi XN j¼1 ðxi;j yi;jÞ 2 vuut : ð2Þ Here xi,j is the jth feature for the common item si, N is the number of features, and z = jSxyj, the cardinality of Sxy. Note that a vector of features represents each user therefore it is written bold in formula (2). 2.1.4. Making recommendations In this phase, RS assign a predicted rating to all the items seen by the neighborhood set and not by the active user. The predicted rating, pra,j, indicates the expected interestingness of the item sj to the user ua, is usually computed as an aggregate of the ratings of user’s (ua) neighborhood set for the same item sj pra;j ¼ aggruc2C rc;j; ð3Þ where C denotes the set of neighbors who have rated item sj. The most widely used aggregation function is the weighted sum (Adomavicius & Tuzhilin, 2005; Breese et al., 1998; Vozalis & Margaritis, 2003), which is called also Resnick’s prediction formula (Resnick et al., 1994) pra;j ¼ ma þ k X uc2C dða; cÞðrc;j mcÞ: ð4Þ The multiplier k serves as a normalizing factor and is usually selected as k ¼ 1= P uc2Cjdða; cÞj and mc is the average rating of user uc mc ¼ 1 jScj X s2Sc rc;s: ð5Þ 3. A novel hybrid user model The construction of user models is a key task – the system’s success will depend to a large extent on the models to represent the users’ actual interests. A pure CF user profile (Goldberg et al., 1992; Resnick et al., 1994) consists of a vector of items with their ratings continuously augmented as the user interacts with the system over time. This huge amount of data needs a very large space and a long processing time. At query time, searching the entire database to find the best set of neighbors is computationally very expensive. Eventually, the user will leave the Web site before the processing completes. On the other hand, the actual user preferences cannot always be captured by explicit ratings, some content descriptions of items are required. This problem gets solved by hybrid filtering. However, most current hybridization methods (Burke, 2002) construct two separate profiles and implement the online process for each filtering technique separately. Finally, some merger to give the final result is used. What if we construct a model according to M.Y.H. Al-Shamri, K.K. Bharadwaj / Expert Systems with Applications 35 (2008) 1386–1399 1389
M. Y.H. Al-Shamri, KK Bharadwaj/ Expert Systems with Applications 35(2008)1386-1399 some filtering technique and then apply on the constructed 3. 1. Necessary formulae model another filtering technique? In our approaches only one online filtering process(CF)is needed, while the other The total ratings(TR)of user u; is filtering techniques (CBF and DMF) are used to dense the data by building a compact user model, which is an offline TR(=>ris (6) ocess Further, a sparse user-item matrix causes a scalability Here S; is the set of movies rated by user u The genre rat problem for CF. However, we can overcome this problem ing(GR)(resp. genre frequency(GF) for high rated items to some extent by combining or integrating many informa- of genre G; corresponding to user u; is computed using for- tion sources. In our work, we develop a set of hy brid fea- mula (7)(resp. formula( 8) tures that combines some of the users' and items properties. An example of a hybrid feature(Basu, Hirsh. GR(i,j) Cohen, 1998)would be the set of Movies of the Comedy Genre User x Highly Rated. We derive the hybrid feature GF(i,j)=25(is), kE(3, 4, 5) based on the user's ratings for a set of high rated movies s∈GCS and the content descriptions of the genres corresponding where to this set of movies. The hybrid features are utilized as the basis for formulating a genre interestingness measure S(ris) (GIM). Once we have an appropriate formula for GIM, a user model can be constructed from DMF user profile It is to be noted that only the items having rating r>3 are and GIMs considered, i.e. the items rated as good (3), very good (4), Block diagram of the proposed work is given in Fig. I. or excellent(5). Such items would be called items with high First of all, a hybrid user model is built using the hybrid ratings. Finally, relative genre rating (RGR)(resp. relative features and DMF user profile, thereafter CF recom- genre frequency(RGF)), the ratio of u;'s ratings(resp. fre- mender generates a neighborhood set according to quency) for high rated items of G; to his total ratings(resp model-based CF. Finally the entire database of this set is frequency), is computed as retrieved to recommend items according to memory-based CF. Before going to details, let us present necessary formu- RGR(i, j)=TR( GR(i,j) (10) lae for the Movie Lens dataset and illustrate GIM compu- tations through an example RGF(i,j) GF(i,j TF() where TF(=Si, the cardinality of Si 3.2. Example I CBF USer profile Content For the sake of simplicity, we consider a table(Table 1) Descriptions of only three users who have rated movies belonging to four genres. In columns Gi, i= 1, 2, 3, 4, one(1)indicates the belongingness of a given movie to G, and 0 otherwise Also a non-zero value in the user ratings columns indicates Hybrid Features Table I Data of Example 1 Hybrid User Model Movie Corresponding genres User ratings User.l User.2 User-3 CF Recommender 23456789 0 Set of Neighbors 0 G001111 Database G010010110010 3154000000 003040050 0 Recommendatio o12 1011l0 02002430 Fig. I. Block diagram of the proposed work
some filtering technique and then apply on the constructed model another filtering technique? In our approaches only one online filtering process (CF) is needed, while the other filtering techniques (CBF and DMF) are used to dense the data by building a compact user model, which is an offline process. Further, a sparse user-item matrix causes a scalability problem for CF. However, we can overcome this problem to some extent by combining or integrating many information sources. In our work, we develop a set of hybrid features that combines some of the users’ and items’ properties. An example of a hybrid feature (Basu, Hirsh, & Cohen, 1998) would be the set of Movies of the Comedy Genre User X Highly Rated. We derive the hybrid feature based on the user’s ratings for a set of high rated movies and the content descriptions of the genres corresponding to this set of movies. The hybrid features are utilized as the basis for formulating a genre interestingness measure (GIM). Once we have an appropriate formula for GIM, a user model can be constructed from DMF user profile and GIMs. Block diagram of the proposed work is given in Fig. 1. First of all, a hybrid user model is built using the hybrid features and DMF user profile, thereafter CF recommender generates a neighborhood set according to model-based CF. Finally the entire database of this set is retrieved to recommend items according to memory-based CF. Before going to details, let us present necessary formulae for the MovieLens dataset and illustrate GIM computations through an example. 3.1. Necessary formulae The total ratings (TR) of user ui is TRðiÞ ¼ X s2Si ri;s: ð6Þ Here Si is the set of movies rated by user ui. The genre rating (GR) (resp. genre frequency (GF)) for high rated items of genre Gj corresponding to user ui is computed using formula (7) (resp. formula (8)) GRði;jÞ ¼ X s2GjSi;rP3 ri;s; ð7Þ GFði;jÞ ¼ X s2GjSi dk ðrisÞ; k 2 f3; 4; 5g; ð8Þ where, dk ðri;sÞ ¼ 1; k ¼ ri;s; 0; k 6¼ ri;s: ð9Þ It is to be noted that only the items having rating r P 3 are considered, i.e. the items rated as good (3), very good (4), or excellent (5). Such items would be called items with high ratings. Finally, relative genre rating (RGR) (resp. relative genre frequency (RGF)), the ratio of ui’s ratings (resp. frequency) for high rated items of Gj to his total ratings (resp. frequency), is computed as RGRði;jÞ ¼ GRði;jÞ TRðiÞ ; ð10Þ RGFði;jÞ ¼ GFði;jÞ TFðiÞ ; ð11Þ where TF(i) = jSij, the cardinality of Si. 3.2. Example 1 For the sake of simplicity, we consider a table (Table 1) of only three users who have rated movies belonging to four genres. In columns Gi, i = 1, 2, 3, 4, one (1) indicates the belongingness of a given movie to Gi, and 0 otherwise. Also a non-zero value in the user ratings columns indicates CBF User Profile DMF User Profile Content Descriptions Explicit Ratings Hybrid Features Hybrid User Model CF Recommender Recommendations Set of Neighbors Database Fig. 1. Block diagram of the proposed work. Table 1 Data of Example 1 Movie Corresponding genres User ratings G1 G2 G3 G4 User-1 User-2 User-3 1 1010150 2 1011300 3 1100153 4 0110510 5 011 1 4 0 4 6 0100020 7 1101000 8 001 1 0 0 5 9 0110020 10 1 1 1 0 0 4 1 11 1 1 0 1 0 3 3 12 1 0 0 0 0 0 3 1390 M.Y.H. Al-Shamri, K.K. Bharadwaj / Expert Systems with Applications 35 (2008) 1386–1399