Categorizing User Interests in Recommender Systems Sourav Saha, Sandipan Majumder, Sanjog Ray, and Ambuj Mahanti I IM Calcutta souravso8eiimcal.ac.in,sandipan.compenggagmail.com IIM Indore sanjogr@iimidr ac in Abstract. The traditional method of recommender systems suffers from the Sparsity problem whereby incomplete dataset results in poor recommendations Another issue is the drifting preference, i.e. the change of the user's prefe with time. In this paper, we propose an algorithm that takes minimal inputs to do away with the Sparsity problem and takes the drift into consideration giving more priority to latest data. The streams of elements are decomposed into the orresponding attributes and are classified in a preferential list with tags as Sporadic”,"New”;" Regular”,"Old”and"Past"- each category signifying a hanging preference over the previous respectively. A repeated occurrence of attribute set of interest implies the users preference for such attribute(s) The proposed algorithm is based on drifting preference and has been tested with the Yahoo Webscope R4 dataset. Results have shown that our algorithm have shown significant improvements over the comparable"Sliding Window algorithm Keywords: Drifting Preference, Recommender Systems, Collaborative, Yahoo Movies 1 Introduction Recommender systems are technology based systems that provide personalized recommendations to users. They generate recommendations by profiling each user Profiling is done by observing each users interests, online behavior and transaction history. Recommendations can also take into account opinions and actions of other users with similar tastes. Recommender systems algorithms can be classified into two major categories, namely content based and collaborative filtering based. A third approach called hybrid approach combines both content based and collaborative fil- tering based methods. In content based recommendations, a user is recommended items similar to the items he preferred in the past For content based recommenda- tions, each item in the items domain must be characterized by a set of keywords that describe it. In content-based recommendations, a profile of a user is first created by keyword analysis of the items previously seen and rated by him. On the basis of the This research has been supported by Indian Institute of Management Calcutta(IIM Calcutta), work order no. 019/RP: SIRS//397/2008-09 R. Setchi et al(Eds ) KES 2010, Part Il, LNAI 6277, Pp. 282-291, 2010 e Springer-Verlag Berlin Heidelberg 2010
R. Setchi et al. (Eds.): KES 2010, Part II, LNAI 6277, pp. 282–291, 2010. © Springer-Verlag Berlin Heidelberg 2010 Categorizing User Interests in Recommender Systems* Sourav Saha1 , Sandipan Majumder1 , Sanjog Ray2 , and Ambuj Mahanti1 1 IM Calcutta souravs08@iimcal.ac.in, sandipan.compengg@gmail.com, am@iimcal.ac.in 2 IIM Indore sanjogr@iimidr.ac.in Abstract. The traditional method of recommender systems suffers from the Sparsity problem whereby incomplete dataset results in poor recommendations. Another issue is the drifting preference, i.e. the change of the user’s preference with time. In this paper, we propose an algorithm that takes minimal inputs to do away with the Sparsity problem and takes the drift into consideration giving more priority to latest data. The streams of elements are decomposed into the corresponding attributes and are classified in a preferential list with tags as “Sporadic”, “New”, “Regular”, “Old” and “Past” – each category signifying a changing preference over the previous respectively. A repeated occurrence of attribute set of interest implies the user’s preference for such attribute(s). The proposed algorithm is based on drifting preference and has been tested with the Yahoo Webscope R4 dataset. Results have shown that our algorithm have shown significant improvements over the comparable “Sliding Window” algorithm. Keywords: Drifting Preference, Recommender Systems, Collaborative, Yahoo Movies. 1 Introduction Recommender systems are technology based systems that provide personalized recommendations to users. They generate recommendations by profiling each user. Profiling is done by observing each user’s interests, online behavior and transaction history. Recommendations can also take into account opinions and actions of other users with similar tastes. Recommender systems algorithms can be classified into two major categories, namely content based and collaborative filtering based. A third approach called hybrid approach combines both content based and collaborative filtering based methods. In content based recommendations, a user is recommended items similar to the items he preferred in the past. For content based recommendations, each item in the items domain must be characterized by a set of keywords that describe it. In content-based recommendations, a profile of a user is first created by keyword analysis of the items previously seen and rated by him. On the basis of the * This research has been supported by Indian Institute of Management Calcutta (IIM Calcutta), work order no. 019/RP: SIRS//3397/2008-09
Categorizing User Interests in Recommender Systems 283 profile created, the user is then recommended new items. Content based systems are not very popular as they lag in recommendation quality when the users tastes change Also, in many domains getting high quality content data is a major issue. In collabora tive filtering, the user is recommended items that people with similar tastes and prefe rences liked in the past. This technique mainly relies on explicit ratings given by the user and is the most successful and widely used technique. There are two primary approaches that are used to build collaborative filtering memory based recommender systems: user based collaborative filtering [1] and item based collaborative filtering [3]. In user based collaborative filtering, a neighborhood of k similar users is found for a particular user. Items are then recommended to the user from the set of items seen by its k nearest neighbors but not yet seen by him. In item based collaborative filtering, similarities between the various items are computed. Items are then recom mended to the user from the set of k items that have highest similarity to the items already present in the user's profile One drawback of traditional recommendation systems are that they assume user in- terests to be static, which is not the case as preferences change with time. This change in preference can occur in two ways. The first kind of change that is most widely seen occurs when a user develops a new interest. Some common examples of this kind change are a movie viewer acquiring a recent liking for western movies, a book reader developing a new interest for books by an author or on a particular subject Examples of users developing new interests are prevalent in most domains where recommender systems are extensively used like books, movies, music, research pa pers, television etc. Popular recommender system algorithms, like user based colla borative filtering and item based collaborative filtering algorithms, use only ratings data to generate recommendation. As context data like genre, subject and keywor are not used, the recommendations generated by these systems are not specifically customized for the user's c interests, and as a result there is lag before they can detect change in his preferences. This inability of systems to quickly recognize a shift in interest is also known as the concept drift problem. The second kind of change occurs when user preference for a particular item hanges over time. When asked to re-rate movies, users may re-rate the same item differently. User preferences evolve over time: a user who has rated an item highly five years back may not hold the same opinion about the item currently. Existing algorithms analyze a user's complete movie rating data to create a profile. As equal weights are given to all the items in the user's profile, the profile created may not give the true picture of the user's current preferences. These drawbacks of existing algo- rithms can hamper the effectiveness of the recommender systems, which can lead to missed opportunities to market products to a user and can affect user confidence to- wards the recommender system. Most of the previous research on the problem of concept drift has focused on plying weights to time series data. The weights applied are based on a time decaying function with higher weights applied to new items. Older data points are removed or very low weights are applied to them so as to decrease their influence on recommen- dations generated. This approach can be helpful in solving the problem of concept drift when the second kind of interest change occurs, i.e. when the user's preference for a particular item changes over time and the present data points present a truer picture of the user's interest. But this approach of using only current data points for
Categorizing User Interests in Recommender Systems 283 profile created, the user is then recommended new items. Content based systems are not very popular as they lag in recommendation quality when the user’s tastes change. Also, in many domains getting high quality content data is a major issue. In collaborative filtering, the user is recommended items that people with similar tastes and preferences liked in the past. This technique mainly relies on explicit ratings given by the user and is the most successful and widely used technique. There are two primary approaches that are used to build collaborative filtering memory based recommender systems: user based collaborative filtering [1] and item based collaborative filtering [3]. In user based collaborative filtering, a neighborhood of k similar users is found for a particular user. Items are then recommended to the user from the set of items seen by its k nearest neighbors but not yet seen by him. In item based collaborative filtering, similarities between the various items are computed. Items are then recommended to the user from the set of k items that have highest similarity to the items already present in the user’s profile. One drawback of traditional recommendation systems are that they assume user interests to be static, which is not the case as preferences change with time. This change in preference can occur in two ways. The first kind of change that is most widely seen occurs when a user develops a new interest. Some common examples of this kind change are a movie viewer acquiring a recent liking for western movies, a book reader developing a new interest for books by an author or on a particular subject. Examples of users developing new interests are prevalent in most domains where recommender systems are extensively used like books, movies, music, research papers, television etc. Popular recommender system algorithms, like user based collaborative filtering and item based collaborative filtering algorithms, use only ratings data to generate recommendation. As context data like genre, subject and keywords are not used, the recommendations generated by these systems are not specifically customized for the user’s current interests, and as a result there is lag before they can detect change in his preferences. This inability of systems to quickly recognize a shift in interest is also known as the concept drift problem. The second kind of change occurs when user preference for a particular item changes over time. When asked to re-rate movies, users may re-rate the same item differently. User preferences evolve over time: a user who has rated an item highly five years back may not hold the same opinion about the item currently. Existing algorithms analyze a user’s complete movie rating data to create a profile. As equal weights are given to all the items in the user’s profile, the profile created may not give the true picture of the user’s current preferences. These drawbacks of existing algorithms can hamper the effectiveness of the recommender systems, which can lead to missed opportunities to market products to a user and can affect user confidence towards the recommender system. Most of the previous research on the problem of concept drift has focused on applying weights to time series data. The weights applied are based on a time decaying function with higher weights applied to new items. Older data points are removed or very low weights are applied to them so as to decrease their influence on recommendations generated. This approach can be helpful in solving the problem of concept drift when the second kind of interest change occurs, i.e. when the user’s preference for a particular item changes over time and the present data points present a truer picture of the user’s interest. But this approach of using only current data points for
S. Saha et al generating recommendations further magnifies the problem of data sparsity. While previous research on concept drift in recommender systems have tried to tackle the problem by using weighted data, none of them have tried to use content data like genre, keywords etc. And without using content data, the more prevalent problem of oncept drift due to new interests cannot be solved. In this paper, we propose an approach to categorize user preferences from its trans action history in a recommender system. Our approach can be used by context based collaborative filtering based and hybrid recommender system algorithms to make better predictions. Our approach of categorizing user interests can be used as one component of a hybrid algorithm designed for solving the concept drift problem. In section 2 we describe our categorization approach for user interests. In section 3 we provide an example to further clarify our approach. In section 4 we review previous research done in the area of drifting preferences. Section 5 describes the experimental evaluation. Section 6 has the discussions on the results obtained. Finally, we conclude our paper in section 7. 2 Proposed Model for Categorizing User Interests A user's interests are identified by analyzing the attribute or content descriptor of items associated with his transactions. In movies domain, movie genres like action, drama, comedy, animation, etc, are used to characterize a movie. If a user only se lects movies belonging to drama and action genres then we can say that the user's interests are drama and action. Similarly, in books domain, subject keywords are used to identify interests. While in music domain genres like rock, pop, country music etc are used to identify interests. Our approach is relevant for all domains where items can be differentiated on the basis of content descriptors and can be grouped into cate- gories based on the content descriptors. In this paper, we have used movies domain to illustrate our approach. We use the terminology interests and genres interchangeably In movies domain, the genre of the movie is the most popular content descriptor data that is used to describe movie content. A single movie can have more than one genre associated with it. For example, the movie Casablanca is associated with drama romance and war genres. To elaborate our categorization model, we first describe the different classes we have defined to classify a user set of interests derived from his transactions, and then we explain our design for movement of interests from one class to another for a user 2.1 Interests Categorization In our proposed approach, a user's interests can be categorized into five classes Sporadic: Interests categorized as"Sporadic"category for a user are those attributes that occur once in a while in the user's transactions. Sporadic"interests are initially considered as a deviation from the users regular interests and seen as transient inter ests, as the probability of them occurring again in future transactions is minimal Recommender systems should ideally ignore these interests as ing the reoccur rence of these interests in future transactions is not feasible. A user buying a gif can be seen as an example of"Sporadic"interests. A user who does not like the
284 S. Saha et al. generating recommendations further magnifies the problem of data sparsity. While previous research on concept drift in recommender systems have tried to tackle the problem by using weighted data, none of them have tried to use content data like genre, keywords etc. And without using content data, the more prevalent problem of concept drift due to new interests cannot be solved. In this paper, we propose an approach to categorize user preferences from its transaction history in a recommender system. Our approach can be used by context based, collaborative filtering based and hybrid recommender system algorithms to make better predictions. Our approach of categorizing user interests can be used as one component of a hybrid algorithm designed for solving the concept drift problem. In section 2 we describe our categorization approach for user interests. In section 3 we provide an example to further clarify our approach. In section 4 we review previous research done in the area of drifting preferences. Section 5 describes the experimental evaluation. Section 6 has the discussions on the results obtained. Finally, we conclude our paper in section 7. 2 Proposed Model for Categorizing User Interests A user’s interests are identified by analyzing the attribute or content descriptor of items associated with his transactions. In movies domain, movie genres like action, drama, comedy, animation, etc., are used to characterize a movie. If a user only selects movies belonging to drama and action genres then we can say that the user’s interests are drama and action. Similarly, in books domain, subject keywords are used to identify interests. While in music domain genres like rock, pop, country music etc are used to identify interests. Our approach is relevant for all domains where items can be differentiated on the basis of content descriptors and can be grouped into categories based on the content descriptors. In this paper, we have used movies domain to illustrate our approach. We use the terminology interests and genres interchangeably. In movies domain, the genre of the movie is the most popular content descriptor data that is used to describe movie content. A single movie can have more than one genre associated with it. For example, the movie Casablanca is associated with drama, romance and war genres. To elaborate our categorization model, we first describe the different classes we have defined to classify a user set of interests derived from his transactions, and then we explain our design for movement of interests from one class to another for a user. 2.1 Interests Categorization In our proposed approach, a user’s interests can be categorized into five classes. Sporadic: Interests categorized as “Sporadic” category for a user are those attributes that occur once in a while in the user’s transactions. “Sporadic” interests are initially considered as a deviation from the user’s regular interests and seen as transient interests, as the probability of them occurring again in future transactions is minimal. Recommender systems should ideally ignore these interests as predicting the reoccurrence of these interests in future transactions is not feasible. A user buying a gift can be seen as an example of “Sporadic” interests. A user who does not like the
Categorizing User Interests in Recommender Systems 285 science-fiction genre may buy science-fiction books as gifts for friends or family. This change in user interest should be considered as a passing interest rather than a new interest. Recommender systems that consider sporadic interests as a change in a users taste will end up recommending items that may not be useful to the user. The chal lenge is to identify those sporadic interests that have the potential to become new interests. Interests that were once"Sporadic"but don' t recur in future has to leave the system altogether. If it starts recurring it has to start afresh as "Fresh Interest"(Fig-1) New: New" interests are those interests that are new to the user but have been occur- ring more often in the users recent transactions. For a user, few of his interests cate- radic"interests become categorized as "New"interests after they occur more often in successive transactions. Similarly, non-occurrences of an interest existing in"New"category in successive transactions will result in the interest shift ing back to the ""Sporadic"category of the user. Similarly, interests in the"New category that occur frequently in successive transactions get categorized as"Regular interest for the user. "New"interests are of much importance from a marketer's point of view: as the user is exploring a new area of interest, he is more likely to be influ enced by a recommendation made by the system. Regular: Once the user has subscribed to some particular attributes of interest a con- siderable number of times, such an interest can be put in the" Regular"category. This particular type of interest is assumed to hold good for the future transactions of the user as well and repeated occurrences only reinforce the belief. It needs to be mentioned that for every categories, inclusion of an interest takes place after the cor responding threshold is overcome(similarly, breaching the threshold results in exclu sion and shifting to a lower category) Old: Interests once categorized in the"Regular "Category of a user gets classified in the"Old"Category when they stop occurring in frequent successive transactions. An interest classified as"Old"interests can get reclassified as a"Regular"interest with increased occurrence in recent transactions(and thereby successfully crossing the threshold). "Old"and"New"have nearly similar preferences during prediction. Past: Successive non-occurrence of an interest in the"Old"category results it being shifted to the"Past"category. " Past"categories can have equal or more preferences compared to "Sporadic"category but they definitely have a lower preference com- pared to the“New" category. Interests in the“ Past"category can turn back to“Old” with repeated occurrences or may have to leave the system altogether with repeated non-occurrences. An interest having once left the system if starts recurring once again later, it has to follow the channel of fresh Interest as shown in Fig-l below. An inter est leaving the system from"Sporadic"or"Past"has to start afresh 2. 2 Movement of Interests The designing of the algorithm here takes care of the drifting effect. Repeated occur- rence of interest adds more weightage to that interest. Alternatively, non-occurrence decreases the weightage. The movement of the interest depends on the threshold de- fined for the particular category. With repeated occurrences the interest is moved to the next higher category if it's more than the threshold defined for the next category Similarly, if some interest stops occurring, then its weightage initially remains
Categorizing User Interests in Recommender Systems 285 science-fiction genre may buy science-fiction books as gifts for friends or family. This change in user interest should be considered as a passing interest rather than a new interest. Recommender systems that consider sporadic interests as a change in a user’s taste will end up recommending items that may not be useful to the user. The challenge is to identify those sporadic interests that have the potential to become new interests. Interests that were once “Sporadic” but don’t recur in future has to leave the system altogether. If it starts recurring it has to start afresh as “Fresh Interest” (Fig-1). New: “New” interests are those interests that are new to the user but have been occurring more often in the user’s recent transactions. For a user, few of his interests categorized as “Sporadic” interests become categorized as “New” interests after they occur more often in successive transactions. Similarly, non-occurrences of an interest existing in “New” category in successive transactions will result in the interest shifting back to the “Sporadic” category of the user. Similarly, interests in the “New” category that occur frequently in successive transactions get categorized as “Regular” interest for the user. “New” interests are of much importance from a marketer’s point of view: as the user is exploring a new area of interest, he is more likely to be influenced by a recommendation made by the system. Regular: Once the user has subscribed to some particular attributes of interest a considerable number of times, such an interest can be put in the “Regular” category. This particular type of interest is assumed to hold good for the future transactions of the user as well and repeated occurrences only reinforce the belief. It needs to be mentioned that for every categories, inclusion of an interest takes place after the corresponding threshold is overcome (similarly, breaching the threshold results in exclusion and shifting to a lower category). Old: Interests once categorized in the “Regular” Category of a user gets classified in the “Old” Category when they stop occurring in frequent successive transactions. An interest classified as “Old” interests can get reclassified as a “Regular” interest with increased occurrence in recent transactions (and thereby successfully crossing the threshold). “Old” and “New” have nearly similar preferences during prediction. Past: Successive non-occurrence of an interest in the “Old” category results it being shifted to the “Past” category. “Past” categories can have equal or more preferences compared to “Sporadic” category but they definitely have a lower preference compared to the “New” category. Interests in the “Past” category can turn back to “Old” with repeated occurrences or may have to leave the system altogether with repeated non-occurrences. An interest having once left the system if starts recurring once again later, it has to follow the channel of Fresh Interest as shown in Fig-1 below. An interest leaving the system from “Sporadic” or “Past” has to start afresh. 2.2 Movement of Interests The designing of the algorithm here takes care of the drifting effect. Repeated occurrence of interest adds more weightage to that interest. Alternatively, non-occurrence decreases the weightage. The movement of the interest depends on the threshold defined for the particular category. With repeated occurrences the interest is moved to the next higher category if it’s more than the threshold defined for the next category. Similarly, if some interest stops occurring, then its weightage initially remains
S. Saha et al constant for the next few transactions and then finally stops dropping. Both the decay percentage and the waiting period depend on the preference category to which the interest currently belongs. Thus if the users had an inclination for a given interest ometime back, which has stopped recurring then such preference initially waits for the next few transaction and then starts fading off and finally goes away threshold hreshold Fresh Interests Nev Lost interes old f Lets define the new algorithm now. We assume the data is in the chronological order, grouped by the user. Let there be U1, U2, Ui. UN user available. A particular user, denoted by Ui has moved through mi elements. Thus, the total number of records in the system is given by Xinmi=M= Xj=1Rj Define the thresholds setT=(I.E=(TS ITN ITR], To, TP))for Sporadic", "New","Regular, "Old"and"Past"category respectively, where each of the Ts contain the threshold for both inclusion and exclusion for the corresponding category. Define the weight function f(c) where c is the category in which the genre belongs currently. The h(c, w) is an activation function taking binary values(0 or 1) where w is the pre-defined window size for successive non-occurrence of any given attribute that currently belong to a particular category c(here Sporadic, New, Regular, Old and Past)after which the decaying function gets activated. Hence the decaying function d(c, w) is the product of f(c)h(c, w) Initially, when an attribute comes into as a fresh interest for a given user, it's given some weightage based on the frequency of occurrence. Once the cumulative weights cross the threshold for the starting category, which is"Sporadic", such attribute of interest is tagged as"". From" Sporadic"the attribute can move to"N category and then to the"Regular"category, each time satisfying the threshold for the given category (refer Fig-1). If the attribute stops recurring indicating a loss of inter est or oversaturation, its weight starts decreasing by a certain percentage after a given waiting period(pre-defined window size) based on the category to which it belong currently. The interest moves to the lower categories each time the threshold for a given category is breached and then finally moves out of the system. 3 An Example The following example illustrates our approach more clearly. Let us consider a who rents movies(the element under consideration) in last 10 transactions. The lowing table shows how the attribute of interest(here movie genres) shift across five sets for this user on the basis of his movie selection behavior
286 S. Saha et al. constant for the next few transactions and then finally stops dropping. Both the decay percentage and the waiting period depend on the preference category to which the interest currently belongs. Thus if the users had an inclination for a given interest sometime back, which has stopped recurring then such preference initially waits for the next few transaction and then starts fading off and finally goes away. Fig. 1. Let’s define the new algorithm now. We assume the data is in the chronological order, grouped by the user. Let there be U1, U2, Ui…UN user available. A particular user, denoted by Ui has moved through mi elements. Thus, the total number of records in the system is given by Define the thresholds set T = {{I},{E}} = {{TS}, {TN}, {TR}, {TO}, {TP}} for “Sporadic”, “New”, “Regular”, “Old” and “Past” category respectively, where each of the Ts contain the threshold for both inclusion and exclusion for the corresponding category. Define the weight function f(c) where c is the category in which the genre belongs currently. The h(c,w) is an activation function taking binary values (0 or 1) where w is the pre-defined window size for successive non-occurrence of any given attribute that currently belong to a particular category c (here Sporadic, New, Regular, Old and Past) after which the decaying function gets activated. Hence the decaying function d(c,w) is the product of f(c)*h(c,w). Initially, when an attribute comes into as a fresh interest for a given user, it’s given some weightage based on the frequency of occurrence. Once the cumulative weights cross the threshold for the starting category, which is “Sporadic”, such attribute of interest is tagged as “Sporadic”. From “Sporadic” the attribute can move to “New” category and then to the “Regular” category, each time satisfying the threshold for the given category (refer Fig-1). If the attribute stops recurring indicating a loss of interest or oversaturation, its weight starts decreasing by a certain percentage after a given waiting period (pre-defined window size) based on the category to which it belongs currently. The interest moves to the lower categories each time the threshold for a given category is breached and then finally moves out of the system. 3 An Example The following example illustrates our approach more clearly. Let us consider a user who rents movies (the element under consideration) in last 10 transactions. The following table shows how the attribute of interest (here movie genres) shift across the five sets for this user on the basis of his movie selection behavior. σ ݉݅ ே ୀଵ ൌ ܯ ൌ σ ܴ݆ ܯ ݆ൌͳ Fresh Interests ⎯threshold ⎯ → ⎯⎯ Sporadic ⎯threshold ⎯ → ⎯⎯ New ⎯threshold ⎯ → ⎯⎯ Regular Lost Interests ←⎯ ⎯ threshold ⎯⎯ Past ←⎯ ⎯ threshold ⎯⎯ Old ←⎯ ⎯ threshold ⎯⎯ Regular