Decision Support Systems 51(2011)772-781 Contents lists available at ScienceDirect Decision Support Systems ELSEVIER journalhomepagewww.elsevier.com/locate/dss Collaborative user modeling for enhanced content filtering in recommender systems Heung-Nam Kim a Inay Ha, Kee-Sung Lee Geun-Sik Jo, Abdulmotaleb El-Saddik ol of In formation Technology and niversity of ottawa, 800 King Edward, ottawa, Ontario, KIN 6N5, Canada b School of Computer and Information Eng Inha University, 253 Younghyun-dong, Nam-gu, Incheon(402-751). Korea ARTICLE IN FO A BSTRACT Available online 31 January 2011 Recommender systems, which have emerged in response to the problem of information overload, provide users with recommendations of content suited to their needs. To provide proper recommendations to users. personalized recommender systems require accurate user models of characteristics, preferences and needs. In h to user model Recommender system to users. Our approach first discovers useful and meaningful user patterns, and then enriches the personal model vith collaboration from other similar users. In order to evaluate the performance of our approach, we compare experimental results with those of a probabilistic learming model, a user model based on collaborative filtering pproaches, and a vector space model We present experimental results that show how our model performs better than existing alternatives D 2011 Elsevier B.V. All rights reserved. 1 Introduction Nevertheless, collaborative filtering suffers from a fundame problem, namely the cold start problem, which can be divided into cold The prevalence of Web 2.0 technologies and services enables end- start items and cold start users [25. Several researchers have offered users to be producers as well as consumers of content. Even on a daily proposals dealing with the challenge of addressing this problem basis, an enormous amount of textual content, such as online news, 10, 17, 22, 25]. In a collaborative filtering-based recommender system, research papers, blog articles, and wikis is generated on the Web. It is an item cannot be recommended until a large number of us getting more difficult to make automatic recommendations to a user previously rated it This is known as a cold start item. this related to his/her preferences, not only because of the huge amount of applies to new items generated every few minutes and can be information but also because of the difficulty of automatically grasping alleviated by content-based technology. In the case of domains such his/her interests [7]. Recommender systems, which have emerged in textual documents, content-based filtering has proven to be effective in response to the above challenges, provide users with recommendations of locating textual content relevant to a specific content information need content suited to their needs. There are two widely used approaches 6, 10. However, content-based filtering also encounters limitations for among recommender systems, content-based filtering and collaborative a cold start user, similar to collaborative filtering. A cold start user filtering. The traditional task in collaborative filtering is to predict the describes a new user that joins a recommender system and has utility of a certain item for the target user from the opinions of other presented few opinions (ie, the user has insufficient preference similar users, and thereby make appropriate recommendations 21]. On history). With these situations, the system is generally unable to make the other hand, content-based filtering provides recommendations by high quality recommendations. comparing representations of content contained in an item to those of a We address these issues by introducing a collaborative approach to users interest content, ignoring opinions of other similar users [17] user modeling for enhancing content filtering, Our goal is to build Collaborative filtering has an advantage over content-based filtering in robust user model that can be applied to personalized recommender situations where it is hard to analyze the underlying content, e. g, music, systems. By capturing a users content of interest, we can discover the videos, and photos. Because collaborative filtering process is only based on preference patterns and terms existing in the users content of interest. historical information about whether or not a given target user has In addition to partially overcome the cold start user problem, we previously preferred an item, analysis of the actual content, itself, is not propose an enrichment method of the personal model in collaboration necessarily required. with other similar users This paper presents three specific contributions toward user modeling in recommender systems. First, we propose a new method author.TeL:+16135625800x6248:fax:+16135625664 of building a user model, allowing understanding and filtering of the ni4596@gmaiL com(H-N. Kim), inayeeslab inhaac kr (L Ha). user'sinterests We then present a method of a collaborative enrichment eslab inha ac kr(K-S. Lee), gsjo@inha.ackr(G -S Jo), abed@mcrlab ottawa ca of user interests in dealing with the cold start problem. Second, we propose how the individual model can be applied to personalized 0167-9236/S-see front matter e 2011 Elsevier B V. All rights reserved oi:10.1016/dss201101012
Collaborative user modeling for enhanced content filtering in recommender systems Heung-Nam Kim a, ⁎, Inay Ha b , Kee-Sung Lee b , Geun-Sik Jo b , Abdulmotaleb El-Saddik a a School of Information Technology and Engineering, University of Ottawa, 800 King Edward, Ottawa, Ontario, K1N 6N5, Canada b School of Computer and Information Engineering, Inha University, 253 Younghyun-dong, Nam-gu, Incheon (402–751), Korea article info abstract Available online 31 January 2011 Keywords: Collaborative user modeling Recommender system Personalization Content-based user model Recommender systems, which have emerged in response to the problem of information overload, provide users with recommendations of content suited to their needs. To provide proper recommendations to users, personalized recommender systems require accurate user models of characteristics, preferences and needs. In this study, we propose a collaborative approach to user modeling for enhancing personalized recommendations to users. Our approach first discovers useful and meaningful user patterns, and then enriches the personal model with collaboration from other similar users. In order to evaluate the performance of our approach, we compare experimental results with those of a probabilistic learning model, a user model based on collaborative filtering approaches, and a vector space model. We present experimental results that show how our model performs better than existing alternatives. © 2011 Elsevier B.V. All rights reserved. 1. Introduction The prevalence of Web 2.0 technologies and services enables endusers to be producers as well as consumers of content. Even on a daily basis, an enormous amount of textual content, such as online news, research papers, blog articles, and wikis is generated on the Web. It is getting more difficult to make automatic recommendations to a user related to his/her preferences, not only because of the huge amount of information but also because of the difficulty of automatically grasping his/her interests [7]. Recommender systems, which have emerged in response to the above challenges, provide users with recommendations of content suited to their needs. There are two widely used approaches among recommender systems, content-based filtering and collaborative filtering. The traditional task in collaborative filtering is to predict the utility of a certain item for the target user from the opinions of other similar users, and thereby make appropriate recommendations [21]. On the other hand, content-based filtering provides recommendations by comparing representations of content contained in an item to those of a user's interest content, ignoring opinions of other similar users [17]. Collaborative filtering has an advantage over content-based filtering in situations where it is hard to analyze the underlying content, e.g., music, videos, and photos. Because collaborative filtering process is only based on historical information about whether or not a given target user has previously preferred an item, analysis of the actual content, itself, is not necessarily required. Nevertheless, collaborative filtering suffers from a fundamental problem, namely the cold start problem, which can be divided into cold start items and cold start users [25]. Several researchers have offered proposals dealing with the challenge of addressing this problem [10,17,22,25]. In a collaborative filtering-based recommender system, an item cannot be recommended until a large number of users have previously rated it. This is known as a cold start item. This problem applies to new items generated every few minutes and can be partially alleviated by content-based technology. In the case of domains such as textual documents, content-based filtering has proven to be effective in locating textual content relevant to a specific content information need [6,10]. However, content-based filtering also encounters limitations for a cold start user, similar to collaborative filtering. A cold start user describes a new user that joins a recommender system and has presented few opinions (i.e., the user has insufficient preference history). With these situations, the system is generally unable to make high quality recommendations. We address these issues by introducing a collaborative approach to user modeling for enhancing content filtering. Our goal is to build a robust user model that can be applied to personalized recommender systems. By capturing a user's content of interest, we can discover the preference patterns and terms existing in the user's content of interest. In addition to partially overcome the cold start user problem, we propose an enrichment method of the personal model in collaboration with other similar users. This paper presents three specific contributions toward user modeling in recommender systems. First, we propose a new method of building a user model, allowing understanding and filtering of the user'sinterests.We then present a method of a collaborative enrichment of user interests in dealing with the cold start problem. Second, we propose how the individual model can be applied to personalized Decision Support Systems 51 (2011) 772–781 ⁎ Corresponding author. Tel.: +1 613 562 5800x6248; fax: +1 613 562 5664. E-mail addresses: nami4596@gmail.com (H.-N. Kim), inay@eslab.inha.ac.kr (I. Ha), lks@eslab.inha.ac.kr (K.-S. Lee), gsjo@inha.ac.kr (G.-S. Jo), abed@mcrlab.uottawa.ca (A. El-Saddik). 0167-9236/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2011.01.012 Contents lists available at ScienceDirect Decision Support Systems j o u r n a l h om e p a g e : www. e l s ev i e r. c om / l o c a t e / d s s
H-N. Kim et al / Decision Support Systems 51(2011)772-781 commendations relevant to the users needs. We incorporate Collaborative and content-based filtering methods have unique collaborative characteristics into a content-based approach. Third, we advantages and disadvantages. Therefore, some studies combine these provide detailed experimental evaluations with real datasets and techniques in developing hybrid recommender systems [4]. Berkovsky et investigate how collaborative user models work in terms of improving al. [22] presented a method of user modeling data integration for the the recommendation performance purposes of a specific recommendation task, referred to as the mediation The subsequent sections are organized as follows: Section 2 of a user model. By importing and integrating data collected from other ummarizes previous studies related to user modeling and personalized recommender systems, four types of user model mediation are presented recommendations In Section 3, we describe the notations and method cross-user, cross-item, cross-context, and cross-representation In [3]. the to build the initial user model. We then describe a collaborative same authors presented mediated user models that are transformed from approach for modeling user interests and recommending content in collaborative filtering to content-based recommender systems. In 8.a Section 4. Next, Section 5 describes the implemented system and content-collaborative hybrid recommender system is proposed that interface In Section 6, we present the effectiveness of our approach in exploits WordNet-based user profiles to capture the semantics of user terms of its performance. Finally, conclusions are presented and future interests. Similar to our approach, the authors generated the neighbor work is discussed in Section 7 hood of a user through content-based methods. Melville et al.[1 followed a two-stage approach. First they applied a naive Bayesian classifier as content-based predictor to complete the rating matrix, and 2 Related work then they re-estimated ratings from this full rating matrix by collaborative filtering. CinemaScreen 22] reversed the stages. It executed content-based In personalized recommender systems, two main approaches have filtering on a result set generated through collaborative filtering. been developed: a content-based filtering approach and a collaborative Although the above-mentioned studies combine collaborative and filtering approach. Following the proposal of GroupLens [21]. the first content-based filtering approaches to exploit the benefits of each and system to generate automated recommendations, collaborative filtering lessen the disadvantages, our approach takes a different stance. Differing approaches have seen the widest use in a large number of information from earlier work, we automatically identify meaningful or useful patterns filtering problems relating to such things as movies, books, music, online in building a user model. In addition, rather than utilizing explicit user news, TV programs, and research papers. Despite success and popularity. feedback such as numeric ratings assigned to content, our aim is to build a collaborative filtering encounters several limitations, including the robust user model implicitly inferred by the system from observing user parsity of the data, scalability, the cold start problem, and untrustworthy behavior. Through the identification of useful patterns of a user in users. A number of researchers have addressed these problems using collaboration with other similar users, we discover content relevant to the content-based filtering 4]. user's needs. Content-based filtering methods, which are another well-known technique in recommender systems, have been developed using 3. Building a personal user model learning procedures. These procedures require training data to identify personal preferences (user model)from information objects and their The capability to learn users' preferences is at the heart of a ntent. Webmate tracked documents of interest to the user and personalized recommender system. In order to provide proper exploited the vector space model using the TF-IDF(Term Frequency- recommendations to users, personalized recommender systems require Inverse Document Frequency) method [ 6. Schwab et al 26 explored user models of characteristics, preferences, and needs. This information the use of a classification approach to recommend articles relevant to is typically referred to in the literature as a User Model(UM)3. the user profile, such as NewsDude In News Dude, two types of user Additionally, since every user can have different interests, featur interests are used: short-term and long-term interests. To avoid selection for representing users' interests should be personalized and recommendations of very similar documents, a short-term profile is performed individually for each user [16. In this section, we describe used. For the long-term interests of a user, the probabilities of a our approach to building a personal user model that is driven by the document are calculated using Naive Bayes approach to classify a users content of interest. document as interesting or not. Instead of learning from users' explicit Before going into further detail, the notation and definitions information, PVA 5 learned a user profile implicitly without user required for understanding our approach are introduced. Let C analysis algorithm is employed to present novel information for users by content c is a set of terms, each of which may appear in multiple identifying the novelty of articles in the contexts of articles they content with different weights that quantify the importance of the previously reviewed. Lihua et al 14 proposed a method of modeling term for describing the content. In our study, a weight Wy associated multiple user interests by using a self-organizing map neural network with a pair(t, g)(i.e. a term ty of a content c)is computed by a fairly with a changeable network structure. SitelF[15 proposed using word common type of TF-IDF weighting scheme [23. To build a personal sense-based document representation to build a model of the user's user model, potentially representative of user interests, we initially interests. A filtering procedure was employed to dynamically predi eed some information given by the user, called user feedback. The new documents based on a semantic network st common ways to obtain the feedback is to use information user's interaction [19]. Explicit feed back requires a user to evaluate content and indicate how relevant or interesting specific content is to him /her using like/dislike(a binary scale)or numerical ratings. Even though explicit feedback helps us to capture user preferences After mining user us content of interest, five personalized term patterns are found accurately, there is a serious drawback in that users do not tend to Length provide enough feedback. Users are generally not motivated to [t1,t2,t3 provide their feedback if they do not receive immediate benefits even when they would profit in the long-term 19. Therefore, in our study, we take implicit feedback into consideration in the sense that [t2,t,t4 the system automatically infers the users preferences from the user's 0.32 behaviors 2,7, 19. In general, the preference indicator of implicit
recommendations relevant to the user's needs. We incorporate collaborative characteristics into a content-based approach. Third, we provide detailed experimental evaluations with real datasets and investigate how collaborative user models work in terms of improving the recommendation performance. The subsequent sections are organized as follows: Section 2 summarizes previous studies related to user modeling and personalized recommendations. In Section 3, we describe the notations and method to build the initial user model. We then describe a collaborative approach for modeling user interests and recommending content in Section 4. Next, Section 5 describes the implemented system and interface. In Section 6, we present the effectiveness of our approach in terms of its performance. Finally, conclusions are presented and future work is discussed in Section 7. 2. Related work In personalized recommender systems, two main approaches have been developed: a content-based filtering approach and a collaborative filtering approach. Following the proposal of GroupLens [21], the first system to generate automated recommendations, collaborative filtering approaches have seen the widest use in a large number of information filtering problems relating to such things as movies, books, music, online news, TV programs, and research papers. Despite success and popularity, collaborative filtering encounters several limitations, including the sparsity of the data, scalability, the cold start problem, and untrustworthy users. A number of researchers have addressed these problems using content-based filtering [4]. Content-based filtering methods, which are another well-known technique in recommender systems, have been developed using learning procedures. These procedures require training data to identify personal preferences (user model) from information objects and their content. Webmate tracked documents of interest to the user and exploited the vector space model using the TF-IDF (Term Frequency– Inverse Document Frequency) method [6]. Schwab et al. [26] explored the use of a classification approach to recommend articles relevant to the user profile, such as NewsDude. In NewsDude, two types of user interests are used: short-term and long-term interests. To avoid recommendations of very similar documents, a short-term profile is used. For the long-term interests of a user, the probabilities of a document are calculated using Naïve Bayes approach to classify a document as interesting or not. Instead of learning from users' explicit information, PVA [5] learned a user profile implicitly without user intervention. The user profile is represented as a keyword vector in the form of a hierarchical category structure. In Newsjunkie [11], a noveltyanalysis algorithm is employed to present novel information for users by identifying the novelty of articles in the contexts of articles they previously reviewed. Lihua et al. [14] proposed a method of modeling multiple user interests by using a self-organizing map neural network with a changeable network structure. SiteIF [15] proposed using word sense-based document representation to build a model of the user's interests. A filtering procedure was employed to dynamically predict new documents based on a semantic network. Collaborative and content-based filtering methods have unique advantages and disadvantages. Therefore, some studies combine these techniques in developing hybrid recommender systems [4]. Berkovsky et al. [22] presented a method of user modeling data integration for the purposes of a specific recommendation task, referred to as the mediation of a user model. By importing and integrating data collected from other recommender systems, four types of user model mediation are presented: cross-user, cross-item, cross-context, and cross-representation. In [3], the same authors presented mediated user models that are transformed from collaborative filtering to content-based recommender systems. In [8], a content-collaborative hybrid recommender system is proposed that exploits WordNet-based user profiles to capture the semantics of user interests. Similar to our approach, the authors generated the neighborhood of a user through content-based methods. Melville et al. [17] followed a two-stage approach. First they applied a naive Bayesian classifier as content-based predictor to complete the rating matrix, and then they re-estimated ratings from this full rating matrix by collaborative filtering. CinemaScreen [22] reversed the stages. It executed content-based filtering on a result set generated through collaborative filtering. Although the above-mentioned studies combine collaborative and content-based filtering approaches to exploit the benefits of each and lessen the disadvantages, our approach takes a different stance. Differing from earlier work, we automatically identifymeaningful or useful patterns in building a user model. In addition, rather than utilizing explicit user feedback such as numeric ratings assigned to content, our aim is to build a robust user model implicitly inferred by the system from observing user behavior. Through the identification of useful patterns of a user in collaboration with other similar users, we discover content relevant to the user's needs. 3. Building a personal user model The capability to learn users' preferences is at the heart of a personalized recommender system. In order to provide proper recommendations to users, personalized recommender systems require user models of characteristics, preferences, and needs. This information is typically referred to in the literature as a User Model (UM) [3]. Additionally, since every user can have different interests, feature selection for representing users' interests should be personalized and performed individually for each user [16]. In this section, we describe our approach to building a personal user model that is driven by the user's content of interest. Before going into further detail, the notation and definitions required for understanding our approach are introduced. Let C= {c1, c2, …, cn} be the set of all content, T= {t1, t2, …, tm} be the set of all index terms, and U= {u1, u2, …, ul} be the set of distinct users. The content cj is a set of terms, each of which may appear in multiple content with different weights that quantify the importance of the term for describing the content. In our study, a weight wi,j associated with a pair (ti, cj) (i.e., a term ti of a content cj) is computed by a fairly common type of TF-IDF weighting scheme [23]. To build a personal user model, potentially representative of user interests, we initially need some information given by the user, called user feedback. The most common ways to obtain the feedback is to use information given explicitly or to get information observed implicitly from the user's interaction [19]. Explicit feedback requires a user to evaluate content and indicate how relevant or interesting specific content is to him/her using like/dislike (a binary scale) or numerical ratings. Even though explicit feedback helps us to capture user preferences accurately, there is a serious drawback in that users do not tend to provide enough feedback. Users are generally not motivated to provide their feedback if they do not receive immediate benefits, even when they would profit in the long-term [19]. Therefore, in our study, we take implicit feedback into consideration in the sense that the system automatically infers the user's preferences from the user's behaviors [2,7,19]. In general, the preference indicator of implicit Table 1 After mining user u's content of interest, five personalized term patterns are found. Pattern-id PTP PS Length p1 {t1, t2, t3} 0.56 3 p2 {t1, t2, t3, t4} 0.51 4 p3 {t1, t2, t5} 0.47 3 p4 {t4, t5} 0.41 2 p5 {t2, t3, t4} 0.32 3 H.-N. Kim et al. / Decision Support Systems 51 (2011) 772–781 773
H-N. Kim et aL / Decision Support Systems 51(2011)772-781 edback can be represented as a form of a co-occurrence pair (un, c)), a given pattern Pk E Fu the pattern weight of pk for user u, denoted here u, Uis a user and cE is specific content. The co-occurrence PWu(Pk), is computed by air implies that user un viewed, clicked, collected or bookmarked content c. While implicit feedback on specific content by a user does not necessarily mean that he/ she likes the content, we assume that (3) the co-occurrence pairs of the user are his/her interest content, implicitly where H u is the mean weight for term ti in Iu and is computed as 3. 1. Modeling user interests by text mining Our approach to modeling user interests mainly consists of three Tx2间0 steps: extracting terms, mining frequent patterns, and pruning patterns In this section, we present the steps to initially build a personal user where lu()is the set of interest content for user u containing term ti model in detail and wiy is the weight of term t in content c For any pattern Px in Fu The first step in user modeling is the extraction of the terms from we determine the patterns for which the pattern weight is greater than Interest hat have b reprocessed by removing the minimum pattern weight, min_pw, and model user preferences words aming words After extracting terms. based on the identified patterns, collectively called a Personalized C is represented as a vector of attribute-value pairs Term Pattern. In addition, terms that appear within personalized term as follows. called personalized terms 5={(w)(2m2)…(m,wn) (1) Definition 2(Personalized Term Pattem, PIP) A personalized term patten is defined as a frequent term pattern for where ty is the extracted term in c and wi is the weight of t in c Wi is which the pattern weight is greater than the minimum pattern weight computed by the static TF-iDF term-weighting scheme [23] and min_pw, i.e. Pk E Fu and PW(px)>min_pw. a set of personalized term defined as follows. patterns for user u is denoted as PTPu such that PIPu=((pk, PSu (Pr)) PW(Pk)> min_pw A pk∈F Definition 3(Personalized Term, Pr) where fy is the frequency of occurrence of term ti in content c, n is the a personalized term is a term that occurs within personalized term total number of content pieces in the collections, and n; is the number patterns. The set of personalized terms for user u is denoted as PTu In of content pieces in which term t occurs. The weight indicates the addition, the vector for PTu is represented by PTu=(u, u, H2, u,.. 4, u). importance of a term in representing the content. where t is the total number of personalized terms and A, u is the mean The second step is to mine frequent term patterns from the interest weight for term t, which is computed by Eq (4) content of each user. Since every user has different interests, content Frequent patterns are a set of terms that appear frequently togetherilo used for the mining process must be selected individually for each use The formal description of the model for user u, Mu, is as follows: Mu=(PIPu, PT), where PIPu models the interest patterns(Definition 2) set of a users interest content. For example, if a set of terms and Plu models the interest terms(Definition 3). And the model is [recommendation, collaborative, personalization, filtering) appear stored in a prefix tree structure, which is inspired by a frequent-pattern frequently together in a users set of interest content, the set of those tree(FP-tree)[12 to save memory space, explore relationships of terms is a frequent pattern for the user. In the data mining research terms, and retrieve PTPs having some PTs efficiently literature, frequent pattems are typically defined as patterns that occur For example, if five personalized term patterns are found, as t least as frequently as a predetermined minimum support(min_sup) shown in Table 1, after mining the content of interest for user u, the 12). In our study, we apply the mining process based on the following tree structure of the model for user u is then constructed as follows. assumption: each transaction corresponds to an interest content of a All PTu are stored in the header table and sorted in order of descending user, items in a transaction are terms extracted from the content, and a frequency of terms since there are better chances that more prefix transaction database corresponds to a users set of interest content. terms can be shared [12]. Therefore, if the pattern support of pattern Pk( Definition 1) that is First, we create the root of the tree, labeled with"null For the first composed of at least 1(>2)different terms, is above min_sup, i.e. PSu term pattern, t, t2, t3] is inserted into the tree as a path from the root (Pk)>min_sup, then pattern px is referred to as a frequent term pattern. node, where tz is linked as the child of the root, t, is linked to tz, and t, is We denote a set of frequent term patterns for user u as Fu linked to t]. PS and length of the pattern(PS(p1)=0.56, length=3)are then attached to the last node t,. The nodes linked together in the path Definition 1(Pattem Support, Ps) imply that the nodes(terms)contained in the pattern co-occur frequently in the users interest content. For the second pattern, since Let lu be user us set of interest content and pattern pk=(t, t, tn its term pattern, t, t2, t3, and ta, shares a common prefix(t2, t, and t3) be a set of terms such that Pk T and n 22. A content piece c is said to with the existing path for the first term pattern, a new node ta is created contain pattern Pk if and only if Pk C Pattern support for pattern Pk in lu. and linked as a child of node t3. Thereafter, PS(P2)and length(p2)are written as PSu(Pk), is the ratio of content in Lu that contains pattern pk. attached to the last node ta. The third, fourth, and fifth patterns are That is, Psu(pk)=fu(pk)/ Iu l. where fu(px) indicates the occurrence inserted in a manner similar to the first and second patterns To facilitate frequency of pattern Pk in lu. tree traversal, a header table is built, in which each term points to its occurrence in the tree via a node-link Nodes with the same term-name terns containing unnecessary terms from the set of frequent constructed as shown in Fig. note that the built tree is a compact data term patterns. To this end, we define the importance of each term in structure for representing the whole interest patterns and terms of user representing a certain pattern, called the pattern weight. Formally, for u by sharing personalized terms in the personalized patterns
feedback can be represented as a form of a co-occurrence pair (uh, cj), where uh∈U is a user and cj∈C is specific content. The co-occurrence pair implies that user uh viewed, clicked, collected, or bookmarked content cj. While implicit feedback on specific content by a user does not necessarily mean that he/she likes the content, we assume that the co-occurrence pairs of the user are his/her interest content, implicitly. 3.1. Modeling user interests by text mining Our approach to modeling user interests mainly consists of three steps: extracting terms, mining frequent patterns, and pruning patterns. In this section, we present the steps to initially build a personal user model in detail. The first step in user modeling is the extraction of the terms from interest content that have been preprocessed by removing stop words and stemming words [20]. After extracting terms, each interest content cj is represented as a vector of attribute-value pairs as follows: cj= t1; j; w1; j ; t2; j; w2;j ;:::; tm; j; wm; j n o ð1Þ where ti,j is the extracted term in cj and wi,j is the weight of ti in cj. wi,j is computed by the static TF-IDF term-weighting scheme [23] and defined as follows: wi; j= fi; j maxl fl; j × log n ni ð2Þ where fi,j is the frequency of occurrence of term ti in content cj, n is the total number of content pieces in the collections, and ni is the number of content pieces in which term ti occurs. The weight indicates the importance of a term in representing the content. The second step is to mine frequent term patterns from the interest content of each user. Since every user has different interests, content used for the mining process must be selected individually for each user. Frequent patterns are a set of terms that appear frequently together in a set of a user's interest content. For example, if a set of terms {recommendation, collaborative, personalization, filtering} appear frequently together in a user's set of interest content, the set of those terms is a frequent pattern for the user. In the data mining research literature, frequent patterns are typically defined as patterns that occur at least as frequently as a predetermined minimum support (min_sup) [12]. In our study, we apply the mining process based on the following assumption: each transaction corresponds to an interest content of a user, items in a transaction are terms extracted from the content, and a transaction database corresponds to a user's set of interest content. Therefore, if the pattern support of pattern pk (Definition 1) that is composed of at least l (l≥2) different terms, is above min_sup, i.e., PSu (pk)Nmin_sup, then pattern pk is referred to as a frequent term pattern. We denote a set of frequent term patterns for user u as Fu. Definition 1 (Pattern Support, PS) Let Ιu be user u's set of interest content and pattern pk= {t1, t2, ..., tn} be a set of terms such that pk T and n≥2. A content piece cj is said to contain pattern pk if and only if pk cj. Pattern support for pattern pk in Ιu, written as PSu(pk), is the ratio of content in Ιu that contains pattern pk. That is, PSu(pk)=fu(pk)/| Ιu |, where fu(pk) indicates the occurrence frequency of pattern pk in Ιu. Once the frequent patterns are mined, in the third step we remove the patterns containing unnecessary terms from the set of frequent term patterns. To this end, we define the importance of each term in representing a certain pattern, called the pattern weight. Formally, for a given pattern pk ∈ Fu, the pattern weight of pk for user u, denoted as PWu(pk), is computed by: PWu pk ð Þ= 1 jpk j ⋅∑i∈pk μi;u ð3Þ where μi,u is the mean weight for term ti in Ιu and is computed as follows: μi;u= 1 jΙuð Þj i ×∑j∈Iuð Þi wi; j ð4Þ where Ιu(i) is the set of interest content for user u containing term ti and wi,j is the weight of term ti in content cj. For any pattern pk in Fu, we determine the patterns for which the pattern weight is greater than the minimum pattern weight, min_pw, and model user preferences based on the identified patterns, collectively called a Personalized Term Pattern. In addition, terms that appear within personalized term patterns are called Personalized Terms. Definition 2 (Personalized Term Pattern, PTP) A personalized term pattern is defined as a frequent term pattern for which the pattern weight is greater than the minimum pattern weight min_pw, i.e., pk ∈ Fu and PWu(pk)Nmin_pw. A set of personalized term patterns for user u is denoted as PTPu such that PTPu= {(pk, PSu(pk))| PWu(pk)Nmin_pw ∧ pk ∈ Fu}. Definition 3 (Personalized Term, PT) A personalized term is a term that occurs within personalized term patterns. The set of personalized terms for user u is denoted as PTu. In addition, the vector for PTu is represented by PTu → = (μ1,u, μ2,u, …, μt,u), where t is the total number of personalized terms and μi,u is the mean weight for term ti, which is computed by Eq. (4). The formal description of the model for user u, Mu, is as follows: Mu=〈PTPu, PTu〉, where PTPu models the interest patterns (Definition 2) and PTu models the interest terms (Definition 3). And the model is stored in a prefix tree structure, which is inspired by a frequent-pattern tree (FP-tree) [12], to save memory space, explore relationships of terms, and retrieve PTPs having some PTs efficiently. For example, if five personalized term patterns are found, as shown in Table 1, after mining the content of interest for user u, the tree structure of the model for user u is then constructed as follows. All PTu are stored in the header table and sorted in order of descending frequency of terms since there are better chances that more prefix terms can be shared [12]. First, we create the root of the tree, labeled with “null”. For the first term pattern, {t1, t2, t3} is inserted into the tree as a path from the root node, where t2 is linked as the child of the root, t1 is linked to t2, and t3 is linked to t1. PS and length of the pattern (PS(p1)=0.56, length=3) are then attached to the last node t3. The nodes linked together in the path imply that the nodes (terms) contained in the pattern co-occur frequently in the user's interest content. For the second pattern, since its term pattern, {t1, t2, t3, and t4}, shares a common prefix {t2, t1, and t3} with the existing path for the first term pattern, a new node t4 is created and linked as a child of node t3. Thereafter, PS(p2) and length(p2) are attached to the last node t4. The third, fourth, and fifth patterns are inserted in a manner similar to the first and second patterns. To facilitate tree traversal, a header table is built, in which each term points to its occurrence in the tree via a node-link. Nodes with the same term-name are linked in sequence via such node-links. Finally, the model for user u is constructed as shown in Fig. 1. Note that the built tree is a compact data structure for representing the whole interest patterns and terms of user u by sharing personalized terms in the personalized patterns. 774 H.-N. Kim et al. / Decision Support Systems 51 (2011) 772–781
H-N. Kim et al / Decision Support Systems 51(2011)772-781 4. 2. Collaborative enrichment of user interests Header Table Once we have identified the set of the nearest neighbors for a certain user u, his/her initial model Mu=(PIP. PTu) is enriched from the eighbors. The basic idea of enriching the model of the user u starts from assuming that the user is likely to prefer similar patterns that have been 03 04121 discovered from the neighbors with similar tastes. The patterns discovered from more similar users contribute more to enriching the model of the target user. For example, if the pattern, such as 32,3 (personalization, recommender), frequently appears in interest content of a user, he/she might also be interested in the pattern, such as [personalization, recommender, collaborative, filtering that frequently 0514 appears in interest content of users similar to him/her. This enrichment process is particularly effective to some users who do not contain Fig. 1. A tree structure of Mu for personalized term patterns in Table 1 interest terms and patterns in their user model, such as the cold start users 4. Collaborative user modeling for content filtering We elaborate on the general idea of the enrichment process in the following. Let N(u)=v,, V2,, Vk) be a sorted neighbor list of target In this section we describe how to enrich the model for a specific user u, PTPu be a set of personalized term patterns for user u, and PTPv. user. The model M, described in Section 3 is referred to the initial user VEN(u), be a set of personalized term patterns for neighbor user vof model for user u. This model can be applied immediately to generate user u. Firstly, we choose neighbor user v in descending order of content recommendations. However, diverse pattens for user u cannot similarity between target user u and neighbors. For each pattern p in PIPu, specific patterns of pi in PIPy are identified. this situation, initial personalized term patterns may not be sufficient to only if pi is a subset of p), Le..CPy On the contrary, p is said to be a represent user preferences, and thus our approach is generally unable to specific pattern of p. For example, letp:=(ta and ts)be the personalized make high quality recommendations. In addition, when we only use the terms pattern for user u such that pIE PIPu, and PTP, =(Pz, P3, Pa, and ps) itial model for recommendations, it is hard to recommend to the user be the set of PTPs for user v such that p2=(t2, ta, and ts). p3=(ta. novel content of value aside from the usual set. For the above reasons, ts), p4=(tz, L4. ts, and tz), and ps=(t, and ts)as shown in Fig. 2 we propose an enrichment method of the user model via personalized they are said to be a specific pattern. Several specific patterns thatoN Since pattern P2, P3, and pa contain the entire terms of pattern F 4. 1. Content-based neighborhood formatio than that of the general pattern. Assume that the pattern support for pr p2, Pa, and pa is 0.41, 0.5, 0.47, and 0.35, respectively (ie PSu(P,)=0.41 The main goal of neighborhood formation is to identify a set of user PSp2)=0.5. Psp3)=0.47, and PSM(p4)=0.35). In this case, only neighbors, k nearest neighbors, which is defined as a group of users pattern pz and p3 is used for enriching the model of user u if they are not exhibiting interest terms similar to those of the target user. A typical PIPs for user u, as can be seen in Fig. 3. Patterns such as p2 and p3 are collaborative filtering recommender system encounters serious limita- called Collaborative Term Patterns(CTPs)for target user u. An enriched ons for finding a set of users, namely the sparsity problem [8, 17). The model for user u by neighbor user v is built, as shown in Fig. 4. sparsity problem occurs when available data is insufficient to identify Finally, a set of collaborative patterns is identified from k nearest practice, even when users are very active, the result of ale content In neighbors, with respect to target user u. Note that the collaborative term for the target user is not allowed to be redundant. That is, only a small proportion of the total number of content. Accordingly it is if the same patterns that were previously enriched by neighbor v are similarity cannot be computed. Even when the computation of similarity for v+h, those patterns are pruned is possible, it may not be very reliable, because insufficient information is The enriched model for user u is defined as a triple Mu=(PTPu, CTP rocessed.To this end, in our study, we select the best neighbors by using PT.where PTPu is the set of personalized term patterns for user u, CTPu the personalized terms, PT, of each user. In order to find k nearest ighbors, the cosine similarity, which quantifies the similarity of a ectors according to their angle is employed to measure the similarity alues between a target user and every other user. As noted in Definition 3, the personalized terms of a pair of users, u and v, are represented as Header Table t-dimensional vectors, PT and PTy respectively. Therefore, the similarity between a pair of users, u and v is measured by Eq (5). ta ts y=(所)=后二此二 The similarity score between a pair of users is in the range[0, 1 053 0473 and the higher a users score, the more similar he/she is to the target user. After computing the all-to-all similarity between users, we define the set of nearest neighbors of each user u as an ordered list of k 035,4 sers N(u)=vi, V2,, Vk) such that uEEN(u), and sim(u, vi)is the maximum, sim(u, z)is the next maximum etc. 24 Fig. 2. Initial model for user v who is a neighbor of target user u
4. Collaborative user modeling for content filtering In this section we describe how to enrich the model for a specific user. The model Mu described in Section 3 is referred to the initial user model for user u. This model can be applied immediately to generate content recommendations. However, diverse patterns for user u cannot be discovered via the mining process in the case where the user has a small number of interest content. This is known as a cold start user.With this situation, initial personalized term patterns may not be sufficient to represent user preferences, and thus our approach is generally unable to make high quality recommendations. In addition, when we only use the initial model for recommendations, it is hard to recommend to the user novel content of value aside from the usual set. For the above reasons, we propose an enrichment method of the user model via personalized term patterns of like-minded users. 4.1. Content-based neighborhood formation The main goal of neighborhood formation is to identify a set of user neighbors, k nearest neighbors, which is defined as a group of users exhibiting interest terms similar to those of the target user. A typical collaborative filtering recommender system encounters serious limitations for finding a set of users, namely the sparsity problem [8,17]. The sparsity problem occurs when available data is insufficient to identify similar users (neighbors) due to the immense amount of content. In practice, even when users are very active, the result of rated content is only a small proportion of the total number of content. Accordingly, it is often the case that a pair of users has nothing in common, and hence the similarity cannot be computed. Even when the computation of similarity is possible, it may not be very reliable, because insufficient information is processed. To this end, in our study, we select the best neighbors by using the personalized terms, PT, of each user. In order to find k nearest neighbors, the cosine similarity, which quantifies the similarity of a pair of vectors according to their angle, is employed to measure the similarity values between a target user and every other user. As noted in Definition 3, the personalized terms of a pair of users, u and v, are represented as t-dimensional vectors, PTu → and PTv → respectively. Therefore, the similarity between a pair of users, u and v is measured by Eq. (5). sim uð Þ ; v = cos PTu → ; PTv → = ∑t k = 1μk;u ×μk;v ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ∑t k= 1μ2 k;u q × ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ∑t k= 1μ2 k;v q ð5Þ The similarity score between a pair of users is in the range [0, 1] and the higher a user's score, the more similar he/she is to the target user. After computing the all-to-all similarity between users, we define the set of nearest neighbors of each user u as an ordered list of k users Ν(u)={v1, v2,…, vk} such that u∉Ν(u), and sim(u,v1) is the maximum, sim(u,v2) is the next maximum etc. [24]. 4.2. Collaborative enrichment of user interests Once we have identified the set of the nearest neighbors for a certain user u, his/her initial model Mu=〈PTPu, PTu〉 is enriched from the neighbors. The basic idea of enriching the model of the user u starts from assuming that the user is likely to prefer similar patterns that have been discovered from the neighbors with similar tastes. The patterns discovered from more similar users contribute more to enriching the model of the target user. For example, if the pattern, such as {personalization, recommender}, frequently appears in interest content of a user, he/she might also be interested in the pattern, such as {personalization, recommender, collaborative, filtering}, that frequently appears in interest content of users similar to him/her. This enrichment process is particularly effective to some users who do not contain interest terms and patterns in their user model, such as the cold start users. We elaborate on the general idea of the enrichment process in the following. Let Ν(u)={v1,v2,…,vk} be a sorted neighbor list of target user u, PTPu be a set of personalized term patterns for user u, and PTPv, v ∈ Ν(u), be a set of personalized term patterns for neighbor user v of user u. Firstly, we choose neighbor user v in descending order of similarity between target user u and neighbors. For each pattern pi in PTPu, specific patterns of pi in PTPv are identified. Given two patterns pi and pj, pi is said to be a general pattern of pj if and only if pi is a subset of pj, i.e., pi⊂pj. On the contrary, pj is said to be a specific pattern of pi. For example, let p1= {t4, and t5} be the personalized terms pattern for user u such that p1∈ PTPu, and PTPv= {p2, p3, p4, and p5} be the set of PTPs for user v such that p2= {t2, t4, and t5}, p3= {t4, t5, and t8}, p4= {t2, t4, t5, and t7}, and p5= {t7, and t8} as shown in Fig. 2. Since pattern p2, p3, and p4 contain the entire terms of pattern p1, they are said to be a specific pattern. Several specific patterns that occur in the PTPs of neighbor v, PTPv, may be found. For efficient enrichment, we only consider specific patterns which have higher pattern support than that of the general pattern. Assume that the pattern support for p1, p2, p3, and p4 is 0.41, 0.5, 0.47, and 0.35, respectively (i.e., PSu(p1)=0.41, PSv(p2)=0.5, PSv(p3)=0.47, and PSv(p4)=0.35). In this case, only pattern p2 and p3 is used for enriching the model of user u if they are not PTPs for user u, as can be seen in Fig. 3. Patterns such as p2 and p3 are called Collaborative Term Patterns (CTPs) for target user u. An enriched model for user u by neighbor user v is built, as shown in Fig. 4. Finally, a set of collaborative patterns is identified from k nearest neighbors, with respect to target user u. Note that the collaborative term pattern for the target user is not allowed to be redundant. That is, if the same patterns that were previously enriched by neighbor v are also discovered from another neighbor h such that sim(u,v)≥sim(u,h), for v≠h, those patterns are pruned. The enriched model for user u is defined as a triple M+u=〈PTPu, CTPu, PT+u〉 where PTPu is the set of personalized term patterns for user u, CTPu Root 0.56, 3 0.47, 3 0.51, 4 0.41, 2 PT Node -links t 2 t 2 t 1 t 3 t 4 t 5 Header Table 0.8 0.6 0.7 0.3 0.4 µ 0.32, 3 t 1 t 3 t 3 t 4 t 4 t 4 t 5 t 5 Fig. 1. A tree structure of Mu for personalized term patterns in Table 1. Root 0.5, 3 0.47, 3 0.35, 4 0.31, 2 PT Node -links t 4 t 4 t 5 t 5 t 2 t 2 t 7 t 7 t 7 t 8 t 8 t 8 Header Table 0.6 0.7 0.4 0.3 0.5 µ Fig. 2. Initial model for user v who is a neighbor of target user u. H.-N. Kim et al. / Decision Support Systems 51 (2011) 772–781 775
776 H-N. Kim et aL / Decision Support Systems 51(2011)772-781 target( where Nu is the total number of patterns in both PTPu and CTPu, and BA is binary variable for determining whether or not pattern pk occurs in content cn. That is, BR* is 1 if pattern Pk appears in content cn and o otherwise, and op represents the weighted pattern support of pk for user u, which is given by: 0412 0.41 ={ PS(Pk xsim(u, v) if p=Cmp∈PP, 0.47,3 053 0473 The main concept of prediction dictates that interest patterns in the Fig. 3. Specific patterns, general pattern, and enriched patterns. model of the target user are a good estimate of the preference for the selected content. The more the content contains the patterns in the model, the higher rank the content obtains This scheme can also make recommendations for new content added regularly to the system. known as the new item problem in collaborative filtering 1, as well as is the set of collaborative term patterns for user u, and PTu is the set of support serendipitous recommendations [13]. Recommender systems interest terms that occur within either the personalized patterns or the relying exclusively on a user's interest content can only recommend collaborative patterns, respectively. In the enriched model M*u PIPu content highly related to that which the user has previously selected. It models the interest patterns of user u whereas CIPu models the enriched is hard to recommend novel content that are different from anything the interest patterns by the neighbors of user u. user has previously read before. This is known as the problem with overspecialization [1, 22). In our approach, by utilizing the enriched ( Collaborative Term Pattern, CIP) patterns from neighbors with similar tastes, we can make content to be a Let pi be a personalized term pattern for target user u, P EPTPu, and higher rank in the recommended set that the content contains the p be a personalized term pattern for neighbor v such that PE PIPw, and collaborative(enriched) patterns valuable to the target user, even E N(u). We define the set of collaborative term patterns for user u, though the patterns are not directly discovered from the user's interest denoted as CTPu. as the set of neighbor patterns p such that P: Cpi. content. pyEPTPu, and PSu(P)≤PSP) Once the content predictions about the target user, which the user has not previously read, are computed, the content are sorted in order of descending predicted value Pun. Finally, the set of N ordered content elements with the highest values are identified for user u. This After the model is enriched, we are ready to provide recommenda- is the set of content recommended to user u(top-N recommendation) tions for new content that a user has not previously read. Based on the niched model for each user, we recommend to the user the top-N ranked content that he/she might be interested in reading. Definition 5(Top-N recommendation) the mostimportant task in personalized recommendation is Let c be the set of all content, Xu be the content list that user u has a prediction, that is, speculation about how much a certain reviously collected or added to his preference list(interest content) prefer unseen content. In our study, we consider matched patterns, that and Yu be the content list not previously read by user u, Yu=C-Xu and is, how many interest patterns in a user model are contained in the new Xun Yu=0. Given a pair of content elements c and c, C andc EYu. ontent. Formally, the numeric score of the target user u for the content content c will be of more interest to user u than content c if and only if Cn, denoted as Pun, is obtained as follows the prediction score Pui of the target user u for the content ci is higher than that of content c Pui>Puf Top-N recommendations for user u ∑ PrE(PIPyUCIP)B∑D=Pmm) lPkIxoP x Ba ∑=uCm)o (6) identifies an ordered set of N content, TopNu, that will be of interest to user u such that TopN|≤N,TopN∩=, and TopNu Y Header Table 0.41,2 0804.33[0531043 Fig. 4. Enriched user u model, Mw by neighbor user v
is the set of collaborative term patterns for user u, and PT+u is the set of interest terms that occur within either the personalized patterns or the collaborative patterns, respectively. In the enriched model M+u, PTPu models the interest patterns of user u whereas CTPu models the enriched interest patterns by the neighbors of user u. Definition 4 (Collaborative Term Pattern, CTP) Let pi be a personalized term pattern for target user u, pi∈PTPu, and pj be a personalized term pattern for neighbor v such that pj∈PTPv, and v ∈ Ν(u). We define the set of collaborative term patterns for user u, denoted as CTPu, as the set of neighbor patterns pj such that pi⊂pj, pj∉PTPu, and PSu(pi)≤PSv(pj). 4.3. Personalized content recommendation After the model is enriched, we are ready to provide recommendations for new content that a user has not previously read. Based on the enriched model for each user, we recommend to the user the top-N ranked content that he/she might be interested in reading. To this end, the most important task in personalized recommendation is to generate a prediction, that is, speculation about how much a certain user would prefer unseen content. In our study, we consider matched patterns, that is, how many interest patterns in a user model are contained in the new content. Formally, the numeric score of the target user u for the content cn, denoted as Pu,n, is obtained as follows: Pu;n= ∑pk∈ PTPu∪CTPu ð ÞBpk n Nu ⋅ ∑pk∈ PTPu∪CTPu ð Þ jpk j ×ωpk u ×Bpk n ∑Pk∈ PTPu∪CTPu ð Þωpk u ð6Þ where Nu is the total number of patterns in both PTPu and CTPu, and Bn pk is binary variable for determining whether or not pattern pk occurs in content cn. That is, Bn pk is 1 if pattern pk appears in content cn and 0 otherwise, and ωu pk represents the weighted pattern support of pk for user u, which is given by: ωpk u = PSu pk ð Þ if pk∈PTPu PSv pk ð Þ×sim uð Þ ; v if pk∈CTPu; pk∈PTPv ð7Þ The main concept of prediction dictates that interest patterns in the model of the target user are a good estimate of the preference for the selected content. The more the content contains the patterns in the model, the higher rank the content obtains. This scheme can also make recommendations for new content added regularly to the system, known as the new item problem in collaborative filtering [1], as well as support serendipitous recommendations [13]. Recommender systems relying exclusively on a user's interest content can only recommend content highly related to that which the user has previously selected. It is hard to recommend novel content that are different from anything the user has previously read before. This is known as the problem with overspecialization [1,22]. In our approach, by utilizing the enriched patterns from neighbors with similar tastes, we can make content to be a higher rank in the recommended set that the content contains the collaborative (enriched) patterns valuable to the target user, even though the patterns are not directly discovered from the user's interest content. Once the content predictions about the target user, which the user has not previously read, are computed, the content are sorted in order of descending predicted value Pu,n. Finally, the set of N ordered content elements with the highest values are identified for user u. This is the set of content recommended to user u (top-N recommendation). Definition 5 (Top-N recommendation) Let C be the set of all content, Xu be the content list that user u has previously collected or added to his preference list (interest content), and Yu be the content list not previously read by user u, Yu=C−Xu and Xu ∩ Yu=∅. Given a pair of content elements ci and cj, ci ∈Yu and cj ∈Yu, content ci will be of more interest to user u than content cj if and only if the prediction score Pu,i of the target user u for the content ci is higher than that of content cj, Pu,iNPu,j. Top-N recommendations for user u identifies an ordered set of N content, TopNu, that will be of interest to user u such that |TopNu|≤N, TopNu ∩ Xu=∅, and TopNu Yu. Root 0.5, 3 0.47, 3 0.35, 4 0.41, 2 Root 0.5, 3 0.47, 3 Root 0.41, 2 Specific patterns in neighbor user General pattern in target user Enriched patterns for target user t 4 t 5 t 4 t 5 t 4 t 5 t 2 t 2 t 7 t 8 t 8 Fig. 3. Specific patterns, general pattern, and enriched patterns. 0.56, 3 0.47, 3 0.51, 4 0.41, 2 0.32, 3 0.5, 3 0.47, 3 Root PT Node -links Header Table t 2 t 2 t 2 t 1 t 1 t 3 t 3 t 3 t 4 t 4 t 4 t 4 t 5 t 5 t 5 0.8 0.6 0.7 0.3 0.4 t 8 t 8 0.5 µ Fig. 4. Enriched user u model, M+u, by neighbor user v. 776 H.-N. Kim et al. / Decision Support Systems 51 (2011) 772–781