Constructing User Profiles for Collaborative Recommender System

Qing Li and Byeong Man Kim
Department of Computer Science, Kumoh National Institute of Technology, Korea
{liqing,bmkim}@se.kumoh.ac.kr

Abstract. With the development of e-commerce and information access, recommender systems have become a popular technique to prune large information spaces so that users are directed toward those items that best meet their needs and preferences. In this paper,1 a clustering technique is applied in the collaborative recommender framework to take into account the semantic contents available from user profiles. We also suggest methods to construct user profiles from rating information and the attributes of items so as to accommodate user preferences. Further, we show that the correct application of the semantic content information obtained from user profiles does enhance the effectiveness of collaborative recommendation.

1 This work was supported by grant No. 2000-1-51200-008-2 from the Korea Science and Engineering Foundation.

1 Introduction

Recommender systems produce individualized recommendations as output or have the effect of guiding the user in a personalized way to interesting or useful objects in a large space of possible options. Such systems have an obvious appeal in an environment where the amount of on-line information vastly outstrips any individual's capability to survey it. Recommender systems are now an integral part of some e-commerce sites such as Amazon.com, CDNow and Levis.com.

Recommender systems are characterized by "individualization", which separates them from search engines, whose focus is "matching": a search engine is supposed to return all those items that match the query, ranked by degree of match. Initially, many recommender systems were fairly simple query-based information retrieval systems, which can be called content-based recommender systems. Later, GroupLens [15] and Ringo [18], developed independently, were the first to automate prediction. Collaborative recommender systems collect ratings of items from many individuals and make recommendations based on those ratings to a given user. For example, the MovieLens system recommends movies, and GAB recommends web pages based on bookmarks [19].

Although collaborative filtering has been very successful in both research and practice, it cannot recommend new items to users without any history in the system and completely ignores any information that can be extracted from
semantic contents of items. For this reason, hybrid recommender systems have been proposed, which can exploit both user preferences and semantic contents.

Proposed approaches to hybrid systems, which combine content-based and collaborative filters, can be categorized into three groups. The first is the linear combination of the results of collaborative and content-based filtering, such as the systems described by Claypool [5]. The second group is the sequential combination of content-based filtering and collaborative filtering: a content-based filtering algorithm is first applied to find users who share similar interests, and a collaborative algorithm is then applied to make predictions, as in the RAAP [7] and Fab [1] filtering systems. The last one is the mixed combination, in which both the semantic contents and the ratings are applied to make recommendations, such as the probabilistic model [13] and the Ripper system for recommendation [2].

In this paper we describe a web-based movie recommender, which can recommend movies based not only on the user ratings but also on the significant amount of other information that is often available about the nature of each user. The application of a clustering technique in our approach provides a way to introduce the semantic contents of user profiles into collaborative recommendation. In the following sections, we describe our approach in detail.

2 Our approach

In our approach, we integrate the information of user profiles and user ratings to calculate the user-user similarity. From the user profiles, we get feature-based information about a user (e.g., liked movies with Meg Ryan, with average score 2.45), while from the user ratings we get atomic-item-based rating information (e.g., liked this particular movie with score 3). In the Fab system, feature-based user profiles replace the atomic-item-based rating information. This helps alleviate sparsity, but the abstraction does lose information. In our approach, however, we use both kinds of information - that based on ratings of individual items and that based on features. Figure 1 shows the overview of our approach.

[Figure: components include Rating Data, User Profiles, Profile Group Vector, Clustering Engine, Group Rater, Group Rating Matrix, User Rating Matrix, and Collaborative Filter.]
Fig. 1. Overview of our approach

2.1 User profile creation

User profiles indicate the information needs or preferences on items that users are interested in. A user profile consists of five profile vectors, and each profile
vector represents an aspect (or dimension) of his preferences. In our current system, five aspects - actor, actress, director, genre and synopsis - have been considered. A profile vector consists of several attribute-value pairs, as Figure 2 shows.

To reduce the burden on users, a user can simply indicate interesting movies instead of specifying his preferences explicitly; the profile vectors can then be automatically constructed from the movie description data by the following simple equation:

W_{n,m} = \frac{Num_{\,item \,\subseteq\, attribute_m \text{ of } aspect_n \,\mid\, item > threshold}}{Num_{\,item > threshold}}    (1)

where Num_{item > threshold} is the number of items whose rating value is larger than the threshold, and Num_{item ⊆ attribute_m of aspect_n | item > threshold} is the number of items that contain attribute m in aspect n of the user profile and whose rating is larger than the threshold. In our present experiment, we set the value of the threshold to 3. For example, suppose user Tom rates three movies - You've Got Mail, Sleepless in Seattle, and Swordfish. Among them, only the ratings of You've Got Mail and Sleepless in Seattle are larger than the threshold 3. From the genre information, we know that You've Got Mail and Sleepless in Seattle belong to the love genre, so the genre aspect of Tom's profile is Tom {love (2/2)}.

[Figure: Tom rates Sleepless in Seattle 5, You've Got Mail 5 and Swordfish 2 (three movies in total). Movie records: You've Got Mail (actor Tom Hanks, actress Meg Ryan, director Nora, genre Love, synopsis "Lovers meet online ..."), Sleepless in Seattle (Tom Hanks, Meg Ryan, director Norman Rene, genre Love, synopsis "lover in Seattle ..."), Swordfish (John Travolta, Halle Berry, director Sena, genre Action, synopsis "Terrorist ransom ..."). Resulting profile for Tom - Actor: (Tom Hanks, 1.0); Actress: (Meg Ryan, 1.0); Director: ((Norman, 0.5), (Nora, 0.5)); Genre: (love, 1.0); Synopsis: (lover 0.18, meet 0.48, online 0.48, ...).]
Fig. 2. User Profiles
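As an illustration of Eq. (1), the sketch below builds the attribute-value profile vectors from a user's rated movies. The record layout (dictionaries with actor, actress, director and genre fields) and the function name are assumptions made for the example, not the system's actual data model.

```python
from collections import defaultdict

THRESHOLD = 3  # ratings above this value count as "preferred" (Sec. 2.1)

# Hypothetical movie records; field names and values are illustrative only.
MOVIES = {
    1: {"actor": "Tom Hanks", "actress": "Meg Ryan", "director": "Nora", "genre": "Love"},
    2: {"actor": "Tom Hanks", "actress": "Meg Ryan", "director": "Norman", "genre": "Love"},
    3: {"actor": "John Travolta", "actress": "Halle Berry", "director": "Sena", "genre": "Action"},
}

def build_profile(ratings, movies=MOVIES, threshold=THRESHOLD):
    """Build aspect profile vectors according to Eq. (1).

    ratings: dict mapping movie id -> rating given by the user.
    Returns: dict mapping aspect -> {attribute: weight}.
    """
    preferred = [mid for mid, r in ratings.items() if r > threshold]
    profile = {aspect: defaultdict(float)
               for aspect in ("actor", "actress", "director", "genre")}
    if not preferred:
        return profile
    for mid in preferred:
        for aspect, attribute in movies[mid].items():
            # count preferred items that contain this attribute (numerator of Eq. (1))
            profile[aspect][attribute] += 1.0
    # normalise by the number of preferred items (denominator of Eq. (1))
    for aspect in profile:
        for attribute in profile[aspect]:
            profile[aspect][attribute] /= len(preferred)
    return profile

# Tom's ratings from the example: two "love" movies above the threshold.
tom = build_profile({1: 5, 2: 5, 3: 2})
print(dict(tom["genre"]))   # {'Love': 1.0}
print(dict(tom["actor"]))   # {'Tom Hanks': 1.0}
```

The synopsis aspect is handled differently, using the fuzzy-inference weighting described in the next subsection.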
2.2 Construction of user profiles through Fuzzy Inference

The characteristics of the synopsis aspect are a little different from those of the other aspects, so we take a different approach: a weighted keyword vector is constructed for each user. Firstly, keywords are extracted from the synopsis field of the movie description records that the user shows interest in. Secondly, the weight of each keyword is calculated. Although the tf × idf formula is popularly used in the information retrieval literature [16] for calculating keyword weights, we adopt the method of calculating weights through fuzzy inference from our previous work [3], which shows more effectiveness.

In order to construct the user profiles based on the synopsis, the synopses of the user's preferred movies are first transformed into a set of candidate terms by eliminating stopwords and stemming with Porter's algorithm. The DTF, DF, and IDF of each term are calculated based on this set and used as inputs of fuzzy inference.

The DTF (Distributed Term Frequency) reflects the frequency and distribution of a term in a set of documents (the user's preferred movie synopses); it is calculated by dividing the total occurrences of the term in the set of documents by the number of documents in the set containing the term. It needs to be normalized for use in fuzzy inference. The following shows the normalized distributed term frequency (NDTF):

NDTF_i = \frac{TF_i / DF_i}{\max_j \left[ TF_j / DF_j \right]}    (2)

where TF_i is the frequency of term t_i in the example documents (the user's preferred movie synopses), and DF_i is the number of example documents containing term t_i.

The DF (Document Frequency) represents the number of example documents that contain a specific term. Like DTF, DF provides one measure of how well that term describes the contents of the document set. The normalized document frequency, NDF, is defined in Equation (3), where DF_i is the number of example documents containing term t_i:

NDF_i = \frac{DF_i}{\max_j DF_j}    (3)

The IDF (Inverse Document Frequency) represents the inverse document frequency of a specific term over the entire document collection (all movie synopses in the database), not just the example documents. The motivation for using the IDF factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. The normalized inverse document frequency, NIDF, is defined as follows:

NIDF_i = \frac{IDF_i}{\max_j IDF_j}, \qquad IDF_i = \log \frac{N}{n_i}    (4)

where N is the total number of documents and n_i is the number of documents in which term t_i appears.
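The helper below is a hypothetical sketch of how the NDTF, NDF, and NIDF values of Eqs. (2)-(4) could be computed for a set of preferred synopses against the full synopsis collection. Tokenization is reduced to pre-tokenized word lists here; the paper's pipeline additionally removes stopwords and applies Porter stemming, which are omitted to keep the example short.

```python
import math
from collections import Counter

def term_statistics(preferred_docs, all_docs):
    """Compute NDTF, NDF and NIDF (Eqs. (2)-(4)) for every term in the
    user's preferred synopses.

    preferred_docs: list of token lists (user-preferred synopses).
    all_docs: list of token lists (every synopsis in the database).
    Returns: dict term -> (ndtf, ndf, nidf).
    """
    tf = Counter()   # total occurrences over the preferred synopses
    df = Counter()   # number of preferred synopses containing the term
    for doc in preferred_docs:
        counts = Counter(doc)
        tf.update(counts)
        df.update(counts.keys())
    if not tf:
        return {}

    n_total = len(all_docs)
    coll_df = Counter()              # document frequency over the whole collection
    for doc in all_docs:
        coll_df.update(set(doc))

    dtf = {t: tf[t] / df[t] for t in tf}                                  # TF_i / DF_i
    idf = {t: math.log(n_total / max(coll_df[t], 1)) for t in tf}         # IDF_i = log(N / n_i)

    max_dtf = max(dtf.values())
    max_df = max(df.values())
    max_idf = max(idf.values()) or 1.0   # guard: all-zero IDFs

    return {t: (dtf[t] / max_dtf,        # NDTF, Eq. (2)
                df[t] / max_df,          # NDF,  Eq. (3)
                idf[t] / max_idf)        # NIDF, Eq. (4)
            for t in tf}
```

The resulting NDTF, NDF and NIDF values are then mapped to the linguistic labels of Figure 3 and fed into the fuzzy rules described below.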
Figure 3 shows the membership functions of the input/output variables - three inputs (NDTF, NDF, NIDF) and one output (TW) - used in our method. As can be seen in Figure 3(a), the NDTF variable has { S (Small), L (Large) }, and the NDF and NIDF variables have { S (Small), M (Middle), L (Large) } as linguistic labels (or terms). The fuzzy output variable TW (Term Weight), which represents the importance of a term, has six linguistic labels, as shown in Figure 3(b).

[Figure: (a) input variables NDTF (labels S, L) and NDF, NIDF (labels S, M, L), defined on [0, 1] with breakpoints at 0.15, 0.25, 0.35, 0.65, 0.75, 0.85; (b) output variable TW with labels Z (zero), S (small), M (middle), L (large), X (x-large), XX (xx-large); vertical axes give the degree of membership.]
Fig. 3. Fuzzy variables

The 18 fuzzy rules used to infer the term weight (TW) are given in Figure 4. The rules are constructed based on the intuition that important or representative terms occur across many positive example documents (the user's preferred movies) but not in general documents, i.e., their NDF and NIDF are very high. As shown in Figure 4, the TW of a term is Z in most cases, regardless of its NDF and NDTF, if its NIDF is S, because such a term may occur frequently in any document and thus its NDF and NDTF can be high. When the NDF of a term is high and its NIDF is also high, the term is considered a representative keyword, and the output value is then between X and XX. The other rules were set similarly.

NDTF = S:
  NDF \ NIDF |  S   M   L
  S          |  Z   Z   S
  M          |  Z   M   L
  L          |  S   L   X

NDTF = L:
  NDF \ NIDF |  S   M   L
  S          |  Z   S   M
  M          |  Z   L   X
  L          |  S   X   XX

Fig. 4. Fuzzy inference rules

We can obtain the term weight TW through the following procedure. The output is in the form of a fuzzy set and thus has to be converted to a crisp value; in this paper, the center of gravity (COG) method is used to defuzzify the output [6].

1. Apply the NDTF, NDF, and NIDF fuzzy values to the antecedent portions of the 18 fuzzy rules.
2. Find the minimum value among the membership degrees of the three input fuzzy values.
3. Classify every one of the 18 membership degrees into 6 groups according to the fuzzy output variable TW.
4. Calculate the maximum output value for each group, generating 6 output values.

For instance, assume there is a term whose NIDF is 0.35, NDF is 0.2, and NDTF is 0.3. The degree of membership is determined by plugging the selected input value (NIDF, NDF, or NDTF) into the horizontal axis and projecting vertically to the upper boundary of the membership function(s) in Figure 3. The result is as follows:

NIDF = 0.35: S = 0.00, M = 0.57;
NDF = 0.20: S = 0.43, M = 0.14;
NDTF = 0.30: S = 0.53, L = 0.07.

Now, referring back to the rules, only 8 of the 18 rules need to be selected; they are marked in Figure 4. The effective rules are listed as follows.
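A minimal sketch of steps 1-4 above, applied to the worked example, is given below. The rule table follows the reconstruction in Figure 4, and the centroid assumed for each TW label, as well as the simplified centroid-weighted form of the COG defuzzification, are illustrative guesses rather than values stated in the paper.

```python
# Rules reconstructed from Fig. 4 (rows: NDF, columns: NIDF) -- an assumption,
# not the authors' exact table.
RULES = {
    "S": {  # NDTF = S
        ("S", "S"): "Z", ("S", "M"): "Z", ("S", "L"): "S",
        ("M", "S"): "Z", ("M", "M"): "M", ("M", "L"): "L",
        ("L", "S"): "S", ("L", "M"): "L", ("L", "L"): "X",
    },
    "L": {  # NDTF = L
        ("S", "S"): "Z", ("S", "M"): "S", ("S", "L"): "M",
        ("M", "S"): "Z", ("M", "M"): "L", ("M", "L"): "X",
        ("L", "S"): "S", ("L", "M"): "X", ("L", "L"): "XX",
    },
}

# Hypothetical centroids of the six TW output labels on [0, 1] (cf. Fig. 3(b)).
CENTROIDS = {"Z": 0.0, "S": 0.2, "M": 0.4, "L": 0.6, "X": 0.8, "XX": 1.0}

def infer_tw(ndtf_memb, ndf_memb, nidf_memb):
    """Steps 1-4 plus a simplified COG defuzzification.

    Each argument maps a linguistic label to its membership degree,
    e.g. ndf_memb = {"S": 0.43, "M": 0.14} for NDF = 0.20.
    """
    group = {label: 0.0 for label in CENTROIDS}              # step 3: one group per TW label
    for ndtf_label, table in RULES.items():
        for (ndf_label, nidf_label), tw_label in table.items():
            strength = min(ndtf_memb.get(ndtf_label, 0.0),   # step 2: minimum of the
                           ndf_memb.get(ndf_label, 0.0),     # three antecedent degrees
                           nidf_memb.get(nidf_label, 0.0))
            group[tw_label] = max(group[tw_label], strength) # step 4: maximum per group
    total = sum(group.values())
    if total == 0.0:
        return 0.0
    # centre of gravity over the (centroid-based) output fuzzy set
    return sum(CENTROIDS[l] * w for l, w in group.items()) / total

# Worked example from the text: NIDF = 0.35, NDF = 0.20, NDTF = 0.30.
tw = infer_tw({"S": 0.53, "L": 0.07}, {"S": 0.43, "M": 0.14}, {"S": 0.00, "M": 0.57})
print(round(tw, 3))
```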