Collaborative Filtering for Information Recommendation Systems
Anne Yun-An Chen and Dennis McLeod
Department of Computer Science and Integrated Media System Center, University of Southern California, Los Angeles, California, USA

INTRODUCTION

To draw users’ attention and increase their satisfaction with online search results, search engine developers and vendors try to predict user preferences from user behavior, and they present the resulting recommendations to users. Recommendation systems are implemented in commercial and non-profit web sites to predict user preferences. For commercial web sites, accurate predictions may result in higher sales. The main functions of recommendation systems include analyzing user data and extracting useful information for further predictions. Recommendation systems are designed to let users locate preferred items quickly and to avoid information overload. They apply data mining techniques to determine the similarity among thousands or even millions of data items. Collaborative filtering techniques have been successful in enabling the prediction of user preferences in recommendation systems (Hill et al., 1995, Shardanand & Maes, 1995). There are three major processes in recommendation systems: object data collection and representation, similarity decisions, and recommendation computation. Collaborative filtering aims at finding the relationships between a new individual and the existing data in order to determine similarity and provide recommendations. How to define similarity is an important issue: how similar should two objects be in order to finalize the preference prediction? Different collaborative filtering techniques reach similarity decisions in different ways.
For example, people who like and dislike movies in the same categories would be considered to have similar behavior (Chee et al., 2001). The nearest-neighbor algorithm has been incorporated into the implementation of recommendation systems (Resnick et al., 1994). The designs of pioneering recommendation systems focused on entertainment fields (Resnick et al., 1994, Hill et al., 1995, Shardanand & Maes, 1995, Dahlen et al., 1998). The main challenge for conventional collaborative filtering algorithms is scalability (Sarwar et al., 2000a). Conventional algorithms explore the relationships among system users in large datasets. User data are dynamic, which means the data vary within a short time period: current users may change their behavior patterns, and new users may enter the system at any moment. Millions of
user records, which are called neighbors, must be examined in real time in order to provide recommendations (Herlocker et al., 1999). Searching among millions of neighbors is a time-consuming process. To solve this, item-based collaborative filtering algorithms have been proposed; they reduce computation because the properties of items are relatively static (Sarwar et al., 2001). Suggest is a Top-N recommendation engine implemented with item-based recommendation algorithms (Karypis, 2000, Deshpande & Karypis, 2004). Moreover, the number of items is usually smaller than the number of users. In early 2004, Amzn Investor Relations (2004) stated that the Amazon.com Apparel & Accessories Store offered about one hundred and fifty thousand items but had more than one million customer accounts that had ordered from the store. Amazon.com employs an item-based algorithm for collaborative-filtering-based recommendations (Linden et al., 2003) to avoid the disadvantages of conventional collaborative filtering algorithms.

BACKGROUND

Collaborative filtering techniques collect and establish profiles, and determine the relationships among the data according to similarity models. The possible categories of data in the profiles include user preferences, user behavior patterns, and item properties. Collaborative filtering overcomes several limitations of content-based filtering techniques (Balabanovic & Shoham, 1997), which decide user preferences based only on the individual profile. Collaborative filtering appears under different names in the literature; Social Filtering and Automated Collaborative Filtering (ACF) are two frequently used terms. Collaborative-filtering-based recommendation systems are also referred to as Collaborative Filtering Recommender systems and Automated Collaborative Filtering systems. Several collaborative-filtering-based recommendation systems have been designed and implemented since the early 1990s.
Collaborative filtering techniques have been shown to provide satisfying recommendations to users (Hill et al., 1995, Shardanand & Maes, 1995). The GroupLens project, a recommendation system for netnews, has investigated issues in automated collaborative filtering since 1992 (Resnick et al., 1994, Konstan et al., 1997). In the system design, Better Bit Bureaus (BBBs) were developed to predict user preferences by computing the correlation coefficients between users and by averaging the ratings for a news article from all users. MovieLens is a movie recommendation system based on GroupLens technology (Miller et al., 2003). RECommendation Tree (RecTree) is a method that uses a divide-and-conquer approach to improve correlation-based collaborative filtering and performs clustering on users' movie ratings (Chee et al., 2001). The ratings are
extracted from the MovieLens dataset. Ringo (Shardanand & Maes, 1995) provides music recommendations using a word-of-mouth recommendation mechanism. The paper used the term “social information filtering” instead of collaborative filtering. Ringo determines the similarity of users based on user rating profiles. Firefly and Gustos are two recommendation systems that employed the word-of-mouth recommendation mechanism to recommend products. WebWatcher was designed to assist information searches on the World Wide Web (Armstrong et al., 1995). WebWatcher suggests to users which hyperlinks would lead to the information they want. The general function serving as the similarity model is generated by learning from a sample of training data logged from users. Yenta is a multi-agent matchmaking system implemented with a clustering algorithm and a referral mechanism (Foner, 1997). Jester is an online joke recommendation system based on the Eigentaste algorithm, which was proposed to reduce the dimensionality of offline clustering and to perform online computations in constant time (Goldberg et al., 2000). The clustering is based on continuous user ratings of jokes. One of the most famous recommendation systems today is the Amazon.com Recommendation system (Linden et al., 2003). This system incorporates a matrix of item similarities, which is computed offline. Launch, music on Yahoo!, Cinemax.com, Moviecritic, TV Recommender, Video Guide and the suggestion box, and CDnow.com are other successful examples of collaborative-filtering-based recommendation systems in the entertainment domain. Many methods, algorithms, and models have been proposed to resolve the similarity decisions in collaborative-filtering-based recommendation systems. One of the most common methods of determining similarity is the cosine angle computation.
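As a minimal sketch of the cosine computation, assume each item is represented by the set of customers who bought it (a binary purchase vector). Under that assumption the cosine of the angle between two items reduces to the number of shared buyers divided by the geometric mean of the two buyer counts. The function name `item_cosine_matrix` and the sample data are hypothetical and are not taken from any of the cited systems.

```python
import math
from collections import defaultdict


def item_cosine_matrix(purchases):
    """Build an item-to-item cosine similarity table.

    purchases: dict mapping a customer id to the set of item ids bought.
    Returns a dict mapping (item_a, item_b) to their cosine similarity.
    """
    # Invert the data: for each item, collect the customers who bought it.
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)

    sims = {}
    items = sorted(buyers)
    for a in items:
        for b in items:
            if a == b:
                continue
            shared = len(buyers[a] & buyers[b])
            # For binary vectors: cos = |A & B| / sqrt(|A| * |B|).
            sims[(a, b)] = shared / math.sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


# Hypothetical purchase histories, for illustration only.
purchases = {
    "u1": {"book", "lamp"},
    "u2": {"book", "lamp", "desk"},
    "u3": {"desk"},
}
sims = item_cosine_matrix(purchases)
```

Because the table depends only on purchase histories, it can be computed ahead of time and looked up when a recommendation is requested. This all-pairs loop is quadratic in the number of items and is meant only to show the measure, not an efficient construction.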
The Amazon.com Recommendation system (Linden et al., 2003) uses this cosine measure to decide the similarity between every two items bought by each customer and to establish the item matrix, which contains item-to-item relationships. Several algorithms that combine knowledge from Artificial Intelligence (AI) (Mobasher et al., 2004), networking (Chien et al., 1999), and other fields have also been implemented in recommendation systems. A genetic algorithm together with a Naïve Bayes classifier is used to define the relationships among users and items (Ko et al., 2001). The genetic algorithm first performs clustering to discover relationships among system users, searching for the global optimum. The Naïve Bayes classifier, in turn, defines the association rules of items. Similarity decisions are then performed to match the clusters of users or clusters of items, and the system can decide the final user profiles. The user profiles consist only of association rules. Expectation Maximization
(EM) algorithm (Charalambous & Logothetis, 2000) provides a standard procedure for maximum likelihood estimation in latent variable models, and it has been applied to estimate different variants of the aspect model for collaborative filtering (Hoffman, 1999). A heuristic of the EM algorithm can be applied to latent class models to perform aspect extraction or clustering. Meanwhile, hierarchical structures are employed to describe the relationships among users (Jung et al., 2001). The preferences of each user can be described in a hierarchical structure. The structure represents an index of categories, which are the labels of the nodes. Matching one structure to another over all category labels yields nodes that each contain a group of users with similar preferences. Hierarchical structures can also be applied to similarity computations for items (Ganesan et al., 2003). Edges in the structure define how items are related in the item-to-item relationships. In a hierarchical structure (a tree), the relative weights specified for the edges indicate how closely two items are related. An order-based similarity measurement has been proposed for building a personal computer recommendation system (PCFinder) (Xiao et al., 2003). Instead of using 0/1 matching for the search, this method uses concepts from fuzzy logic to estimate similarity. Two popular approaches, correlation coefficient computation and the nearest-neighbor algorithm, have limitations in scalability and sparsity. Clustering (Breese et al., 1998), the Eigentaste algorithm (Goldberg et al., 2000), and Singular Value Decomposition (SVD) (Sarwar et al., 2000) have been introduced to collaborative-filtering-based recommendation systems to break these barriers. Eigentaste and genetic algorithms enable constant-time computation for online processes. Item-based collaborative filtering algorithms have been proposed to further decrease the computation time (Linden et al., 2003).
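The two popular approaches named above, correlation coefficient computation and the nearest-neighbor algorithm, are often combined in user-based collaborative filtering. The following is a minimal sketch under stated assumptions: ratings are numeric, similarity is the Pearson correlation over co-rated items, only positively correlated neighbors are used, and the prediction is the target user's mean rating plus a similarity-weighted average of the neighbors' deviations from their own means. The names `pearson` and `predict`, the parameter `k`, and the sample ratings are hypothetical; the cited systems may differ in their details.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0  # Too little overlap to measure a correlation.
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    den = math.sqrt(var_a) * math.sqrt(var_b)
    return num / den if den else 0.0


def predict(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    target = ratings[user]
    target_mean = sum(target.values()) / len(target)
    neighbors = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = pearson(target, other_ratings)
        if weight > 0:  # Keep only positively correlated neighbors.
            neighbors.append((weight, other_ratings))
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbors[:k]
    if not top:
        return target_mean  # No usable neighbor: fall back to the user's mean.
    num = sum(w * (r[item] - sum(r.values()) / len(r)) for w, r in top)
    den = sum(w for w, _ in top)
    return target_mean + num / den


# Hypothetical movie ratings on a 1 to 5 scale, for illustration only.
ratings = {
    "ann": {"m1": 5, "m2": 1, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 5, "m4": 5},
    "cat": {"m1": 1, "m2": 5, "m4": 2},
}
prediction = predict(ratings, "ann", "m4")
```

Every prediction scans all other users, which illustrates the scalability limitation, and users with few co-rated items yield no usable correlation, which illustrates the sparsity limitation.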
MAIN THRUST OF THE MANUSCRIPT

Privacy Issues and User Identification

Do users always agree to being monitored by the systems? Not every user is comfortable having each visited page recorded. Some users even disable cookies in their browsers. Recommendation systems usually require user registration in order to utilize user data for future recommendations. Some users prefer not to log in to the system every time they visit. Can the behavior patterns of random users be included in the data mining processes? It depends on the properties of the similarity models. Unregistered users provide only a few continuous behavior patterns.
These data may be of little use if the similarity models require the quantity of behavior patterns to reach a certain level. At the same time, these data may be treated as neighbors and included in the clustering processes for the recommendation computations. The computation time increases as more neighbors are included. Whether the data of fragmentary user behavior patterns should be included depends on the situation: if enlarging the data coverage increases prediction accuracy, there is a trade-off between the computation time and the coverage scale.

Drawbacks

Collaborative filtering still has several drawbacks. First, the lack of information affects the recommendation results. In relationship mining, new items that have not yet been rated or labeled may be left out of the recommendation processes. The second problem is that collaborative filtering may not cover extreme cases. If the user profiles are small or the users have unique tastes, similarity decisions cannot be established. The third problem is the update frequency. If any new user information has to be included in the recommendation processes in real time, data latency will increase the waiting time for the query result. The complexity of the recommendation computation directly affects the user's waiting time. Synchronization of profile updates in the system is another issue. When hundreds of users query the system within a very short time period, two new problems occur: who should be considered in a given clustering process, and how to pipeline the computational power of the system server.

Hybrid Methods

A newer approach combines content-based and collaborative filtering techniques in order to provide accurate predictions of user preferences. Judgments of how accurate the predictions are depend on the subjective opinions of the users.
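One simple way to picture such a combination is a weighted blend that falls back to the content-based score whenever collaborative filtering cannot produce a prediction, for example for a new item or a user with unique tastes. The sketch below is an assumption made for illustration and is not the method of any cited system: the content-based score is a Jaccard overlap between hypothetical keyword sets, both scores are assumed to lie in [0, 1], and the names `content_score`, `hybrid_score`, and `weight` are invented here.

```python
def content_score(profile_terms, item_terms):
    """Content-based score: Jaccard overlap of user and item keyword sets."""
    if not profile_terms or not item_terms:
        return 0.0
    return len(profile_terms & item_terms) / len(profile_terms | item_terms)


def hybrid_score(cf_score, content, weight=0.7):
    """Blend a collaborative score with a content-based score.

    cf_score is None when collaborative filtering has no prediction;
    the content-based score is then used on its own.
    """
    if cf_score is None:
        return content
    return weight * cf_score + (1 - weight) * content


# Hypothetical keyword sets, for illustration only.
user_terms = {"jazz", "piano"}
item_terms = {"jazz", "vocal"}
content = content_score(user_terms, item_terms)

cold_start = hybrid_score(None, content)           # no collaborative signal
blended = hybrid_score(0.8, 0.4, weight=0.5)       # both signals available
```

The fallback branch is what lets the combined system still score items and users that collaborative filtering alone cannot cover.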
A recommendation system that includes both technologies is a hybrid recommendation system (Balabanovic & Shoham, 1997). Hybrid methods solve the problem of extreme-case coverage that collaborative filtering techniques are unable to handle.

The Next Evaluation Tool for Information Retrieval (IR)

Precision and recall are two conventional measurements of data accuracy. User satisfaction has been an important issue in the IR area for the past decade. Recommendation system developers need to focus on what users prefer and avoid what users dislike. Evaluating user satisfaction is not an easy job. There are two