In a recommender system, selecting and tuning the appropriate recommender algorithm for both the user and the user's current information seeking task will generate a more useful recommendation list than a generic or un-tuned algorithm. In this dissertation, we develop new theoretical models, run offline simulation experiments, and conduct user studies in support of this thesis. Before outlining our research approach, there are two points we must first consider:

1. Importance of external validation criteria. As recommenders move into domains with additional validation criteria, the importance of considering the user's information seeking task will increase.

2. The current state of recommender metrics. Current predictive accuracy and decision support metrics are poor at evaluating the suitability of a recommender algorithm for an information seeking task [86].

External Validation

One of the criticisms of recommender systems up until this point is that they have mostly appeared in low-cost entertainment domains where decisions are based on taste, such as movies, music, television, books, jokes, etc. This is a valid criticism: if recommenders are going to be accepted by a larger audience, they need to generate high quality recommendations in domains where taste is not the main deciding factor in consuming an item.

We posit that users come to a recommender as part of an information seeking task. More important, we believe the importance of the information seeking task varies from domain to domain. In entertainment-based domains, the decision of whether to consume a recommendation is a taste-based decision. The user determines if she likes the item, and then acts accordingly. The information seeking task is one-dimensional: how much do I like it?
If we move into a non-entertainment domain, e.g. the domain of peer-reviewed research papers, we claim the decision of whether to consume the item becomes more complex. While taste may be one component, other non-taste criteria must also be satisfied. For example, if I am looking to add references to a paper I am writing, high quality paper recommendations from a different research area will not help me, independent of how much I might enjoy reading them. My external, non-taste criteria place constraints on what I choose to consume.

Until this point, recommenders have dealt with one criterion at a time, historically a taste-based criterion. Users may have any number of criteria when visiting a recommender system, independent of domain. For example, a user of a movie recommender may be limited to items released on VHS tape. To understand context, recommenders need to generate recommendations based on all of a user's criteria. We claim these external criteria are representations of the user's information seeking task. Thus, as the number and/or importance of these criteria increase, the more important our thesis becomes. If we want to tailor recommendations to a user's information seeking task, we first need to understand what these criteria are for a given user in a given domain.
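To make the idea of external criteria concrete, the following sketch is a minimal illustration of this point, not from the dissertation; the item fields, the VHS constraint, and the predicted-rating scores are assumptions chosen for the example. It shows a recommender keeping only the candidates that satisfy the user's non-taste constraint before ranking by taste:

```python
# Hypothetical illustration: an external, non-taste criterion acts as a hard
# constraint on candidate items before taste-based ranking.
candidates = [
    {"title": "The Matrix", "format": "DVD", "predicted_rating": 4.8},
    {"title": "Casablanca", "format": "VHS", "predicted_rating": 4.5},
    {"title": "Vertigo", "format": "VHS", "predicted_rating": 4.2},
]

# The user's information seeking task supplies the criterion,
# e.g. "only items released on VHS tape".
def satisfies_criteria(item):
    return item["format"] == "VHS"

recommendations = sorted(
    (item for item in candidates if satisfies_criteria(item)),
    key=lambda item: item["predicted_rating"],
    reverse=True,
)
print([item["title"] for item in recommendations])  # ['Casablanca', 'Vertigo']
```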
The Curse of Accuracy

Researchers and practitioners have long used accuracy to judge the goodness of recommender algorithms. Many metrics have been proposed and used to measure accuracy, including ROC curves [55], modifications to precision and recall [131], Breese's Half-Life metric [13], and, most commonly used, Mean Absolute Error (MAE) [13, 55, 135]. For example, [51] provides an analysis of 432 variants of the User-User Collaborative Filtering algorithm run against all accuracy metrics.
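For reference, MAE is the average absolute difference between the recommender's predictions and the withheld ratings; for N withheld ratings r_i with predictions p_i (the notation here is ours, not the dissertation's):

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| p_i - r_i \right|

Lower values mean the predictions track the withheld ratings more closely; the score says nothing about whether the recommended items are ones the user has not already seen.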
While accuracy is an important component of a recommender algorithm, focusing solely on it leads to two different problems.

The first problem comes from the way accuracy-centric analysis views the recommendation process. At a high level, this process is one where a user, either with or without establishing a user model, makes a request of the recommender in the form of a basket of items and ratings. The recommender algorithm performs a computation based on this basket and returns a recommendation list to the user. In this setting, each request sent to the recommender is an independent event done in isolation of all other recommendation events. While this may be true for the algorithm itself, it is not true for the user.

Users see each recommendation in the context of other recommendations: in a list. Independently good recommendations may create a bad recommendation list. For example, if we assume that Tolkien's Lord of the Rings would be a good book recommendation, would a list containing ten different editions of that book (e.g. hardcover, paperback, split into three volumes, combined into one volume, etc.) be a good recommendation list? As we argue in Chapter 4, we believe a recommendation list needs to be evaluated as a single recommendation entity. Moreover, the recommendation process is iterative. One recommendation list is evaluated in the context of previous lists the user has seen. The user will have different opinions of a consistent algorithm compared to an inconsistent one. A metric that judges independent events, even recommendation lists, will miss this temporal component of the recommendation process. The fact that users return to recommenders is an important aspect of the recommendation process.

The second problem is an artifact of the standard approach used to measure accuracy. The leave-n-out methodology [131] works by splitting collected ratings data into test and train datasets, removing n ratings from each test user, and recording how well the recommender can predict back the removed ratings. In essence, you hide a portion of the data and check how well the recommender reconstructs it. Leave-n-out is commonly used in machine learning to test the accuracy of classification algorithms [97].
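The following is a minimal sketch of this methodology, ours rather than the dissertation's; the data layout, function names, and the item-mean stand-in predictor are illustrative assumptions. It hides n ratings per user, asks a predictor for them back, and scores the result with MAE:

```python
import random
from collections import defaultdict

def leave_n_out_mae(ratings, predict, n=1, seed=0):
    """Hide n ratings per user, predict them back, and report MAE.

    ratings: list of (user, item, rating) triples
    predict: callable(train, user, item) -> predicted rating
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))

    train, test = [], []
    for user, user_ratings in by_user.items():
        rng.shuffle(user_ratings)
        hidden, kept = user_ratings[:n], user_ratings[n:]
        test.extend((user, item, r) for item, r in hidden)
        train.extend((user, item, r) for item, r in kept)

    # Mean Absolute Error over the hidden (withheld) ratings only.
    errors = [abs(predict(train, user, item) - r) for user, item, r in test]
    return sum(errors) / len(errors)

def item_mean_predictor(train, user, item):
    # Stand-in for a real recommender: predict the item's mean training
    # rating, falling back to the global mean for unseen items.
    item_ratings = [r for _, i, r in train if i == item]
    pool = item_ratings or [r for _, _, r in train]
    return sum(pool) / len(pool)

ratings = [("alice", "LOTR", 5), ("alice", "Dune", 3), ("alice", "Emma", 4),
           ("bob", "LOTR", 4), ("bob", "Dune", 2), ("bob", "Emma", 5)]
print(leave_n_out_mae(ratings, item_mean_predictor, n=1))
```

Note what the score rewards: reconstructing ratings the user already supplied. Nothing in the protocol asks whether the recommender can surface items the user has never seen, which is exactly the problem discussed next.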
If we approach leave-n-out from the user's perspective, it is equivalent to recommending items the user has already rated. This is not as useful as it could be. For example, when looking for recommendations on new places to visit on vacation, a travel guide book only containing information on places you have visited before is not helpful. Moreover, if the guide recommended some places you've never been, say Beijing and Prague, then the leave-n-out methodology penalizes it for failing to recommend instead the places you have been, even if you'd like Beijing better than Boston or Prague better than Paris. Even if the book correctly ordered the places such that your favorites were first, this book is useless for planning a vacation. Accuracy metrics which reward algorithms for generating recommendations of withheld items are not measuring how well an algorithm performs on the task users care about: generating recommendations for items they have not seen.

The problem is rooted in the difference between a classifier and a recommender. Classifiers segment spaces into their most probable classes. In recommendation terms, a classifier would find, and recommend, the items the user is most likely to rate next. While these items would have high "ratability" for that user, there is no guarantee that these items will help users with their information seeking tasks. For instance, an online music store used a User-based collaborative filtering algorithm to generate recommendations. The most common recommendation was for the Beatles' "White Album". From an accuracy perspective, these recommendations were dead-on: most users like that album very much. From a usefulness perspective, though, the recommendations were a complete failure: every user either already owned the "White Album", or had specifically chosen not to own it. Even though it was highly ratable, White Album recommendations were almost never acted on by users, because they added almost no value.

In order to tailor recommendation lists not only to a user, but to a user's information seeking task, we will need to judge recommender algorithms using a variety of metrics, each of which measures a different property of that algorithm and each of which corresponds to different aspects of a user's information seeking task. When we have that information, we can select and tune the appropriate algorithm(s) for each user and task. In short, we need to re-think how to generate a 'good' recommendation list.

Building Bridges

In Figure 1-1, we show the current state of recommender systems, the state of the world before this dissertation. Between the user and the recommender is a space we call the gap of intention.
We argue that not only is the information channel between user and recommender too narrow, but that the two sides may not understand the cues currently transmitted across this channel. Information the user wishes to send concerning her intentions for using the recommender is lost to the gap. In much the same way, contextual information may not come back to the user; the recommender's intentions also fall into the gap.

Figure 1-1: The Intention Gap between Users and Recommenders

Our solution to this problem is to build a bridge: provide an organizing language the two sides can use to communicate their intentions, and categorize recommender algorithms in terms of this language abstraction. As shown in Figure 1-2, our bridge is a process model connecting users and their needs to recommender algorithms. It adds two nodes, Human-Recommender Interaction theory (HRI) and a new set of recommender metrics, as well as a set of processes connecting those nodes.

HRI is a new framework and methodology for analyzing both user information seeking tasks and recommendation algorithms in the context of a recommender system, with the end goal of generating useful recommendation lists. HRI was developed by re-examining the recommendation process from the end user's perspective and categorizing

* Transparency in the recommendation process has long been advocated [137], but few existing recommenders provide insight as to how or why items were recommended.