Stanford University Digital Libraries Project Working Paper SIDL-WP-1996-0041
Revised version to appear in First International Conference on Autonomous Agents, February 1997, Marina del Rey CA

An Adaptive Web Page Recommendation Service

Marko Balabanovic
Department of Computer Science, Stanford University

Abstract

An adaptive recommendation service seeks to adapt to its users, providing increasingly personalized recommendations over time. In this paper we introduce the "Fab" adaptive web page recommendation service. There has been much research on analyzing document content in order to improve recommendations or search results. More recently researchers have begun to explore how the similarities between users can be exploited to the same ends. The Fab system strikes a balance between these two approaches, taking advantage of the shared interests among users without losing the benefits of the representations provided by content analysis. Running since March 1996, it has been populated with a collection of agents for the collection and selection of web pages, whose interaction fosters emergent collaborative properties. In this paper we explain the design of the system architecture and report the results of our first experiment, evaluating recommendations provided to a group of test users.

1 Introduction

When faced with an insurmountably large array of choices, people often turn to some kind of recommendation service. For instance they will look in their newspaper for movie reviews, or in a travel guide for tourist sights. We are interested in creating an automatic recommendation service which over time adapts to its users, so that they receive increasingly personalized recommendations. We can think of on-line recommendation as a three-stage process:

Collection First collect the items to be recommended.

Selection Next select from the collected items those best for a particular user.

Delivery Finally deliver the selected items to the user.

Most on-line recommendation or search services can be described in this way. For instance a web index (e.g., (Mauldin & Leavitt 1994)) will have an exhaustive collection component, a selection component which is an information retrieval (IR) system to match web pages to user-supplied queries, and will deliver the pages via the web. A news clipping service collects items by subscribing to news wires, selects items for users according to profiles they submit, and might deliver them by fax or e-mail.

Our work explores adaptive methods for the collection and selection of web pages. In order to facilitate this exploration we have designed the "Fab" architecture as a test-bed for trying out different collection and selection methods, supporting a population of users and recording experimental results.

The two major classes of adaptive recommendation service are content-based (recommending items based on some analysis of their content) and collaborative (recommending items based on the recommendations of other users). We have attempted to establish a middle ground, taking advantage of the shared interests among users without losing the benefits of the representations provided by content analysis.

In this paper we report results from our ongoing first experiment. The Fab system has been instantiated with classes of adaptive content-based collection and selection agents, whose interaction fosters emergent collaborative properties. By "agent" we simply mean a process which maintains a long-term state, in our case a profile which can be used to induce a ranking among web pages. The operation of the system is as follows: users can request recommendations at any time, and will be shown the ten highest-ranking web pages according to their profile. They rate each page according to how well it matches their interests, and the collection and selection agents use this feedback to refine their profiles.
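The three-stage process above can be made concrete with a short sketch. This is a hypothetical illustration, not code from Fab or from any of the services mentioned; the names `collectors`, `select` and `deliver` are our own.

```python
# Hypothetical sketch of the collection/selection/delivery pipeline
# described above; none of these names come from the Fab system itself.

def recommend(collectors, select, deliver, profile, k=10):
    # Collection: gather candidate items from every source.
    candidates = [item for collect in collectors for item in collect()]
    # Selection: rank the candidates for this particular user.
    ranked = sorted(candidates, key=lambda item: select(item, profile),
                    reverse=True)
    # Delivery: hand over the k best items.
    return deliver(ranked[:k])

# Toy usage: a single collector and a lookup-table scoring function.
pages = recommend(
    collectors=[lambda: ["pageA", "pageB", "pageC"]],
    select=lambda item, profile: profile.get(item, 0),
    deliver=lambda items: items,
    profile={"pageB": 2, "pageC": 1},
    k=2,
)  # pages == ["pageB", "pageC"]
```

In these terms, a web index instantiates `collectors` with an exhaustive crawler and `select` with an IR matching function, while a clipping service instantiates them with news-wire subscriptions and submitted profiles.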
The following section places the work in context by describing related projects from both the IR and the AI communities. We then describe the design of the architecture and of the agents which populate it. Finally we describe our experimental design and give results from our first experiment.
2 Related Work

Two strands of related research are of particular relevance: work on document classification (which considers alternative techniques for selection) and collaborative systems (which propose a different architecture based on a novel selection technique).

2.1 Document Classification

Document classification lies at the intersection of machine learning (ML) and IR, and there is a large literature. To enable the application of ML techniques there has been much interest recently in dimensionality reduction (e.g., (Deerwester et al. 1990)). In the IR community variants of relevance feedback have been studied in the context of the routing task at the TREC conferences (Harman 1995), for example (Buckley et al. 1995; Allan 1995). There have also been numerous comparisons between these variants and non-incremental ML techniques (Schutze, Hull, & Pedersen 1995; Lang 1995; Pazzani, Muramatsu, & Billsus 1996). One of the disadvantages of such techniques is the large number of examples that are required before the learning algorithm can be applied. In contrast we assume an on-line model where the user gradually sees better and better pages, and we do not assume a fixed document collection: users' feedback influences which pages are collected and shown to them.

Some of the work on document filtering (where the classification is used to pick relevant documents from an incoming stream) shares our on-line model of learning (Sheth & Maes 1993; Foltz & Dumais 1992). Assisted browsing systems (Armstrong et al. 1995; Lieberman 1995) share our goal of recommending web pages, but restrict themselves to the section of the web just ahead of the user's current browser location, and then recommend appropriate links to follow.

2.2 Collaborative Systems

Collaborative recommendation services (Shardanand & Maes 1995; Resnick et al. 1994) present an interesting alternative to traditional IR techniques.
Typically they work by identifying the nearest neighbors to a user in the space of past ratings, and then recommending something one of these neighbors has rated highly. Domains with a constant stream of new items are difficult for collaborative systems: a single item may not be rated very often by different users, and much of the information gained by the system is rooted in the past and therefore unusable. Examples of such domains are web pages or news feeds. In contrast more static domains (e.g., music) or those with a much smaller number of new items appearing (e.g., movies) will enable collaborative systems to work well, as lots of different users will rate the same items. Additionally for these domains it is very hard to do any content analysis, so content-based systems will be harder to implement.

3 System Design

3.1 Background

For the content-based approach which we are taking, there are four essential requirements:

w A representation of a web page.

m A representation of the user's interests.

r(w, m) A function to determine the pertinence of a web page given a user's interests.

u(w, m, s) A function returning an updated user profile m given the user's feedback s on a page w.

The assumption underlying content-based systems is that the content (rather than appearance, interactivity, speed of loading, etc.) of a page is what determines the user's interest. Now we make the further assumption that we can represent the content of a page purely by considering the words contained in the text. We ignore all mark-up tags, images and other multimedia information.

The vector-space model of information retrieval (Salton & McGill 1983) provides us with an appropriate representation for documents based on their constituent words. This model has been used and studied extensively, forms the basis for commercial web search systems and has been shown to be competitive with alternative IR methods (Harman 1995). In this model documents and queries are represented as vectors.
Assume some dictionary vector d, where each element d_i is a word. Each document then has a vector w, where element w_i is the weight of word d_i for that document. If the document does not contain d_i then w_i = 0.

In our formulation we reduce words to their stems using the Porter algorithm (Porter 1980), ignore words from a standard stop list of 571 words, and calculate a TFIDF weight: the weight w_i of a word d_i in a document W is given by:

    w_i = (0.5 + 0.5 * tf(i)/tf_max) * log(n / df(i))

where tf(i) is the number of times word d_i appears in document W (the term frequency), df(i) is the number of documents in the collection which contain d_i (the document frequency), n is the number of documents in the collection and tf_max is the maximum term frequency over all words in W. The document frequencies are approximated by using a fixed dictionary of approximately 70,000 stemmed words created from a sample of 5,229 randomly picked web pages (so in the above equation n = 5,229). If d_i does not occur in the dictionary we set df(i) = 1.

In an attempt to avoid over-fitting, and to reduce memory and communications load, we use just the 100 highest-weighted words from any document. Recent experiments described in (Pazzani, Muramatsu, & Billsus 1996) have shown that using too many words leads
to a decrease in performance when classifying web pages using supervised learning methods. Our exploratory experiments have agreed with their findings that the optimum is between 30 and 100. We are using the figure of 100 pending more extensive experiments which will be possible once more user ratings have been gathered.

Once the top 100 words have been picked we normalize w to be of unit length, to allow comparisons between documents of different lengths.

This vector representation is used both for pages, as explained, and for the model of the user's interests, the user profile m (which corresponds to the query in a retrospective IR system). In order to measure how well a page w matches a profile m, we use a variant of the standard IR cosine measure:

    r(w, m) = w . m

Updating m also corresponds to a normal operation in retrospective IR: relevance feedback (Rocchio 1971). We use a simple update rule:

    u(w, m, s) = m + s w

where s is the user's score for page w (we translate the scale shown in Figure 1 to the integers 3, 2, 1, 0, -1, -2, -3).

Figure 1: Ordinal scale users use to rank documents (Excellent, Very Good, Good, Neutral, Poor, Very Bad, Terrible).

Relevance feedback has been demonstrated to be a powerful technique: experiments have shown that queries automatically generated through relevance feedback can outperform human searchers (Foltz & Dumais 1992; Harman 1995).

We assume that a user's interests change over time, and furthermore that this happens in real time, regardless of the number of recommendations requested per day. This process is modelled using a simple decay function: each night all the weights in the profiles are multiplied by 0.97.

3.2 Architecture

Figure 2 shows the main components of the architecture: selection agents, collection agents and the central repository. All agents maintain a profile: each user has a selection agent, which maintains their user profile; each collection agent maintains a search profile which is used to guide it in its collection of web pages.
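Before turning to the architecture, the machinery of section 3.1 (the TFIDF page vector w, the pertinence function r(w, m), the update u(w, m, s) and the nightly decay) can be collected into a short sketch. It is an illustration under stated assumptions, not the actual Fab code: sparse vectors are plain Python dictionaries, and stemming and stop-listing are assumed to have been done upstream.

```python
import math

def tfidf_vector(words, df, n=5229, top_k=100):
    # `words` is the document's token list (assumed already stemmed and
    # stop-listed); `df` maps word -> document frequency in the fixed
    # dictionary; unknown words get df = 1, as in the paper.
    tf = {}
    for word in words:
        tf[word] = tf.get(word, 0) + 1
    tf_max = max(tf.values())
    weights = {t: (0.5 + 0.5 * tf[t] / tf_max) * math.log(n / df.get(t, 1))
               for t in tf}
    # Keep the top_k highest-weighted words, then normalize to unit length.
    top = dict(sorted(weights.items(), key=lambda kv: kv[1],
                      reverse=True)[:top_k])
    norm = math.sqrt(sum(v * v for v in top.values()))
    return {t: v / norm for t, v in top.items()} if norm else top

def r(w, m):
    # Pertinence of page w to profile m: dot product (the cosine variant,
    # since page vectors are unit length).
    return sum(weight * m.get(word, 0.0) for word, weight in w.items())

def u(w, m, s):
    # Relevance feedback: m + s*w, with s an integer in -3..3 (Figure 1).
    out = dict(m)
    for word, weight in w.items():
        out[word] = out.get(word, 0.0) + s * weight
    return out

def nightly_decay(m, factor=0.97):
    # Model drifting interests: every profile weight decays each night.
    return {word: weight * factor for word, weight in m.items()}

w = tfidf_vector(["agent", "agent", "web"], df={"agent": 100, "web": 2000})
m = u(w, {}, 3)     # the user rated this page "Excellent" (+3)
score = r(w, m)     # 3 * (w . w) = 3.0, since w has unit length
m = nightly_decay(m)
```

Starting from an empty profile, a single positive rating already makes the rated page (and pages sharing its vocabulary) score highly, which is the behavior the feedback loop in the following architecture relies on.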
The goals underlying the design of the architecture were twofold:

To allow scaling up, so that it would be easy to add extra users, agents and resources.

To take advantage of potential overlaps between users' interests.

The architecture is best explained by considering the path of a document W through the system. Let's say the document is found by agent A, one of the various types of collection agent (more details in section 3.3), and immediately converted to its vector representation w. At regular intervals collection agents submit the pages they have gathered which best match their search profiles to the central repository, replacing the pages they previously submitted. Thus at any time the repository contains each collection agent's best pages, in their own opinions.

When a user requests their Fab recommendations, their selection agent (of which there is one per user) picks, from the entire repository, those pages which best match the user's personal profile. The user then ranks these pages. These rankings are used both for evaluation purposes (discussed in section 4) and as feedback for the agents to learn from. The selection agent uses the feedback to update the user's personal profile (using the function u). It also forwards the feedback, via the central repository, to the originating agent A, which will update its search profile in the same way.

There are several advantageous features of this architecture:

A user's personal profile is a valuable resource, representing considerable effort and time spent ranking pages. Its representation as part of the user's selection agent means that the user's information will never be lost through some adaptation scheme or "drowned out" by other users (it is only updated with the user's own feedback).
Furthermore it makes the profile available to the user for use with other applications (e.g., e-mail or news filtering), as opposed to in a purely collaborative system, where there is no meaningful way to consider one user's profile in isolation.

A brand new user to the system is shown a selection of pages which are randomly chosen from the repository. However the repository contains pages which various agents believe will best match the current user population. Thus the new user is already starting from a much higher level than would be expected from an empty profile, especially if the system is deployed in an organization or special interest group where there will be significant overlap between users' interests.

It is possible to cheaply maintain a population of "parasitic" users. We can add every evaluation received (from all users) into an amalgamated profile, which represents an average of the user community. We can then pick pages from the repository according to this profile and make them available, thus
allowing "parasites" to view pages preselected by an established group of users, without any requirement to provide feedback.[1]

Figure 2: Overview of the Fab architecture.

In addition to these features, our design relies on the hypothesis that the following two behaviors will be exhibited:

Specialization By only giving feedback to the collection agent which originally found a page, we are attempting to encourage specialization. If an agent begins to specialize to a topic, users not interested in that topic should no longer be shown pages from that agent, which should narrow its specialization further. Eventually we expect the agent will settle into a niche where there is a group of users interested in the topic which its profile represents. In future experiments we hope to see the overlap between users' interests leading to an economy of scale, whereby several users interested in one topic can be served by a single agent, and conversely users interested in several topics can be served by several agents.

Serendipity If any collection agent finds a page which matches a user's profile, the user should see it. This includes agents who have specialized to finding pages for other groups of users, agents who monitor various existing web page recommendation services, agents which merely produce random pages, etc. This property can be thought of as allowing for serendipitous "word-of-mouth" recommendation, whereby the user is shown pages from outside of their regular sources which happen to match their interests.

So far we have only discussed the short-term learning of the agents: updating profiles based on users' rankings of pages. However there also exists a need for a longer-term evolution of the agents, to ensure a spread of profiles in the space of topics.

[1] Top pages from Fab selected according to its amalgamated profile are available daily at http://fab.stanford.edu.
It is undesirable to maintain collection agents which have very similar profiles (it would be more efficient to combine them), or agents whose pages are rarely picked or receive very low scores. We have defined our architecture so that this long-term adaptation component is easily replaceable, as we intend to conduct further research to investigate different schemes.

3.3 Collection Agents

We have implemented two kinds of adaptive collection agents, along with a variety of non-adaptive kinds for comparative purposes.

Search Agents First described in (Balabanovic & Shoham 1995), these agents perform a best-first search of the web. For each agent the heuristic function used to evaluate a web page w is r(w; m), where m is the agent's search profile. In order to better cope with the increasing size and branching factor of the web, we limit the size of the search fringe, and restrict the number of nodes from any particular site. A record of all nodes visited is maintained separately, with a one-month timeout. In effect we have a hybrid best-first/beam search.

Index Agents Rather than actually searching the web, these agents attempt to construct queries for existing web indexes in an attempt to avoid duplicating work. The indexes used are Alta Vista², Inktomi and Excite. Since little information about the IR models employed by these commercial systems is forthcoming, it is hard to design optimal queries. Furthermore the

²Links to all the on-line services mentioned can be found on the web at http://fab.stanford.edu/papers/aa97/
indexes are tailored mainly to short queries, and entering more than 20 search terms often results in zero recall. Thus the current implementation is fairly rudimentary: submit as a disjunctive query the top p words from the agent profile, where p has been optimized by hand for different indexes, and varies from 10 to 20.

Non-Adaptive Agents For purposes of comparison we have implemented some simpler agents which do not maintain their own profile:

no-memory index agents These agents work exactly like regular index agents, but they draw their words from the amalgamated profile. Thus these agents are not trying to specialize but are serving the "mass market".

random agents These agents retrieve pages from various sources of random pages available on the web (i.e. they are not truly randomly picked pages, but rather randomly picked from various preselected collections). Currently pages are drawn from Alta Vista, Yahoo, URouLette.

cool agents These agents retrieve human-recommended pages from nine "cool page of the day" sites around the web. These pages have been selected for their interest to the general community, not any particular user, and are often newly released sites.

Long-term evolution As explained in section 3.2, some mechanism to insure that agents' profiles spread out in the vector space is required. This remains a topic for further research, but for the time being we have a preliminary implementation. The following procedure is followed separately for each agent type, once a week:

1. Find best and worst agents over the last week, according to their median feedback score.

2. If they both scored worse than Neutral, restart the worst agent from scratch.

3. If the worst agent scored worse than Neutral but the best agent scored better, duplicate the best agent and kill the worst.

3.4 Selection Agents

Currently only one selection method is used, based on the comparison function r(w; m).
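The weekly long-term evolution procedure of section 3.3 can be sketched as follows. The Neutral threshold value, the per-agent feedback lists, and the restart/duplicate hooks are assumed interfaces, not taken from the paper:

```python
import statistics

NEUTRAL = 0.0  # assumed midpoint of the feedback scale


def evolve(agents, feedback, restart, duplicate):
    """Weekly long-term adaptation, run separately for each agent type.

    feedback[agent]:        list of last week's scores for that agent
    restart(agent):         re-initialize the agent from scratch
    duplicate(best, worst): copy the best agent and kill the worst
    """
    median = {a: statistics.median(feedback[a]) for a in agents}
    best = max(agents, key=median.get)
    worst = min(agents, key=median.get)
    if median[best] < NEUTRAL and median[worst] < NEUTRAL:
        restart(worst)                 # everyone is doing badly
    elif median[worst] < NEUTRAL and median[best] > NEUTRAL:
        duplicate(best, worst)         # clear winner, clear loser
```

Restarting when even the best agent scores below Neutral injects fresh profiles; duplicating a winner over a loser reallocates agents toward topics users actually reward.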
We calculate the value of this function (using the user's profile) for every page in the repository. The highest-scoring pages are shown to the user, with the proviso that no two are identical or from the same site, and that the user has not seen an identical page in the last month.

4 Experimental Design

4.1 Background

We have decided to use a new performance measure developed by Yao (1995) in our experiments, in preference to the standard IR measures of precision and recall or the standard machine learning measure of classification accuracy. The ndpm³ measure is a distance between the user's ranking of a set of documents and the system's ranking of the same documents, normalized to lie between 0 and 1. Space precludes a definition, but we will summarize the advantages of this new measure.

The notion of relevance as used in IR is problematic in general, and in particular for our domain. User studies have shown that users have difficulty making consistent relevance judgments over a long period of time when asked to rate documents on an absolute relevance scale (Lesk & Salton 1971; Saracevic 1975), and in fact this is a general feature of human judgments (Rorvig 1988). There will also be considerable disagreement between the judgments of different users (Saracevic 1995). The usual way to circumvent this problem is to test IR systems on standardized collections of documents and queries, so as to make recall and precision figures comparable.

Since our domain is the web, we cannot rely on using a standardized collection. Furthermore, since there is no query involved in the use of our system, the concept of relevance is inappropriate. Saracevic (1975) has drawn a distinction between the user's unarticulated information need and the query which results from that. Relevance is relative to the query, whereas pertinence is relative to the information need. In our system it is unnecessary to formulate a query, so we can only attempt to measure how well the information need is met.
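The ndpm measure introduced above compares two rankings using only relative judgments. The following sketch is our reading of Yao's (1995) definition, which the paper does not reproduce: over all document pairs where the user states a strict preference, a pair the system ranks in the opposite order counts 2, a pair the system ties counts 1, and the sum is normalized by twice the number of preference pairs:

```python
from itertools import combinations


def ndpm(user_scores, system_scores):
    """Normalized distance-based performance measure (after Yao 1995).

    Both arguments map document id -> score; only pairs where the user
    has a strict preference count. 0 means perfect agreement, 1 means
    every user preference is contradicted.
    """
    contradicted = tied = total = 0
    for a, b in combinations(user_scores, 2):
        u = user_scores[a] - user_scores[b]
        if u == 0:
            continue  # user indifferent: the pair carries no information
        total += 1
        s = system_scores[a] - system_scores[b]
        if s == 0:
            tied += 1  # system ties a pair the user distinguishes
        elif (u > 0) != (s > 0):
            contradicted += 1  # system reverses the user's preference
    return (2 * contradicted + tied) / (2 * total) if total else 0.0
```

Note that a system which ties everything scores 0.5 under this weighting: it is penalized for expressing no preference, but less than one which actively reverses the user's ordering.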
In contrast to the negative results from the user studies mentioned, it has been shown that human judges are good at making relative judgments given a collection of documents, and will be consistent over time and compared to other judges (Lesk & Salton 1971). The ndpm measure only requires relative judgments from users, and since there is no query involved the judgments relate to pertinence, not relevance. Thus it is more appropriate than precision and recall, which are based on absolute relevance judgments. Furthermore, directly comparing the user and system rankings gives us a single-valued measure, simpler to interpret than precision-recall graphs (interestingly, in the degenerate case where two-level rankings are used, ndpm can be shown to be a composite measure of recall and precision, and is related to several other measures proposed over the years; details in (Yao 1995)). Even apart from these considerations, recall is impossible to measure and difficult even to estimate when the document collection is the whole web.

An alternative approach would be to use classification accuracy as our performance measure. However, as with precision and recall, we would have to ask users to classify items on an absolute scale, which is prob-

³Normalized Distance-based Performance Measure