VI
vi
knoW LE D GEM TS This re search was con ducted under the supervision of Yoav Shoham, whose intellec- tual rigor and insight have been a con st ant source of en couragement and inspiration Valuable perspectives and gui dance h ave al so been provided by Terry Winograd and Rob Barrett, who both induced greater clarity of thinking and present ation Daphne Koller and Cliff Nass supplied helpful pointers to related areas of re search as members of my orals committee. A thorough groun ding in AI systems and archi lectures was provided by earlier advisors Carole Klein at Cambridge University and both barbara haves-Roth and Nils Nilsson at St an ford Universit ddition, Andy Hopper at Cambridge University provided the initial en couragement to un dertake doctoral rese arch Over the ne arly six years I have been at the Depart ment of Computer Scien ce at St anford University. t here have been const ant valuable interactions with members of the various research groups focussed on creating AI agents(or 'bots) which reason or learn(Alot s, Bots, Nobotics and Nobot s), as well as the People, Computers and Design group focus sed on interaction design, and, most recently, the Digit al Libraries Project, working on many different technologies in support of information seeking tasks. Outside of St anford, I have gained a wider perspective on interaction de sign and prototyping from Mik Lamming and Mike Flynn at the Xerox Research Center in Cambridge, and a better underst anding of real-world technology design and development from James Rucker and Marcos Polan co at Imana in San francisco
Acknowledgements This research was conducted under the supervision of Yoav Shoham, whose intellectual rigor and insight have been a constant source of encouragement and inspiration. Valuable perspectives and guidance have also been provided by Terry Winograd and Rob Barrett, who both induced greater clarity of thinking and presentation. Daphne Koller and Cli Nass supplied helpful pointers to related areas of research as members of my orals committee. A thorough grounding in AI systems and architectures was provided by earlier advisors Carole Klein at Cambridge University and both Barbara Hayes-Roth and Nils Nilsson at Stanford University; in addition, Andy Hopper at Cambridge University provided the initial encouragement to undertake doctoral research. Over the nearly six years I have been at the Department of Computer Science at Stanford University, there have been constant valuable interactions with members of the various research groups focussed on creating AI agents (or `bots) which reason or learn (AIbots, Bots, Nobotics and Nobots), as well as the People, Computers and Design group focussed on interaction design, and, most recently, the Digital Libraries Pro ject, working on many dierent technologies in support of information seeking tasks. Outside of Stanford, I have gained a wider perspective on interaction design and prototyping from Mik Lamming and Mike Flynn at the Xerox Research Center in Cambridge, and a better understanding of real-world technology design and development from James Rucker and Marcos Polanco at Imana in San Francisco. vii
Funding was provided by nsF grant IRI-9503109, the NSF/DARPA/NASA Dig ital Libr ary project(NSF IRI-9411306, NSF grant IRI-9220645, and ArPa grant F49620-94-1-0900. The implementations to be described were con structed with the help of a lot of free Software. Thanks. therefore are also cue to the authors and maintainerS of Apache, Sub Arctic [Hudson and Smith, 1996, Berkeley DB[Seltzer andYigit, 1991, mSQL [Hughes, 1993], CH ACO [Hendrick Son and Leland, 1994]and above all the Python programming langu age [van Rossum, 1995 A number of user tests were performed as part of this work. I am grateful to all of the people who took the time to partici pate, either on the web or in perSon here at stanford rrentlydocumentedathttp://www.apache.org
Funding was provided by NSF grant IRI-9503109, the NSF/DARPA/NASA Digital Library pro ject (NSF IRI-9411306), NSF grant IRI-9220645, and ARPA grant F49620-94-1-0900. The implementations to be described were constructed with the help of a lot of free software. Thanks, therefore, are also due to the authors and maintainers of Apache1 , SubArctic [Hudson and Smith, 1996], Berkeley DB [Seltzer and Yigit, 1991], mSQL [Hughes, 1993], CHACO [Hendrickson and Leland, 1994] and above all the Python programming language [van Rossum, 1995]. A number of user tests were performed as part of this work. I am grateful to all of the people who took the time to participate, either on the Web or in person here at Stanford. 1Currently documented at http://www.apache.org viii
Istrat ACknowledgements Introductio 1.1 The recommen dation tas esearch issues 1.2.1 How to exploit overlaps between userS'interests 1.2.2 How to decide upon the composition of recommendation sets 1.3 Summary of contributions 1.4 Overview of the remainder of this thesis 10 2 BcG lound ad definitions 2. 1 Representing text 2.1.1 Vector space model 2. 1. 2 Special con siderations when gath ering Web pages 2.13 Alternative features 2.2 Repr eventing users 2.2.1 Prefer ence rankin gs: Definition and notation 21 2.2 Information need and relevance
Contents Abstract v Acknowledgements vii 1 Introduction 1 1.1 The recommendation task ........................ 3 1.2 Research issues .............................. 6 1.2.1 How to exploit overlaps between users' interests . . . . . . . . 6 1.2.2 How to decide upon the composition of recommendation sets . 7 1.3 Summary of contributions ........................ 8 1.4 Overview of the remainder of this thesis ................ 10 2 Background and Denitions 13 2.1 Representing text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Vector space model . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.2 Special considerations when gathering Web pages ....... 17 2.1.3 Alternative features . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Representing users ............................ 21 2.2.1 Preference rankings: Denition and notation . . . . . . . . . . 21 2.2.2 Information need and relevance ................. 23 ix
2.2.3 USERS'jUDGM ENIS OF DOCUM ENIS 2.2.4 USERPeCfIES 2.2.5 SIM UIATING USERS' DOCUM ENT FANKNG BEHAVIOR 3 EVALUATIONS 2.3.1 RELEVANCE BASED EVALUATION 222333 2.3.2 EVAIUATION USING DISIANCE BETWEN FANK 2.3.3 EXPERIM ENIALM EIHODOIOGY 2. 4 INIEFACIIONS 2.4.1 INP UIS: USER FFEDBAC k 2.4.2 OUIP UIS: RECOM M ENDATION SEIS 42 2.5 SUM MARY OF NCIATION 3 Single Agent Content-B ased recommendation 3.1 DESIGN 3.1.1 colLeCtION 3.12 SELECtiON 3.1.3 LEARNING 3. 2 IMPIEM ENIATION 3. 3 SIM ULATION 3. 4 EXPERIMENIS 3.4.1 CS227 CIASS PRCIECIS 5556 3. 4.2 TESIS OF LIRA 3.43 SIM ULATION HESUIIS 3.5 PHOBIEMS WIH THE PURE CONIENF BASED APPROACH 4 Collaborative recommendation 4.1 DESI 74 4. 2 A BRIEF SUFVEY
2.2.3 Users' judgments of documents ................. 25 2.2.4 User proles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2.5 Simulating users' document ranking behavior ......... 29 2.3 Evaluations ................................ 32 2.3.1 Relevance-based evaluation . . . . . . . . . . . . . . . . . . . . 32 2.3.2 Evaluation using distance between rankings .......... 33 2.3.3 Experimental methodology . . . . . . . . . . . . . . . . . . . . 35 2.4 Interactions ................................ 38 2.4.1 Inputs: user feedback. . . . . . . . . . . . . . . . . . . . . . . 39 2.4.2 Outputs: recommendation sets ................. 42 2.5 Summary of notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3 Single Agent Content-Based Recommendation 45 3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.1.1 Collection ............................. 48 3.1.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.1 CS227 Class Pro jects . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.2 Tests of LIRA . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.4.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.5 Problems with the pure content-based approach ............ 71 4 Collaborative Recommendation 73 4.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.2 A brief survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 x