Flex Recs: Expressing and combining Flexible recommendations Georgia Koutrika Bercovitz. Hector Garcia-Molina Computer Science Depa Stanford University, Stanford, California, USA outr O, hector/@cs. stanford. edu ABSTRACT Recommendation systems have become very popular but most rec- bedded in the system code, not expressed declaratively. From ommendation methods are 'hard-wired into the system making ex the designer viewpoint, this fact makes it hard to modify the perimentation with and implementation of new recommendation algorithm, or to experiment with different approaches radigms cumbersome. In this paper, we propose FlexRecs,a No Flexibility: The recommendations provided are fixed. End framework that decouples the definition of a recommendation pro. users are given few choices. For example, a user may be unable cess from its execution and supports flexible recommendations over to request recommendations for movies that could be jointly structured data. In FlexRecs, a recommendation approach can be seen by her and her friend, or that her recommendations be defined declaratively as a high-level parameterized workflow com- based on what people in her age group are watching. Users rising traditional relational operators and new operators that gen may expect diverse recommendations in different contexts erate or combine recommendations. We describe a prototype flex Limited world model: Recommendation approaches deal with ible recommendation engine that realizes the proposed framework two types of entities, users and items(e. g, movies), represented and we present example workflows and experimental results that as sets of ratings or features. Providing recommendations using show its potential for capturing multiple, existing or novel, recom- richer data representations is not straightforward. For example mendations easily and having a flexible recommendation system a user may want recommendations for courses from users with that combines extensibility with reasonable performance. similar grades and similar ratings Categories and Subject Descriptors <. In this paper, we propose FlexRecs, a framework that allows flex- nendations to be easily defined, customiz H.3.3 (Information Storage and Retrieval Information Search cessed over structured data. FlexRecs(a)decouples the definition and Retrieval-Search process of a recommendation process from execution, (b)declaratively de- fines a recommendation process as a high-level workflow and (c) General terms enables generating any recommendations with the same engine A given recommendation approach can be expressed as a high- Algorithms, Languages, Performa level workflow, which may contain traditional relational operators such as select, project and join, plus new recommendation opera- Keywords tors that generate or combine recommendations. A workflow handle data(including recommendations) in relational form. flexible recommendations, recommendation operators, recommen- a designer can easily create multiple, customizable workflows dation queries for content-based, collaborative, as well as novel recommendation paradigms. The end user can select from them, depending on her 1. INTRODUCTION information needs. This selection is done through a gul, which Recommendation systems provide advice on movies also allows the user to enter parameters for workflows in order travel, leisure activities, and many other topics, and I urate and personalized recommendations. For in- ery popular in systems, such as Google News [10]. Al stance, the user may specify that her recommendations be based and MovieLens 19]. Since the appearance of the first on what people in her age group are watching. This choice gets dation systems [12, 22, 261, many recommendation approaches ranslated into a select condition, which is passed to the appropri- have been proposed both by the industry and academia. However ate workflow. This functionality is essentially similar to advanced most recommendation systems have a number of limitations searches: a designer builds a set of parameterized SQL queries End users can execute these queries choosing parameter values to receive different results through an advanced search interface ble recommendation en realizes the proposed framework. The system allows executie is granted without fee provided that copies are workflow over a conventional DBMs by"compiling" it into a se not made or di rofit or commercial advantage and that copies quence of SQL calls. The recommendation operators may call upon bear this notice and the full citation on the first page. To copy otherwise, to functions in a library that implement common tasks for generating publish, to post on servers or to redistribute to lists, requires prior specifi recommendations, such as computing the Jaccard or Pearson sim of two sets of objects, dictions. When possible, library functions are compiled into the
FlexRecs: Expressing and Combining Flexible Recommendations Georgia Koutrika, Benjamin Bercovitz, Hector Garcia-Molina Computer Science Department, Stanford University, Stanford, California, USA {koutrika, berco, hector}@cs.stanford.edu ABSTRACT Recommendation systems have become very popular but most recommendation methods are ‘hard-wired’ into the system making experimentation with and implementation of new recommendation paradigms cumbersome. In this paper, we propose FlexRecs, a framework that decouples the definition of a recommendation process from its execution and supports flexible recommendations over structured data. In FlexRecs, a recommendation approach can be defined declaratively as a high-level parameterized workflow comprising traditional relational operators and new operators that generate or combine recommendations. We describe a prototype flexible recommendation engine that realizes the proposed framework and we present example workflows and experimental results that show its potential for capturing multiple, existing or novel, recommendations easily and having a flexible recommendation system that combines extensibility with reasonable performance. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search Process General Terms Algorithms, Languages, Performance Keywords flexible recommendations, recommendation operators, recommendation queries 1. INTRODUCTION Recommendation systems provide advice on movies, products, travel, leisure activities, and many other topics, and have become very popular in systems, such as Google News [10], Amazon [17] and MovieLens [19]. Since the appearance of the first recommendation systems [12, 22, 26], many recommendation approaches have been proposed both by the industry and academia. However, most recommendation systems have a number of limitations: Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA. Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00. • Hard Wired: The recommendation algorithm is typically embedded in the system code, not expressed declaratively. From the designer viewpoint, this fact makes it hard to modify the algorithm, or to experiment with different approaches. • No Flexibility: The recommendations provided are fixed. End users are given few choices. For example, a user may be unable to request recommendations for movies that could be jointly seen by her and her friend, or that her recommendations be based on what people in her age group are watching. Users may expect diverse recommendations in different contexts. • Limited world model: Recommendation approaches deal with two types of entities, users and items (e.g., movies), represented as sets of ratings or features. Providing recommendations using richer data representations is not straightforward. For example, a user may want recommendations for courses from users with similar grades and similar ratings. In this paper, we propose FlexRecs, a framework that allows flexible recommendations to be easily defined, customized, and processed over structured data. FlexRecs (a) decouples the definition of a recommendation process from execution, (b) declaratively de- fines a recommendation process as a high-level workflow and (c) enables generating any recommendations with the same engine. A given recommendation approach can be expressed as a highlevel workflow, which may contain traditional relational operators such as select, project and join, plus new recommendation operators that generate or combine recommendations. A workflow can handle data (including recommendations) in relational form. A designer can easily create multiple, customizable workflows for content-based, collaborative, as well as novel recommendation paradigms. The end user can select from them, depending on her information needs. This selection is done through a GUI, which also allows the user to enter parameters for workflows in order to get more accurate and personalized recommendations. For instance, the user may specify that her recommendations be based on what people in her age group are watching. This choice gets translated into a select condition, which is passed to the appropriate workflow. This functionality is essentially similar to advanced searches: a designer builds a set of parameterized SQL queries. End users can execute these queries choosing parameter values to receive different results through an advanced search interface. We describe a prototype flexible recommendation engine that realizes the proposed framework. The system allows executing a workflow over a conventional DBMS by “compiling” it into a sequence of SQL calls. The recommendation operators may call upon functions in a library that implement common tasks for generating recommendations, such as computing the Jaccard or Pearson similarity of two sets of objects, or perform more fancy statistical predictions. When possible, library functions are compiled into the
Year, Term, InstrlD, Location, T ime Slot, Days what are yau coking far? Figure 1: Extract from the course database nts: in other cases we can rely on external functions led by the sQl statements. T 1.1 Motivation Our work on FlexRecs was motivated by, and has been imple- mented as part of CourseRank, a social tool we have developed 2: Fixed recommendations in Course Rank in our lab. CourseRank helps students in our university to make informed choices about classes and take advantage of the avail- proach X be more effective than approach Y in our environment? able learning options. It displays official university information What are the right weights for blending two recommendations(e.g and statistics, such as bulletin course descriptions, grade distri one based on what students like. and another based on what courses tions, and results of official course evaluations. Students can are more critical for completing a major)? What is the best way to anonymously rank courses they have taken, add comments, and predict, not good courses, but likely grades a student will obtain in rank the accuracy of each others'comments. They can also get future courses? Implementing many different recommendation ap- recommendations, organize their classes into a quarterly schedule proaches and their variants from scratch, possibly as different sy or devise a four year plan and track their progress. CourseRank also tem modules, can be time-consuming and counter-productive and functions as a feedback tool for faculty and administrators, ensur- does not lend to high code reusability. It also leads to a recommen ing that information is as accurate as possible. To support this func- dation system that is not easily expandable and manageable and it nakes experimenting with different recommendation possibilities the courses offered, the instructors, the students, the textbooks cumbersome and slow and so forth. Figure 1 provides a small snapshot of the database hema. The use inside the university, out of a total of about 14, 000 students. The describe recommendations. These descriptions would also need to vast majority of users are undergraduates, and there are only about handle parameters, so that end-users could at run time personal 7,000 undergraduates in our university. ize their recommendations, e.g., by choosing a selection condition, recommendations The need for flexibility and expressivity. Although the ini- ing Flex Recs framework has indeed made it possible to(a) easily tial version of Course Rank has been very popular [2], we received capture multiple recommendation paradigms and allow users dy nany requests, from students and administrators, for more flexible namically personalize them to fit their needs, (b)experiment with recommendations: As most commercial recommendation systems, novel recommendation strategies, (c)design a flexible and extensi our initial version offered no choices, just a list of recommended ble system and increase developer productivity: instead of adding courses for the student to consider. as shown in Figure 2. The new modules, a developer needs to define a new recommendation ure displays a general list of collaborative recommendations fo workflow or expand the library of reusable functions if a new com- particular student containing a Robotics course and a Spanish lan arison or prediction model is needed. The new version of Cours- guage course. If the student is interested in Spanish courses, she cRank employs the flexible recommendation engine to support di lay prefer to see more Spanish courses. If she is interested in ferent features and to let end-users tailor their recommendations French courses, she may not want to see any of our recommenda- We have also designed a user interface for flexible recommenda- tions. Students want to specify the type of course they are inter- tions in CourseRank [16 ested in(e.g, a biology class, something that satisfies the univer sity's writing requirement). They also want to request recommen- 1.2 Contributions dations based on a peer group they select (e. g, students in their major, or freshmen only)and based on different criteria For exam In summary, the contributions of our work are as follows: ple, a student may want recommendations for CS courses from C We propose decoupling the definition of a recommendation pro students with similar grades(i.e, with similar performance)and for cess from its execution and present FlexRecs, a framework for dance classes from students with similar ratings (i.e. with similar fining flexible recommendations over structured data. In Flex astes). In some cases, students want to see the recommendations Recs, a recommendation approach can be defined declaratively the system would give their best friend, not themselves s a high-level parameterized workflow comprising traditional The need for experimentation and higher productivity. As rec- relational operators ommendations comprise an integral part of the system, we want to We introduce an extend operator that generates a virtual experiment with different recommendation approaches: Would relation. For example, a tuple in an extended student rel ar stories are known from other domains [1, 5]. For y contain an attribute that gives all the courses taken by Amazon customer that once bought a book for his student. Although there is nothing new in the concept of a favor of milar to other kidsbooks buyers and is bound to see makes the whole framework work: extended relations simplify recommendations in his/her future transactions with the system. the definition of our other operator
Departments(DepID, DepCode, Name) Courses(CourseID, DepID, Title, Description, Units, Url) CourseSched(CourseID, Year, Term, InstrID, Location, TimeSlot, Days) Instructors(InstrID, Name, Url) Students(SuID, Name, Class, GPA, Status) StudentStudies(SuID, StudyPrgID) StudyPrograms(StudyPrgID, ProgramName, Classification, DepID) StudentHistory(SuID, CourseID, Year, Term, Grade, Rating) Comments(SuID, CourseID, Year, Term, Text, Rating, Date) Figure 1: Extract from the course database SQL statements; in other cases we can rely on external functions that are called by the SQL statements. 1.1 Motivation Our work on FlexRecs was motivated by, and has been implemented as part of CourseRank, a social tool we have developed in our lab. CourseRank helps students in our university to make informed choices about classes and take advantage of the available learning options. It displays official university information and statistics, such as bulletin course descriptions, grade distributions, and results of official course evaluations. Students can anonymously rank courses they have taken, add comments, and rank the accuracy of each others’ comments. They can also get recommendations, organize their classes into a quarterly schedule or devise a four year plan and track their progress. CourseRank also functions as a feedback tool for faculty and administrators, ensuring that information is as accurate as possible. To support this functionality, we maintain a database that stores rich information about the courses offered, the instructors, the students, the textbooks, and so forth. Figure 1 provides a small snapshot of the database schema. The system is already used by more than 9,000 students inside the university, out of a total of about 14,000 students. The vast majority of users are undergraduates, and there are only about 7,000 undergraduates in our university. • The need for flexibility and expressivity. Although the initial version of CourseRank has been very popular [2], we received many requests, from students and administrators, for more flexible recommendations: As most commercial recommendation systems, our initial version offered no choices, just a list of recommended courses for the student to consider, as shown in Figure 2. The figure displays a general list of collaborative recommendations for a particular student containing a Robotics course and a Spanish language course. If the student is interested in Spanish courses, she may prefer to see more Spanish courses. If she is interested in French courses, she may not want to see any of our recommendations1 . Students want to specify the type of course they are interested in (e.g., a biology class, something that satisfies the university’s writing requirement). They also want to request recommendations based on a peer group they select (e.g., students in their major, or freshmen only) and based on different criteria. For example, a student may want recommendations for CS courses from CS students with similar grades (i.e., with similar performance) and for dance classes from students with similar ratings (i.e., with similar tastes). In some cases, students want to see the recommendations the system would give their best friend, not themselves. • The need for experimentation and higher productivity. As recommendations comprise an integral part of the system, we want to experiment with different recommendation approaches: Would ap- 1 Similar stories are known from other domains [1, 5]. For instance, an Amazon customer that once bought a book for his 9-year-old will be considered a kids’ books fan or be mistakenly considered similar to other kids’ books buyers and is bound to see irrelevant recommendations in his/her future transactions with the system. Figure 2: Fixed recommendations in CourseRank proach X be more effective than approach Y in our environment? What are the right weights for blending two recommendations (e.g., one based on what students like, and another based on what courses are more critical for completing a major)? What is the best way to predict, not good courses, but likely grades a student will obtain in future courses? Implementing many different recommendation approaches and their variants from scratch, possibly as different system modules, can be time-consuming and counter-productive and does not lend to high code reusability. It also leads to a recommendation system that is not easily expandable and manageable and it makes experimenting with different recommendation possibilities cumbersome and slow. Faced with the above challenges, we need a way to declaratively describe recommendations. These descriptions would also need to handle parameters, so that end-users could at run time personalize their recommendations, e.g., by choosing a selection condition, or by weighting recommendations that are blended. The resulting FlexRecs framework has indeed made it possible to (a) easily capture multiple recommendation paradigms and allow users dynamically personalize them to fit their needs, (b) experiment with novel recommendation strategies, (c) design a flexible and extensible system and increase developer productivity: instead of adding new modules, a developer needs to define a new recommendation workflow or expand the library of reusable functions if a new comparison or prediction model is needed. The new version of CourseRank employs the flexible recommendation engine to support different features and to let end-users tailor their recommendations. We have also designed a user interface for flexible recommendations in CourseRank [16]. 1.2 Contributions In summary, the contributions of our work are as follows: • We propose decoupling the definition of a recommendation process from its execution and present FlexRecs, a framework for defining flexible recommendations over structured data. In FlexRecs, a recommendation approach can be defined declaratively as a high-level parameterized workflow comprising traditional relational operators and new recommendation operators. • We introduce an extend operator that generates a virtual nested relation. For example, a tuple in an extended student relation may contain an attribute that gives all the courses taken by that student. Although there is nothing new in the concept of a nested relation, this particular flavor of nested relation is what makes the whole framework work: extended relations simplify the definition of our other operators
We define recommend and blend operators that capture the essence ommend top N items to a user. Users and items are represented of most recommendation workflows. The operators can be com- by rudimentary profiles(e.g, as sets of ratings or keywords) and posed and combined with more traditional select, project, and recommendations are based on profile comparisons. For example The main challenge in designing these opera the content-based component of the Fab system [6] recommends ing the appropriate functionality and generality. web pages to users and represents web page content with the 100 care, one can easily come up with operators that most important words. Similarly, the Syskill Webert system are not useful for common recommendations represents documents with the 128 most informative words [211 or are unnecessarily complex While recommendations are based on a limited understanding of ovide several examples that illustrate how common rec- the world, many real applications use much richer data residing ndation strategies( found in literature or implemented in in databases. Different types of entities may co-exist in a single Rank) can be expressed as parameterized workflows with database(e.g, books, authors, publishers, and so forth) with sev- ur operators, as well as new recommendation types. eral attributes. Current approaches are not very expressive, e.g they cannot provide recommendations for different entities taking into account different subsets of their attributes realizes the proposed framework and executes multiple work The inherent limitations of recommendation systems have been flows transparently. Our strategy reduces the amount of data saved in temporary relations during execution. For instance, knowledged [5, 15] and some extensions have been recently pro- osed, such as incorporating multi-criteria ratings into recommen extended relations are never materialized. Furthermore, our dations [3]. The language RQL has been the first proposal that al- new operators can be compiled into standard SQL for execu tion. Hence recommendation workflows are converted into se- lows users to formulate recommendations in a flexible manner [41 However, RQL queries are not very expressive because they quences of standard SQL calls, which are then executed by a formulated on a pre-specified multidimensional cube of ratings conventional database system taking advantage of the underly ng query optimization. The recommendation operators may In contrast to existing works on recommendations, in this paper, call upon functions in a library. When possible, library func- we do not propose or extend a particular recommendation method. tions are also compiled into the sQL statements. We propose a general and open framework for expressing and pro- cessing fexible recommendations over relational data that covers We present experimental results that show the potential of our existing as well as allows capturing novel recommendation paradign approach for capturing multiple, existing or novel, recommen- As part of our effort, we have also implemented a flexible recom- dations easily and having a flexible recommendation system mendation interface in our system that allows students to request that combines extensibility with reasonable performance. For nd customize their recommendations [16] our evaluation, we use real data from our production system. 2. RELATED WORK 3. RECOMMENDATION FRAMEWORK Since the appearance of the first papers on collaborative filter- 3.1 Data model [12, 22, 261, a lot of work has been done from improving and evaluating recommendation methods [3, 13, 27] to designing trust- Large amounts of data reside in structured form, and particularl in relational form, such as e-store, university, corporate, and scien- thy systems [18, 20). In their core, these systems provide (a) tific data. Our own university database is relational too. Therefore, content-based recommendations for items similar to those the use we focus on databases that follow the relational model. which is preferred in the past [7, 21]or(b)collaborative recommendations enriched with some additional features for items that people with similar tastes liked [8, 22] or(c)h brid recommendations by combining recommendation methods [6, a database comprises a set of relations. A relation R(denoted 24]. Although a popular and highly researched area, recommen- Ri when more than one relation is implied) has a set A of at- tributes. We will use R. A, to refer to an attribute in A or simply dation systems present important limitations for users as well as A. when R is understood. A tuple r in R is a vector of single, researchers and developers [ 5] The recommendation algorithm as well as its inputs are hard refer to the value of A, in tuple r. An attribute that can be instan- menting with new methods can be time-consuming and counter tiated to a single value v is called base attribute. A relation with only base attributes is called base relation productive [15]. Furthermore, hard-wired algorithms generate only a predefined and fixed set of recommendations, which cannot ca In addition, we introduce the concept of an extended relation. ture the real-time user information needs. For example, in content- DEFINITIOn 1. An extended relation S has base attributes and based methods, the user is limited to see items that are similar to extended attributes. An extended attribute Ei(Al,. Ae)is instan those she liked in the past. On the other hand, collaborative filter- tiated to a relation with attributes A,.A ing methods provide serendipitous recommendations but they have problems, such as the inability to recommend new items [6]. There Given a tuple s in the relation S, s[Eu is a with attribute ave been many approaches that try to solve some of these prob Al,..Ae. If t is a tuple in this nested relation, then t[A, gives lems, such as filtering out items if they are too similar to those seen the value of attribute A, in this relation. We will use Ei. A, to refer before [7 or using genetic algorithms [27]. Several recommen to an attribute that is part of an extended attribute E, or simply Aj dation systems use a hybrid approach to avoid certain limitations when Ei is understood, and we call A, an embedded attribute of content-based and collaborative systems [6, 24]. However, of Consequently, under our semantics, the notion of an attribute ten different recommendations may be required under different cir- value has a broader meaning capturing scalar values as well as re- cumstances by different users. Current approaches do not support lations. We consider that only base relations can be part of a tuple fexible customizable recommendations allowing one level of nesting While our model and language could Furthermore, they deal with applications that have two types of be generalized to arbitrary nesting, we have not seen a need for entities, users and items(e.g, movies, web pages) and they rec- this generality in any of the practical recommendation scenarios we
• We define recommend and blend operators that capture the essence of most recommendation workflows. The operators can be composed and combined with more traditional select, project, and join operators. The main challenge in designing these operators is in capturing the appropriate functionality and generality. Without a lot of care, one can easily come up with operators that do not compose, are not useful for common recommendations or are unnecessarily complex. • We provide several examples that illustrate how common recommendation strategies (found in literature or implemented in CourseRank) can be expressed as parameterized workflows with our operators, as well as new recommendation types. • We describe a prototype flexible recommendation engine that realizes the proposed framework and executes multiple work- flows transparently. Our strategy reduces the amount of data saved in temporary relations during execution. For instance, extended relations are never materialized. Furthermore, our new operators can be compiled into standard SQL for execution. Hence, recommendation workflows are converted into sequences of standard SQL calls, which are then executed by a conventional database system taking advantage of the underlying query optimization. The recommendation operators may call upon functions in a library. When possible, library functions are also compiled into the SQL statements. • We present experimental results that show the potential of our approach for capturing multiple, existing or novel, recommendations easily and having a flexible recommendation system that combines extensibility with reasonable performance. For our evaluation, we use real data from our production system. 2. RELATED WORK Since the appearance of the first papers on collaborative filtering [12, 22, 26], a lot of work has been done from improving and evaluating recommendation methods [3, 13, 27] to designing trustworthy systems [18, 20]. In their core, these systems provide (a) content-based recommendations for items similar to those the user preferred in the past [7, 21] or (b) collaborative recommendations for items that people with similar tastes liked [8, 22] or (c) hybrid recommendations by combining recommendation methods [6, 24]. Although a popular and highly researched area, recommendation systems present important limitations for users as well as researchers and developers [5]. The recommendation algorithm as well as its inputs are hard wired in the system code. Designing, implementing and experimenting with new methods can be time-consuming and counterproductive [15]. Furthermore, hard-wired algorithms generate only a predefined and fixed set of recommendations, which cannot capture the real-time user information needs. For example, in contentbased methods, the user is limited to see items that are similar to those she liked in the past. On the other hand, collaborative filtering methods provide serendipitous recommendations but they have problems, such as the inability to recommend new items [6]. There have been many approaches that try to solve some of these problems, such as filtering out items if they are too similar to those seen before [7] or using genetic algorithms [27]. Several recommendation systems use a hybrid approach to avoid certain limitations of content-based and collaborative systems [6, 24]. However, often different recommendations may be required under different circumstances by different users. Current approaches do not support flexible, customizable, recommendations. Furthermore, they deal with applications that have two types of entities, users and items (e.g., movies, web pages) and they recommend top N items to a user. Users and items are represented by rudimentary profiles (e.g., as sets of ratings or keywords) and recommendations are based on profile comparisons. For example, the content-based component of the Fab system [6] recommends web pages to users and represents web page content with the 100 most important words. Similarly, the Syskill & Webert system represents documents with the 128 most informative words [21]. While recommendations are based on a limited understanding of the world, many real applications use much richer data residing in databases. Different types of entities may co-exist in a single database (e.g., books, authors, publishers, and so forth) with several attributes. Current approaches are not very expressive, e.g., they cannot provide recommendations for different entities taking into account different subsets of their attributes. The inherent limitations of recommendation systems have been acknowledged [5, 15] and some extensions have been recently proposed, such as incorporating multi-criteria ratings into recommendations [3]. The language RQL has been the first proposal that allows users to formulate recommendations in a flexible manner [4]. However, RQL queries are not very expressive because they are formulated on a pre-specified multidimensional cube of ratings. In contrast to existing works on recommendations, in this paper, we do not propose or extend a particular recommendation method. We propose a general and open framework for expressing and processing flexible recommendations over relational data that covers existing as well as allows capturing novel recommendation paradigms. As part of our effort, we have also implemented a flexible recommendation interface in our system that allows students to request and customize their recommendations [16]. 3. RECOMMENDATION FRAMEWORK 3.1 Data Model Large amounts of data reside in structured form, and particularly in relational form, such as e-store, university, corporate, and scientific data. Our own university database is relational too. Therefore, we focus on databases that follow the relational model, which is enriched with some additional features. A database comprises a set of relations. A relation R (denoted Ri when more than one relation is implied) has a set A of attributes. We will use R.Aj to refer to an attribute in A or simply Aj when R is understood. A tuple r in R is a vector of single, numerical or categorical, values. We will use the notation r[Aj ] to refer to the value of Aj in tuple r. An attribute that can be instantiated to a single value v is called base attribute. A relation with only base attributes is called base relation. In addition, we introduce the concept of an extended relation. DEFINITION 1. An extended relation S has base attributes and extended attributes. An extended attribute Ei(A1, ... Ae) is instantiated to a relation with attributes A1, ... Ae. Given a tuple s in the relation S, s[Ei] is a relation with attributes A1, ... Ae. If t is a tuple in this nested relation, then t[Aj ] gives the value of attribute Aj in this relation. We will use Ei.Aj to refer to an attribute that is part of an extended attribute Ei or simply Aj when Ei is understood, and we call Aj an embedded attribute. Consequently, under our semantics, the notion of an attribute value has a broader meaning capturing scalar values as well as relations. We consider that only base relations can be part of a tuple allowing one level of nesting. While our model and language could be generalized to arbitrary nesting, we have not seen a need for this generality in any of the practical recommendation scenarios we
Students a is a list of base. embedded or extended attributes from r If A contains only base attributes, then the resulting relation is always a base one. Join, Mc, creates a new relation by combining tuples in Ri Ext_Students SuD Name Comments(CourselD), Rating. I and R, that meet some condition c, which refers only to base Paul Lite C1 I 2Feb2008 attributes of the two relations. i.e R:∞xB1:={(r,r)r∈RAr;∈B;An,r; satisfy c 15Mar2007 12Dec2007 A common type of join is natural join, which connects two c56622Jn2007 relations by equating attributes of the same name, and project- c7722Jn20 ing out one copy of each pair of equated attributes. This is The base opera handle extended relations Figure 3: A base and an extended relation and be combined with the new recommendation operators to be de- fined next but we have kept their original semantics. One could go have studied and implemented and we do not want to unnecessarily further and define operators with extended semantics. For example burden our design one could define that the select operator operates also on extended Example: We consider the course database with the base rela- attributes. There is a rich literature on nested relational algebras tions shown in Figure 1. figure 3 shows an instance of the Students that discuss extended, recursive operations such as nesting, unn base relation. We can define the extended relation Ext_ Students that g and projections and joins on embedded relations [9, 11, 25 contains all the base attributes of a student plus an extended at- We believe that such generality is not necessary for practical rec- tribute that represents the ratings made by each student as a singl ommendations. Our framework can be easily extended to handle "unit of information"per student more generalized relational operators Ext_Students(SulD, Name, Comments( CourselD, Rating, Date)) 3.3 The Extend Operator Figure 3 shows an instance for this relation. To refer to the ex with many(normalized) database schemas, information that con- tended attribute we use Ext Students. Comments, and to the rating at ceptually refers to a single entity(e. g, a student's course ratings) tribute within Ext._ Students, we use Comments. Rating or simply Rating. is often found in several relations. In our s. ded)relation.For we want to repre- In a similar way, we can define other extended relations, such sent such application entities with a single(e the extended relation Ext Courses that extends the base course a tuple in an extended relation could contain base infor- relation with an attribute that shows all the instructors per year and n a student(e. g, name, GPA), plus the set of courses a term that a course was taught as taken. For this purpose, we introduce a new operator Ext_ Courses(CourselD, DepID, Title, Instructors( Name, Year, Term)) es such extended attributes in the tuples of a relation In Section 3.3, we will show how these example extended rela- DEFINITION 2. Extend, E, creates an extended relation by em tions can be defined using the extend operator bedding a base relation Ri into another relation Ri, i.e. Extended relations can be thought of as"views e;:={(r;,v)lr∈R:At:=A(n凶R;)} group together information related to an individual resent it as a single tuple that can be easily handled by the recom- where A is the set of attributes of R mendation operators irrespective of the database structure (as we In words, Ri E Ri is an extended relation that contains all the will see in Sections 3.4-3.6). attributes and tuples from Ri plus an extended attribute whose(ex- Although the issue of whether extended relations are material- tended)value for each tuple ri from Ri is a set of tuples from R ized or not is orthogonal to their definition, in practice, extended that join to ri on the common attributes. If there are no joining relations may not be stored in the database. In the following sub- tuples, then the extended attribute will be an empty relation for ri ections, we describe the operators required to handle extended re- The default name of this extended attribute is the name of R, lations and formulate recommendations. We start with the standard In the above definition, only R is required to be a bas (core)relational algebra operators with set semantics and we extend because we allow one-level nesting. Ri can be a base or them enabling interaction with extended relations as well relation, since any extended relation can have more than ended attribute. The extend operator can be defined for 3.2 Base Operators level nesting as well. We consider the following operators that can operate on base and Example: Using the extend operator we can generate the ex- Select, ac, selects tuples from relation R, for which the con- PComments: =(SulD, CourselD, Rating. Date)(Comments) dition c holds. ie Ext_Students : T (SulD, Name)(Students) PComments c(B):={r|r∈R∧ r satisfies c} R can be a base or extended relation and c is a condition that combine information found in more than two relations as follows. refers to base attributes of R. o(R)will be a base or extended relation depending on the type of R. Schedules T(CourselD, ear, Term, strid)( Course Sched) reates a new relation by projecting R into a Ext_ Courses =T(CourselD, DeplD, Tte( Courses) Instruct_ Sched aller set of its attributes. i.e. Note that joining over attributes with common names simplifies 丌A(B):={r{A]r∈R the presentation and the definition of the extend operator. It does
Students SuID Name Class Status 1 2 Paul Little John Doe 2009 2010 H N Ext_Students SuID Name Comments (CourseID, Rating, Date) 1 2 Paul Little John Doe C1 C2 5 6 2 Feb 2008 3 Dec 2007 C1 C2 5 6 15 Mar 2007 12 Dec 2007 C5 6.6 22 Jun 2007 C7 7 22 Jun 2007 Figure 3: A base and an extended relation have studied and implemented and we do not want to unnecessarily burden our design. Example: We consider the course database with the base relations shown in Figure 1. Figure 3 shows an instance of the Students base relation. We can define the extended relation Ext_Students that contains all the base attributes of a student plus an extended attribute that represents the ratings made by each student as a single “unit of information” per student: Ext_Students(SuID, Name, Comments(CourseID, Rating, Date)) Figure 3 shows an instance for this relation. To refer to the extended attribute, we use Ext_Students.Comments, and to the rating attribute within Ext_Students, we use Comments.Rating or simply Rating. In a similar way, we can define other extended relations, such as the extended relation Ext_Courses that extends the base course relation with an attribute that shows all the instructors per year and term that a course was taught: Ext_Courses(CourseID, DepID, Title, Instructors(Name, Year, Term)) In Section 3.3, we will show how these example extended relations can be defined using the extend operator. Extended relations can be thought of as “views” that collect and group together information related to an individual entity and represent it as a single tuple that can be easily handled by the recommendation operators irrespective of the database structure (as we will see in Sections 3.4 − 3.6). Although the issue of whether extended relations are materialized or not is orthogonal to their definition, in practice, extended relations may not be stored in the database. In the following subsections, we describe the operators required to handle extended relations and formulate recommendations. We start with the standard (core) relational algebra operators with set semantics and we extend them enabling interaction with extended relations as well. 3.2 Base Operators We consider the following operators that can operate on base and extended relations. • Select, σc, selects tuples from relation R, for which the condition c holds, i.e., σc(R) := {r|r ∈ R ∧ r satisfies c} R can be a base or extended relation and c is a condition that refers to base attributes of R. σc(R) will be a base or extended relation depending on the type of R. • Project, πA, creates a new relation by projecting R into a smaller set of its attributes, i.e., πA(R) := {r[A]|r ∈ R } A is a list of base, embedded or extended attributes from R. If A contains only base attributes, then the resulting relation is always a base one. • Join, ./c, creates a new relation by combining tuples in Ri and Rj that meet some condition c, which refers only to base attributes of the two relations, i.e., Ri ./cRj := {(ri, rj )|ri ∈ Ri ∧ rj ∈ Rj ∧ ri, rj satisfy c} A common type of join is natural join, which connects two relations by equating attributes of the same name, and projecting out one copy of each pair of equated attributes. This is denoted Ri ./ Rj (the condition c is implied.) The base operators defined above can handle extended relations and be combined with the new recommendation operators to be de- fined next but we have kept their original semantics. One could go further and define operators with extended semantics. For example, one could define that the select operator operates also on extended attributes. There is a rich literature on nested relational algebras that discuss extended, recursive operations such as nesting, unnesting and projections and joins on embedded relations [9, 11, 25]. We believe that such generality is not necessary for practical recommendations. Our framework can be easily extended to handle more generalized relational operators. 3.3 The Extend Operator With many (normalized) database schemas, information that conceptually refers to a single entity (e.g., a student’s course ratings) is often found in several relations. In our system, we want to represent such application entities with a single (extended) relation. For instance, a tuple in an extended relation could contain base information on a student (e.g., name, GPA), plus the set of courses a student has taken. For this purpose, we introduce a new operator that creates such extended attributes in the tuples of a relation. DEFINITION 2. Extend, ε, creates an extended relation by embedding a base relation Rj into another relation Ri, i.e., Ri ε Rj := {(ri, v)|ri ∈ Ri ∧ v := πA(ri ./ Rj )} where A is the set of attributes of Rj . In words, Ri ε Rj is an extended relation that contains all the attributes and tuples from Ri plus an extended attribute whose (extended) value for each tuple ri from Ri is a set of tuples from Rj that join to ri on the common attributes. If there are no joining tuples, then the extended attribute will be an empty relation for ri. The default name of this extended attribute is the name of Rj . In the above definition, only Rj is required to be a base relation because we allow one-level nesting. Ri can be a base or extended relation, since any extended relation can have more than one extended attribute. The extend operator can be defined for multiplelevel nesting as well. Example: Using the extend operator we can generate the extended relation Ext_Students described earlier as follows: PComments := π{SuID,CourseID,Rating,Date}(Comments) Ext_Students := π{SuID,Name}(Students) ε PComments In order to generate the extended relation Ext_Courses, we need to combine information found in more than two relations, as follows: Schedules := π{CourseID,Year,Term,InstrID}(CourseSched) Instruct_Sched := π{InstrID,Name}(Instructors) ./ Schedules Ext_Courses := π{CourseID,DepID,Title}(Courses) ε Instruct_Sched Note that joining over attributes with common names simplifies the presentation and the definition of the extend operator. It does
nationspaond antr retes thas can be used witp esis iye rat or e relations and attributes can be renamed accordingly. We can use Probability[A, B, RI(t, s)=oRA=LLAI(R)DAB OR A=sAJ(R)I a rename operator PIR(B1,.Bn)(R)that produces a relation JoR.A=s(Aj(R) identical to but with name R and attributes. in order. named This function computes the number of pairs of tuples from a re B1,... Bn. For example, we can rename COmments to Comments in lation R with the same value on an attribute B and with t(a as order to have the relation Ext Students as shown in Figure 3 the value of the attribute A in one of the pair's tuples and s(A Ext_Students: T(SMD, Name)(Students)E P[Comments(PComments) as the value in the other pair tuple divided by the number of tu- les in R with s a as the value of the attribute A. For example 3.4 The Recommend operator Probability[CourselD, SulD, StudentHistory(t, s)finds the similarity etween two courses based on the number of students (identified by Recommendations are based on comparisons(e.g rated by comparing their topics to a student 's interests, students are their SulD)that took both courses t and s(identified by their course mpared to a particular student based on their ratings in order to id CourselD)divided by the total number of students that took s find similar students, and so forth). In order to"build" recommen- for comparing two extended values, each one belonging to one of lations, we will have a library of c on les omparison functions that imple- the input tuples, using some metric, such as the Euclidean distance, ks of the form defined belo Pearson coefficient. etc. DEFINITION 3. A comparison function cf is a parameterized function that maps a tuple t to a single scalar value u by comparing it to another tuple s. Euclidean(E, A1,A2(t,s=2([]-jlA2) u=cfIP(t, s) i∈fE 135=1 where t and s are the two inputs to be compared and P denotes the parameters used in the function. This function compares two tuples t and s on their extended at- tribute E by summing up the squared differences of the values of For example, we may wish to apply a comparison function to attribute A, in all tuples i and j from the nested relations t(E ome part of two tuples, i.e to an attribute A of them; then, we and s[E], respectively, that join on Al. Then, we could, for in- would specify cf[](t, s). It could also be to a set of attributes for stance, use Euclidean(Comments, CourselD, RatingI(t, s)to compare multi-criteria recommendations [4]. We do not make any particular students based on their common ratings, where students t ands are ssumptions for the tuples t and s, e.g, that they should have the taken from Ext._Students ame number of attributes or that corresponding attributes should ave the same type, allowing in this way a wide spectrum of func- Comparisons of single values to extended values. We could defin tions to be defined, as the forthcoming examples will illum a comparison function that compares two values of different e. g, a single value to an extended value, i. e, to a relation E. For function that computes a relationship between its two inputs based example, we could define a function that tries to locate a single n either probabilistic approaches(e.g,[14 )or more traditional tem-to-item correlations. such as pearsons correlation Kendalls attribute B in the tuple where the value was found, i.e. tau, Euclidean distance, and Jaccard index, as well as user-defined Identify[A,E,B(t,s)=叫[Bif彐∈sEst.t4]=v[4 lomain-specific metrics. It could also be a prediction model learn For instance. if t is a course and s is a student from Ext Students offline(e.g, (81). We illustrate several possibilities below Identify[CourselD, Comments, Ratingl(t, s)returns the rating of Comparisons of string values. For example, we could consider dent s for course t based on the course id. In Section 3.6, we will a function that measures the Jaccard similarity between two string ee examples that illustrate how these functions can be used for values, each one belonging to one of the input tuples of the function generating recommendations and treated, for the purposes of comparison, as a bag of words, i.e Comparison functions compare one tuple to another tuple. It is Jaccard[4(t,s)=14】ns4 frequently desirable to compare one tuple to a set of tuples, e. g |t[4Us[4]l a student to a group of students. For this purpose, we define an Using this function, we could, for example, compare course de- aggregation comparison function scriptions using Jaccard[ DescriptionI(t, s), where t and s represent DEFINITION 4. An aggregation comparison function a is a pa- ourses, or compare instructors on the basis of their names with rameterized function that maps a tuple t to a single scalar value Jaccard[Namel(t, s), where t and s are two tuples in Instructors by comparing it to a set of tuples s Comparisons of numerical values. For example, we could use a iunction that measures the distance of two numbers. i.e.. u=aef, PI(t, S)=aPef(t, si)ls: E Sn) Essentially, a takes as parameter the comparison func Distance[A(t, s)=t[A-s[Al the tuple-to-tuple comparisons and generates a list of comparisor Such a function could be used to compare students based on their function values cf(t, s1),cf(t, sn), Vsi E S, which are then GPAs,i.e,Distance (GPAl(t, s), where t and s belong to Students, combined through a into a single value or to compare courses based on their units, i.e., Distance[ Units(t, s) An aggregation comparison function combines all partial values where t and s represent courses. into a final one using a function, such as the maximum, the average. Using conditional probabilities. An alternate way of computing and so forth, and essentially signifies the relationship of tuple t the similarity between two tuples is to use a measure that is based to the set of tuples s. P denotes other parameters to be used in of th the function. For example, we may wish to compute the weig that the other tuple has already been observed. For example, a sim- average of the partial comparison values using as weights the ple comparison function that computes conditional probability is of attribute A of the tuples in S
not impose any real constraints on its expressivity or on the relations and attributes that can be used with this operator because relations and attributes can be renamed accordingly. We can use a rename operator ρ[R 0 (B1, ...Bn)](R) that produces a relation identical to R but with name R 0 and attributes, in order, named B1, ... Bn. For example, we can rename PComments to Comments in order to have the relation Ext_Students as shown in Figure 3: Ext_Students := π{SuID,Name}(Students) ε ρ[Comments(PComments)] 3.4 The Recommend Operator Recommendations are based on comparisons (e.g., courses are rated by comparing their topics to a student’s interests, students are compared to a particular student based on their ratings in order to find similar students, and so forth). In order to “build” recommendations, we will have a library of comparison functions that implement common recommendation tasks of the form defined below. DEFINITION 3. A comparison function cf is a parameterized function that maps a tuple t to a single scalar value u by comparing it to another tuple s. u = cf[P](t, s) where t and s are the two inputs to be compared and P denotes the parameters used in the function. For example, we may wish to apply a comparison function to some part of two tuples, i.e., to an attribute A of them; then, we would specify cf[A](t, s). It could also be to a set of attributes for multi-criteria recommendations [4]. We do not make any particular assumptions for the tuples t and s, e.g., that they should have the same number of attributes or that corresponding attributes should have the same type, allowing in this way a wide spectrum of functions to be defined, as the forthcoming examples will illuminate. Also, we do not assume any properties for cf. It could be any function that computes a relationship between its two inputs based on either probabilistic approaches (e.g., [14]) or more traditional item-to-item correlations, such as Pearson’s correlation, Kendall’s tau, Euclidean distance, and Jaccard index, as well as user-defined, domain-specific metrics. It could also be a prediction model learnt offline (e.g., [8]). We illustrate several possibilities below. • Comparisons of string values. For example, we could consider a function that measures the Jaccard similarity between two string values, each one belonging to one of the input tuples of the function and treated, for the purposes of comparison, as a bag of words, i.e.: Jaccard[A](t, s) = |t[A] T s[A]| |t[A] S s[A]| Using this function, we could, for example, compare course descriptions using Jaccard[Description](t, s), where t and s represent courses, or compare instructors on the basis of their names with Jaccard[Name](t, s), where t and s are two tuples in Instructors. • Comparisons of numerical values. For example, we could use a function that measures the distance of two numbers, i.e.: Distance[A](t, s) = t[A] − s[A] Such a function could be used to compare students based on their GPAs, i.e., Distance[GPA](t, s), where t and s belong to Students, or to compare courses based on their units, i.e., Distance[Units](t, s), where t and s represent courses. • Using conditional probabilities. An alternate way of computing the similarity between two tuples is to use a measure that is based on the conditional probability of observing one of the tuples given that the other tuple has already been observed. For example, a simple comparison function that computes conditional probability is: Probability[A, B, R](t, s) = |σR.A=t[A] (R) ./B σR.A=s[A] (R)| |σR.A=s[A] (R)| This function computes the number of pairs of tuples from a relation R with the same value on an attribute B and with t[A] as the value of the attribute A in one of the pair’s tuples and s[A] as the value in the other pair tuple divided by the number of tuples in R with s[A] as the value of the attribute A. For example, Probability[CourseID, SuID, StudentHistory](t, s) finds the similarity between two courses based on the number of students (identified by their SuID) that took both courses t and s (identified by their course id CourseID) divided by the total number of students that took s. • Comparisons of extended values. We could also define functions for comparing two extended values, each one belonging to one of the input tuples, using some metric, such as the Euclidean distance, Pearson coefficient, etc. For example, a comparison function using the Euclidean distance is: Euclidean[E, A1, A2](t, s) = vuuuut X i∈t[E] j∈s[E] i[A1]=j[A1] (i[A2] − j[A2])2 This function compares two tuples t and s on their extended attribute E by summing up the squared differences of the values of attribute A2 in all tuples i and j from the nested relations t[E] and s[E], respectively, that join on A1. Then, we could, for instance, use Euclidean[Comments, CourseID, Rating](t, s) to compare students based on their common ratings, where students t and s are taken from Ext_Students. • Comparisons of single values to extended values. We could define a comparison function that compares two values of different types, e.g., a single value to an extended value, i.e., to a relation E. For example, we could define a function that tries to locate a single attribute value within E, and if it finds it, it returns the value of an attribute B in the tuple where the value was found, i.e.: Identify[A, E, B](t, s) = v[B] if ∃v ∈ s[E] s.t. t[A] = v[A] For instance, if t is a course and s is a student from Ext_Students, Identify[CourseID, Comments, Rating](t, s) returns the rating of student s for course t based on the course id. In Section 3.6, we will see examples that illustrate how these functions can be used for generating recommendations. Comparison functions compare one tuple to another tuple. It is frequently desirable to compare one tuple to a set of tuples, e.g., a student to a group of students. For this purpose, we define an aggregation comparison function. DEFINITION 4. An aggregation comparison function a is a parameterized function that maps a tuple t to a single scalar value u by comparing it to a set of tuples S: u = a[cf, P](t, S) = a[P]({cf(t, si)|si ∈ S}) Essentially, a takes as parameter the comparison function to use for the tuple-to-tuple comparisons and generates a list of comparison function values cf(t, s1), ...cf(t, sn), ∀si ∈ S, which are then combined through a into a single value. An aggregation comparison function combines all partial values into a final one using a function, such as the maximum, the average, and so forth, and essentially signifies the relationship of tuple t to the set of tuples S. P denotes other parameters to be used in the function. For example, we may wish to compute the weighted average of the partial comparison values using as weights the values of attribute A of the tuples in S: