Of note is that very few systems in the recommender system literature perform user trials using real users. To test classifier accuracy, most use either labelled benchmark document collections, such as the Reuters news feed collection, or logged user data, such as Usenet logs.

1.6 Overview of approach

Our ontological approach to recommender systems uses a hybrid recommender system, employing both collaborative and content-based recommendation techniques and representing user profiles in ontological terms. Two experimental systems have been built that follow this approach, called Quickstep and Foxtrot. Quickstep is a recommender system for a set of researchers within a computer science laboratory, while Foxtrot is a searchable database and recommender system for a computer science department. Figure 1 shows the generic structure of our ontological recommender systems.

[Figure 1 components: World-Wide Web, Web Browser, Web Proxy, Profiler, Classifier, Recommender, Ontology, Research Paper Database, Search, Recommendation Page, Email, Visualized Profile]
Fig. 1. Our ontological approach to recommender systems

A web proxy is used to unobtrusively monitor each user’s web browsing, adding new research papers to the central database as users discover them. The research paper database thus acts as a pool of shared knowledge, available to all users via search and recommendation. The database of research papers is classified using a research paper topic ontology and a set of training examples.

Recorded web browsing and relevance feedback elicited from users are used to compute daily profiles of users’ research interests. Interest profiles are represented in ontological terms, allowing other interests to be inferred that go beyond those seen in directly observed behaviour.
The interest profiles are visualized to allow elicitation
of direct profile feedback, providing an additional source of information from which profiles can be computed.

Recommendations are compiled daily using collaborative filtering techniques to find sets of interesting papers. These papers are then constrained to match the top topics of interest within the content-based profiles. The papers left are used to create the recommendations.

Users can view their recommendations via a web page or weekly email message, look at and comment on visualizations of their profile via a web page, or just search the research paper database for specific papers of interest. Quickstep, the earlier system, supports only web page recommendation, while Foxtrot supports all the interface features.

1.7 Empirical evaluation

This paper describes three experiments performed using our two recommender systems.

The first uses the Quickstep system to measure the effectiveness of using ontological inference in user profiling. Two 1.5-month trials were run using 24 members from the IAM research laboratory, comparing the use of ontological profiles and inference to that of unstructured profiles.

The second experiment integrated the Quickstep system with an external personnel and publication ontology. This experiment measured how effectively an external ontology can bootstrap a recommender system to reduce the recommender system cold-start problem. Behaviour logs from the previous experiment were used as the basis for this evaluation.

The third experiment took the Foxtrot recommender system and measured its overall effectiveness and the performance increase obtained when profiles are visualized and profile feedback acquired. A trial was run using 260 staff and students from the computer science department of the University of Southampton for an academic year, comparing the performance of those subjects who provided profile feedback to those who did not.

2. ONTOLOGICAL USER PROFILING AND PROFILE INFERENCE

Our ontological approach to recommender systems, shown in figure 2, involves various sub-processes. Our first experimental recommender system, called Quickstep [Middleton et al. 2001], implements all these processes but with just a web page interface. Quickstep is thus just a recommender system, without any search, email or visualization facilities. It was built to help researchers in a computer science laboratory setting, representing user profiles with a research topic ontology and using ontological inference to assist the profiling process. An experiment was run to compare the recommendation performance for subjects whose profiler used ontological inference with those whose profiler did not.
2.1 Overview of the Quickstep recommender system

Quickstep unobtrusively monitors user browsing behaviour via a web proxy, logging each URL browsed during normal work activity. A machine-learning algorithm classifies browsed URLs overnight, using classes within a research paper topic ontology, and saves each classified paper in a central paper store. Explicit relevance feedback and browsed topics form the basis of the interest profile for each user. Is-a relationships within the research paper topic ontology are also exploited to infer general interests when specific topics are observed.

Each day a set of recommendations is computed, based on correlations between user interest profiles and classified paper topics. These recommendations are accessible to users via a web page. Any feedback offered on these recommendations is recorded when the user looks at them. Users can provide new examples of topics and correct paper classifications where wrong. In this way the training set improves over time as well as the profiles.

[Figure 2 components: World-Wide Web, Web Browser, Web Proxy, Profiler, Classifier, Recommender, Ontology, Research Paper Database, Recommendation Page]
Fig. 2. The Quickstep system

2.2 Approach of the Quickstep recommender system

The Quickstep system uses a Java-based web proxy, which records time-stamped URLs for each user. This proxy could handle about 30 users. The system ran on a Solaris platform and was mostly written in Java.

2.2.1 Ontology

The research paper topic ontology is based on the computer science classifications made by the dmoz open directory project [dmoz] and some minor customisations. We chose to re-use an existing taxonomy to speed development time and provide a potential
route for system integration with other external ontologies in the future. Our simple ontology holds is-a relationships between research paper topics, and has 27 classes; for the second trial this ontology was extended to 32 classes. Figure 3 shows a section from the ontology. Pre-trial interviews determined which additional topics would be added to the ontology to customize it for the target researchers. An expert review by two domain experts validated the ontology for correctness before use in our experiment.

[Figure 3 topic classes: Artificial Intelligence, Hypermedia, E-Commerce, Interface Agents, Mobile Agents, Multi-Agent-Systems, Recommender Systems, Agents, Belief Networks, Fuzzy, Game Theory, Genetic Algorithms, Genetic Programming, Knowledge Representation, Information Filtering, Information Retrieval, Machine Learning, Natural Language, Neural Networks, Philosophy [AI], Robotics [AI], Speech [AI], Vision [AI], Text Classification, Ontologies, Adaptive Hypermedia, Hypertext Design, Industrial Hypermedia, Literature [hypermedia], Open Hypermedia, Spatial Hypertext, Taxonomic Hypertext, Visualization [hypertext], Web [hypermedia], Content-Based Navigation, Architecture [open hypermedia]]
Fig. 3. Section from the Quickstep research paper topic ontology

2.2.2 Research paper representation

Research papers are represented using term vectors. We use ‘term’ to mean a single word within the text of a paper, thus all words that appear in the training set of example papers add one dimension to our term vectors. Term vector weights are computed from the term frequency (TF) divided by the total number of terms, representing the normalized frequency with which a word appears within a research paper. Since many words are either too common or too rare to have useful discriminating power for a classifier, we use a few dimensionality reduction techniques to reduce the number of dimensions of the term vectors.

Porter stemming [Porter 1980] is used to remove term suffixes and the SMART [SMART Staff 1974] stop list is used to remove very common words like “the” and “or”. Term frequencies below 2 are removed since they have little discriminating power. Dimensionality reduction is common in information systems; [Sebastiani 2002] provides a good discussion of the issues.
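The term vector construction described above can be illustrated with a minimal Python sketch. It is not the Quickstep implementation (which was mostly written in Java): the function names `crude_stem` and `term_vector` are hypothetical, the tiny stop list stands in for the SMART stop list, the suffix stripper is only a rough stand-in for Porter stemming, and normalizing by the number of terms left after stop-word removal is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); the paper uses the SMART stop list.
STOP_WORDS = {"the", "or", "a", "an", "and", "of", "in", "to", "is", "are"}


def crude_stem(word):
    """Rough suffix stripper standing in for Porter stemming (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_vector(text, min_count=2):
    """Map each surviving term to its normalized term frequency.

    Terms occurring fewer than `min_count` times are dropped, mirroring
    the removal of term frequencies below 2.
    """
    words = re.findall(r"[a-z]+", text.lower())
    terms = [crude_stem(w) for w in words if w not in STOP_WORDS]
    total = len(terms)
    if total == 0:
        return {}
    counts = Counter(terms)
    # Weight = raw count divided by the total number of terms kept.
    return {t: c / total for t, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    sample = "Agents learn. Learning agents recommend papers. The agent recommends."
    print(term_vector(sample))
```

Storing the vector as a sparse dictionary keeps only the non-zero dimensions, which is convenient when the full vocabulary runs to thousands of terms.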
Most on-line research papers are in HTML, PS or PDF formats, with many papers being compressed. We support all these formats for maximum coverage in our problem domain, converting the papers to plain text and using this text to create the term vectors. Unusual or corrupt formats are ignored. Several heuristics are used to determine if the research papers are converted to text correctly and look like a typical research paper with terms such as ‘abstract’ and ‘references’. In the later experiments, term vectors for papers had around 15,000 dimensions after dimensionality reduction.

2.2.3 Classifier

Research papers in the central database are classified by an IBk [Aha et al. 1991] classifier, which is boosted by the AdaBoostM1 [Freund and Schapire 1996] algorithm. The IBk classifier is a k-Nearest Neighbour type classifier that uses example documents, called a training set, added to a term-vector space. Example documents in the training set are manually labelled using the class names within the research paper topic ontology. Figure 4 shows the basic k-Nearest Neighbour algorithm. The closeness of an unclassified vector to its neighbour vectors within the term-vector space determines its classification.

$$w(d_a, d_b) = \sqrt{\sum_{j=1}^{T} (t_{ja} - t_{jb})^2}$$

$w(d_a, d_b)$: kNN distance between documents a and b
$d_a, d_b$: document vectors
$T$: number of terms in document set
$t_{ja}$: weight of term j in document a

Fig. 4. k-Nearest Neighbour algorithm

Classifiers like k-Nearest Neighbour allow more training examples to be added to their term-vector space without the need to re-build the entire classifier. They also degrade well, so even when incorrect the class returned is normally in the right “neighbourhood” and so at least partially relevant. This makes k-Nearest Neighbour a robust choice of algorithm for research paper classification.
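As an illustration of the distance in Figure 4, the following Python sketch computes the Euclidean distance between two sparse term vectors and classifies a new vector from its nearest labelled neighbours. The names `distance` and `classify`, the value of `k`, the toy training data and the plain majority vote are assumptions made for this example; they do not reproduce the boosted IBk configuration used in Quickstep.

```python
import math
from collections import Counter


def distance(da, db):
    """Euclidean distance between two sparse term vectors (dicts).

    A term missing from a vector is treated as having weight 0.
    """
    terms = set(da) | set(db)
    return math.sqrt(sum((da.get(t, 0.0) - db.get(t, 0.0)) ** 2 for t in terms))


def classify(vector, training_set, k=3):
    """Return the most common label among the k nearest training examples.

    `training_set` is a list of (term_vector, label) pairs. The simple
    majority vote is an assumption, not necessarily IBk's exact rule.
    """
    neighbours = sorted(training_set, key=lambda ex: distance(vector, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical labelled examples using topic names from the ontology.
    training = [
        ({"agent": 0.5, "learn": 0.1}, "Agents"),
        ({"agent": 0.4}, "Agents"),
        ({"hypertext": 0.6}, "Hypermedia"),
        ({"hypertext": 0.5, "web": 0.2}, "Hypermedia"),
    ]
    print(classify({"agent": 0.45}, training, k=3))
```

Because the training set is just a list of labelled vectors, adding a new example is a single append, which reflects the point above that more examples can be added without rebuilding the classifier.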
Boosting works by repeatedly running a weak learning algorithm on various distributions of the training set, and then combining the specialist classifiers produced by the weak learner into a single composite classifier. The “weak” learning algorithm here is the IBk classifier. Figure 5 shows the AdaBoostM1 algorithm