A Hybrid Recommender System Guided by Semantic User Profiles for Search in the E-learning Domain

Leyla Zhuhadar and Olfa Nasraoui
Knowledge Discovery and Web Mining Lab, Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA
Emails: leyla.zhuhadar@wku.edu, olfa.nasraoui@louisville.edu

Abstract—Various concepts, methods, and technical architectures of recommender systems have been integrated into E-commerce storefronts, such as Amazon.com, Netflix, etc. As a result, Web users have recently become more familiar with the notion of recommendations. Nevertheless, little work has been done to integrate recommender systems into scientific information retrieval repositories, such as libraries, content management systems, online learning platforms, etc. This paper presents an implementation of a hybrid recommender system to personalize the user's experience on a real online learning repository and vertical search engine named HyperManyMedia. This repository contains educational content of courses, lectures, multimedia resources, etc. The main objective of this paper is to illustrate the methods, concepts, and architecture that we used to integrate a hybrid recommender system into the HyperManyMedia repository. This recommender system is driven by two types of recommendations: content-based (domain ontology model) and rule-based (learner's interest-based and cluster-based). Finally, combining the content-based and the rule-based models provides the user with hybrid recommendations that influence the ranking of the retrieved documents with different weights. Our experiments were carried out on the HyperManyMedia semantic search engine at Western Kentucky University. We used Top-n-Recall and Top-n-Precision to measure the effectiveness of re-ranking based on the learner's semantic profile. Overall, the results demonstrate the effectiveness of the re-ranking based on personalization.
Index Terms—recommender system, search engine, clustering, personalization, semantic profile

I. INTRODUCTION

The work presented in this paper describes a hybrid recommendation-based retrieval model that can filter information based on user needs. We believe that the methodology for designing an efficient recommender system, regardless of the approach used, i.e., content-based, collaborative, or hybrid, is to incorporate the following essential elements: contextual information, user interaction with the system, flexibility of receiving recommendations in a less intrusive manner, detecting the user's change of interest and responding accordingly, supporting user feedback, and finally the simplicity of the user interface. By tracking user behavior in our applied personalized vertical search engine, HyperManyMedia, we noticed that using general recommendation methods was not sufficient to make users interested in the recommendations provided by the system. However, if the recommender system was tailored to the user's specific needs via personalization, the user became more interested and engaged in the recommendation process. This finding led us to generalize the personalization aspect: we considered personalization the main building block of the recommender system architecture. This conclusion is noticeable in most recommender systems that have succeeded. Their success was a result not of the complexity of the theoretical methodology used to design the system, but rather of the usability and simplicity of the recommender system interface, which guides the user without interrupting his/her activities. In this paper, we present an implementation of a hybrid recommender system on a search engine frontend to a real online learning repository named HyperManyMedia. This repository contains educational content of courses, lectures, multimedia resources, etc.
The main objective of this paper is to illustrate the methods, concepts, and architecture that we used to integrate the recommender system into the HyperManyMedia repository. This recommender system is driven by two types of recommendations: content-based (domain ontology model) and rule-based (learner's interest-based and cluster-based). The domain ontology model, which is used to represent the learning materials, is composed of a hierarchy of concepts and subconcepts that represent colleges, courses, and lectures, whereas the learner's ontology model represents a subset of the domain ontology (an ontology that contains only a personalized, pruned subset of the whole domain, consisting only of the colleges/courses/lectures that the learner is interested in). Finally, combining the content-based and rule-based recommendations provides the user with hybrid recommendations that influence the ranking of the retrieved documents via different weights.

JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 2, NO. 4, NOVEMBER 2010, pp. 272-281. © 2010 ACADEMY PUBLISHER. doi:10.4304/jetwi.2.4.272-281

However, before describing the design of our system, we first
present a comprehensive background of the origin of recommender systems and other related work in Section II. Then, we present the various methodologies that we used in Section III, followed by a detailed description of our implementation and the evaluation results in Section IV. Finally, we draw our conclusions.

II. PREVIOUS WORK

The literature review in this paper concerns recommender systems in academic repositories. More specifically, we are interested in answering the following question: what is the current state of the art and the next generation of recommender systems in academic repositories, and do scientific portals, digital libraries, and e-book repositories consider the value of embedding recommender systems into their systems? To answer this question, we reviewed the most popular scientific digital libraries. In addition, we investigated some of the promising Web 2.0 digital libraries that use recommender systems. We believe that digital libraries are among the main services that can benefit from the usage of recommender systems. In particular, when we compare the usability of search engines with digital libraries, we notice that the design of search engines has changed dramatically over the last decade. Web users can easily search for resources using search engines. This flexibility is provided by the simplicity of the search engines' user interface. However, digital libraries did not adapt to those changes. The complexity of using a combined methodology of Boolean operators with metadata fields to retrieve resources from databases is considered a tedious process, especially for the new generation of Web users, who are not used to spending a long time searching for resources.
For example, many Web users now prefer using Google Scholar to search for journal articles, research papers, and e-books, rather than the digital libraries themselves, such as ACM, IEEE Xplore, or CiteSeer, regardless of its limitation in providing the user with complete access to the resource (unless the user has already set up his/her digital libraries' access in Google Scholar's advanced features). Thus it appears that the simplicity of the Google Scholar interface surpasses the accuracy that major digital libraries provide. However, ACM, IEEE Xplore, and CiteSeer have incorporated some techniques that could be considered a form of recommendations (with little success). For example, ACM Portal provides two types of recommendations: (1) a content-based research tool known as "find similar articles"; the mechanism used to find similar papers involves three techniques: cluster analysis, dictionaries, and thesauri, and the retrieved documents are ranked based on date, publisher, or relevance, but there is no reference to the type of measure used in the ACM Portal, as cited in [14]; and (2) behavior-based recommendations presented as "Peer-to-Peer readers of this article also read". According to [14], this recommendation is built using simple frequency counts, and therefore fails to provide accurate recommendations. According to [2], IEEE Xplore announced the implementation of content-based recommendations on their portal. Nevertheless, to date, no such recommender system is embedded into the IEEE Xplore libraries. However, CiteSeer1 represented a promising venue for the usage of recommender systems.
The first prototype provided users with three different types of recommendations: (1) link structure-based recommendations: these are based on link citations and can be distinguished into four types (recommending documents that are cited inside the searched document, recommending documents that cite the document, co-citation, and the active bibliography); (2) content-based recommendations using (TF-IDF) similarity metrics; and (3) explicit recommendations, where the user can rank the retrieved documents on a scale of 1 to 5. In addition, the user can write a review or a comment about the paper. However, progress on this portal apparently stopped in 2006. The success of Google Scholar is evident even though it provides limited recommendations, e.g., finding similar documents based on content, where the ranking of those documents may be inherited from Google's page ranking algorithm. Another limitation of Google Scholar is that it does not retrieve documents that are cited inside a specific document, but rather only the documents that cite this specific document. As we noticed, a variety of recommender system portals have been implemented in the domain of digital libraries and scientific repositories, some of which succeeded while others failed to survive. In the following paragraphs, we discuss two significant implementations of scientific recommender systems. The first is the Melvyl recommender system, which has been implemented by the California Digital Library2. This system uses a simple technique to provide recommendations to users. First, it generates a graph of all the purchased documents in the library; then each document is considered as a weighted node (with the weight representing the number of purchases). Therefore, the recommendation for a given document is based on the neighboring nodes (documents), which are sorted according to their edge weights.
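The Melvyl-style approach described above can be sketched in a few lines; the class and method names here are hypothetical illustrations, not part of the Melvyl system. Documents are nodes, edges are weighted by how often two documents were purchased together, and the recommendations for a document are its neighbors sorted by descending edge weight.

```python
from collections import defaultdict

class CoPurchaseGraph:
    """Toy sketch of a purchase-graph recommender (illustrative names)."""

    def __init__(self):
        # edges[a][b] = number of times a and b were purchased together
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_purchase_pair(self, doc_a, doc_b):
        # Each co-purchase strengthens the (undirected) edge between the two documents.
        self.edges[doc_a][doc_b] += 1
        self.edges[doc_b][doc_a] += 1

    def recommend(self, doc, top_n=3):
        # Neighboring nodes sorted by descending edge weight.
        neighbors = self.edges[doc]
        return sorted(neighbors, key=neighbors.get, reverse=True)[:top_n]

g = CoPurchaseGraph()
g.add_purchase_pair("d1", "d2")
g.add_purchase_pair("d1", "d2")
g.add_purchase_pair("d1", "d3")
print(g.recommend("d1"))  # → ['d2', 'd3']
```

The heavier-weight neighbor ("d2", co-purchased twice) is recommended first, matching the edge-weight sorting described in the text.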
The second is TechLens3, which is specialized for the domain of scientific papers; it uses hybrid recommendations combining a collaborative filtering and a content-based approach. The system uses graph theory, where each research paper is considered as a node and the citations inside each paper are considered as recommended nodes. Also, the system uses a more complex collaborative filtering (CF) technique that considers each cited paper as an input, therefore also considering all citation papers as recommendations. This technique is referred to as Dense CF. Finally, the system applies a content-based recommendation technique (TF-IDF) on the list of all

1 http://citeseer.ist.psu.edu
2 http://www.dlib.org/Architext/AT-dlib2query.html
3 http://techlens.cs.umn.edu/tl3
recommended papers. Thus, the most similar papers are recommended to the user. The system provides two options: (1) Pure content-based CF, where the similarity measure is only based on two entities, the title of the paper and the abstract; and (2) Content-based Separated-CF, where the whole text of the papers is considered. The final recommendations provided to the user are a list of sorted recommendations that combine multiple factors based on the type that the user chose. Recently, with the increased popularity of social tagging systems, portals such as CiteULike4 and BibSonomy5 are considered promising projects that use social bookmarking to derive recommendations. [1], [5], [6], [4] used a different approach to recommend documents based on user profiles, in this case by learning from implicit feedback or past click history. Other ways to form a user model include using data mining, such as by mining association rules [11], or by partitioning a set of user sessions into clusters or groups of similar sessions. The latter groups are called session clusters [12], [10], or user profiles [12], [10]. More recently, [13] presented a Semantic Web usage mining methodology for mining evolving user profiles on dynamic Websites by clustering the user sessions in each period and relating the user profiles of one period with those discovered in previous periods to detect profile evolution, and also to understand what type of profile evolutions have occurred. This latter branch of using data mining techniques to discover user models from Web usage data is referred to as Web Usage Mining. A previous work that used Web mining for developing smart E-learning systems [16] integrated Web usage mining, where patterns were automatically discovered from users' actions, and then fed into a recommender system that could assist learners in their online learning activities by suggesting actions or resources to a user.
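The session-partitioning idea mentioned above (grouping user sessions into clusters of similar sessions) can be illustrated with a simple leader-style clustering pass. This is a generic sketch under our own assumptions, not the algorithm of the cited works: each session is the set of pages it visited, similarity is the cosine of the binary visit vectors, and a session joins the first cluster whose leader it resembles closely enough.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sessions represented as sets of pages."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def cluster_sessions(sessions, threshold=0.5):
    """Leader clustering: assign each session to the first similar-enough cluster."""
    clusters = []  # each cluster: {"leader": set, "members": [sets]}
    for s in sessions:
        for c in clusters:
            if cosine(s, c["leader"]) >= threshold:
                c["members"].append(s)
                break
        else:
            # No similar cluster found: this session starts a new one.
            clusters.append({"leader": s, "members": [s]})
    return clusters

sessions = [
    {"intro.html", "lecture1.html"},
    {"intro.html", "lecture1.html", "quiz1.html"},
    {"finance.html", "accounting.html"},
]
clusters = cluster_sessions(sessions)
print(len(clusters))  # → 2
```

The two overlapping learning sessions fall into one cluster while the unrelated finance session forms its own, mirroring how session clusters approximate user profiles.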
Another type of data mining in E-learning was performed on documents rather than on the students' actions. This type of data mining is more akin to text mining (i.e., knowledge discovery from text data) than Web usage mining [3]. This approach helps alleviate some of the problems in E-learning that are due to the volume of data, which can be overwhelming for a learner. It works by organizing the articles and documents based on their topics and also providing summaries for documents. [7] combines Web usage mining and text-based indexing and search in the content to provide hybrid recommendations. [8] uses a learning algorithm to select sequential articles based on context and user-click feedback to recommend news articles to users. Our approach shares some similarity with the above techniques. It is a Hybrid recommender system which combines Content-based recommendations with two types of Rule-based recommendations. In Section III, we explain our methodology, followed by the implementation section. Finally, we present our evaluations and we conclude with our key findings.

III. METHODOLOGY

The methodology of designing our hybrid recommender system is divided into two parts. (1) The first part centers around designing the domain ontology: first, we relied on a fine-grained taxonomy that encapsulates the whole domain of Education in general, and E-learning in particular, by borrowing a ready-made taxonomy from WordNet. This attempt ended in great disappointment, since the terminology used in WordNet is far broader than, and different from, what the HyperManyMedia domain contains. As a result, two major problems occurred: the overloading of the fine-grained taxonomy during the search process, and ambiguity. Therefore, a decision was made to create a hand-made ontology using a coarse-grained taxonomy. In Section IV, we describe in detail the design of the domain ontology.
(2) The second part centers around designing the learner's ontology: each learner has his or her own ontology based on his/her preferences. The learner's ontology is extracted from the domain ontology and presented as a pruned subset ontology. In Section IV, we describe in detail the design of the learner's ontology. In the following sections, we describe the methodology used to provide the learner with hybrid recommendations: (1) Ontology Content-based, (2) Cluster-based, and (3) Interest-based.

A. Building the HyperManyMedia Domain Ontology

Recently, a variety of knowledge-based framework applications became available that support modeling ontologies. The best known applications are Protégé6 and Altova7. We used Protégé as a framework application. Figure 1 shows the design of the HyperManyMedia ontology in Protégé. Since our approach is based on a search engine recommender system, the content of each lecture is considered as a document, and the recommendation of pages is related to the degree of matching between a learner's query and the reverse index of the lecture (Webpage). The HyperManyMedia search engine uses the Vector Space Model (VSM), and the score of a query q for a document d is computed based on the cosine similarity between the document and the query vectors. The implementation can be described as follows: (1) Preliminary crawling and indexing (offline): crawling and indexing the E-learning platform that contributes the content of the recommendation; (2) we represent each of the N documents as a term vector d = <w_1, w_2, ..., w_n>, where w_i is the weight of term i, combining the term frequency tf_i and the term's inverse document frequency IDF_i = log(N/n_i) if this term occurs in n_i documents, so that w_i = tf_i * log(N/n_i); and (3) Building the E-learning Domain Ontology: let R represent the root of
the domain, which is represented as a tree, and let C_i represent a concept under R, so that R = ∪_{i=1}^{n} C_i, where n is the number of concepts in the domain. Each concept C_i consists either of subconcepts (C_i = ∪_{j=1}^{m_i} SC_ij) or of leaves, which are the actual documents (C_i = ∪_{k=1}^{l_i} d_ki). We encoded this semantic information into a tree-structured domain ontology in OWL, based on the hierarchy of the E-learning resources. The root concepts are the colleges, the sub-concepts are the courses, and the leaves are the resources of the domain (lectures).

Figure 1. Hierarchical Structure of the HyperManyMedia Ontology.

IV. IMPLEMENTATION

A. Ontology Content-based Recommendations

The idea of a content-based recommender system in an E-learning platform can be summarized as follows: given the lectures that the learner has visited, the platform recommends other lectures whose content is similar to that of the viewed lectures. We build the learner's ontology profile by extracting the learner's interests from the user's profile. Let docs(U_i) = ∪_{k=1}^{l} d_ki be the documents visited by the i-th learner, U_i. The learner's ontology is considered a subset of the E-learning domain ontology from Section III.A. Since the activity log of the user's activities records the visited documents (which are the leaves), a bottom-up pruning algorithm is used to extract the semantic concepts that the learner is interested in. Each learner U_i has a dynamic semantic representation. First, we collect the learner's activities over a period of time to form an initial learner profile, as follows: starting from the leaves, the bottom-up pruning algorithm searches for each document visited by the learner in the "domain semantic structure" and then increments the visit count (initialized to 0) of that node and of each of its ancestors, all the way up to the root.
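The bottom-up count propagation just described can be sketched as follows. This is a minimal illustration with hypothetical node names (lecture1, course_db, etc.), not the actual OWL ontology:

```python
# Each node maps to its parent; leaves are lecture documents,
# inner nodes are courses and colleges, and the root has no parent.
parent = {
    "lecture1": "course_db", "lecture2": "course_db",
    "lecture3": "course_ai",
    "course_db": "college_cs", "course_ai": "college_cs",
    "college_cs": "root",
}

def build_profile(visited_docs):
    """Back-propagate one visit count from each visited leaf to the root."""
    counts = {}
    for doc in visited_docs:
        node = doc
        while node is not None:
            counts[node] = counts.get(node, 0) + 1
            node = parent.get(node)  # the root has no parent, ending the walk
    return counts

# Two visits to lecture1 and one to lecture3
profile = build_profile(["lecture1", "lecture1", "lecture3"])
```

After propagation, `profile` weights every ancestor by the visits it accumulated, so pruning can keep only the colleges and courses with nonzero counts.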
After back-propagating the counts of all the documents in this way through the domain structure, the pruning algorithm keeps only the concepts (colleges) and sub-concepts (courses) related to the learner's interests, along with their interest weights (the numbers of visits). When a learner searches for a lecture using a specific query q, the cosine similarity measure is used to retrieve the most similar documents d that contain the terms in the query, as shown in equation (1):

S_cosine(d, q) = (d · q) / (||d||_2 · ||q||_2) = Σ_{j=1}^{n} d_j q_j / ( sqrt(Σ_{j=1}^{n} (d_j)^2) · sqrt(Σ_{j=1}^{n} (q_j)^2) )    (1)

As we mentioned in Section III, the HyperManyMedia search engine's scoring algorithm is based on the VSM. For each field, the score is computed as follows:

score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t in q} ( tf(t in d) × idf(t)^2 × t.getBoost() × norm(t, d) )    (2)

Lucene (Apache) defines each term of equation (2) as follows [9], where tf(t in d) is the number of times term t appears in the currently scored document d.
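Equation (1), together with the tf-idf weighting w_i = tf_i × log(N/n_i) from Section III, can be computed directly. The sketch below uses a hypothetical toy corpus of tokenized lectures, not the HyperManyMedia index itself:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """w_i = tf_i * log(N / n_i): term frequency times inverse document frequency."""
    N = len(docs)
    df = Counter()                      # n_i: number of documents containing term i
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine_similarity(d, q):
    """Equation (1): S_cosine = (d . q) / (||d||_2 * ||q||_2) over sparse vectors."""
    dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Toy corpus of three tokenized "lectures"
corpus = [["database", "sql", "index"],
          ["database", "normalization"],
          ["sql", "join", "index"]]
vecs = tfidf_vectors(corpus)
query_vec = {"sql": 1.0, "index": 1.0}
scores = [cosine_similarity(v, query_vec) for v in vecs]
```

Documents sharing more query terms score higher, while a document with no overlapping terms scores zero, which is the behavior the retrieval step relies on.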
Algorithm 1: Re-ranking a learner's search results

Input: U_i // user index
Input: α, β, γ // boost thresholds
Output: Rank // re-ranked documents

Rank ← {d_1, ..., d_n} // default search results for query q
U_R ← documents in learner U_i's semantic profile
R_C ← documents in the recommended cluster
foreach d_i ∈ Rank do
    if d_i ∈ U_R then // document is in user profile
        d_i.boost ← α
    else if d_i ∈ R_C then // document is in recommended cluster
        d_i.boost ← β
    else
        d_i.boost ← γ
end
Sort Rank based on the document boost field d_i.boost
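Algorithm 1 can be sketched in Python as follows. This is a simplified stand-in for the Lucene field.setBoost() mechanism, with hypothetical document IDs and base scores:

```python
def rerank(results, user_profile_docs, recommended_cluster,
           alpha=5.0, beta=3.0, gamma=1.0):
    """Assign category boosts as in Algorithm 1, then sort by boost.

    results: list of (doc_id, base_cosine_score) pairs from the search engine.
    Within each category, the base cosine score breaks ties.
    """
    boosted = []
    for doc_id, base_score in results:
        if doc_id in user_profile_docs:       # Category 1: learner's semantic profile
            boost = alpha
        elif doc_id in recommended_cluster:   # Category 2: recommended cluster
            boost = beta
        else:                                 # Category 3: all remaining documents
            boost = gamma
        boosted.append((doc_id, boost, base_score))
    boosted.sort(key=lambda x: (x[1], x[2]), reverse=True)
    return [doc_id for doc_id, _, _ in boosted]

# Hypothetical ranked results: d3 is in the profile, d2 and d4 in the cluster
results = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.6)]
order = rerank(results, user_profile_docs={"d3"},
               recommended_cluster={"d2", "d4"})
```

With these inputs, the profile document rises to the top despite its lower cosine score, and the cluster documents outrank the unboosted one.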
tf(t in d) = frequency^{1/2}; idf(t) is the inverse document frequency (related to the number of documents in which the term t appears), defined as

idf(t) = 1 + log( numDocs / (docFreq + 1) );

coord(q, d) is a score factor based on how many of the query terms are found in the specified document; and queryNorm(q) is a normalizing factor used to make scores comparable between queries:

queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sumOfSquaredWeights^{1/2}    (3)

The sum of squared weights (of the query terms) is computed by the query weight object. For example, for a Boolean query, we compute this value as

sumOfSquaredWeights = q.getBoost()^2 × Σ_{t in q} ( idf(t) × t.getBoost() )^2    (4)

where t.getBoost() is a search-time boost of term t in the query q, as specified in the query text or set by application calls to setBoost(). There is only multi-term boost access, so the boost of a term in the query is accessible by calling the sub-query's getBoost(). Finally, norm(t, d) encapsulates a few (indexing-time) boost and length factors: (1) the document boost, set by calling doc.setBoost() before adding the document to the index; (2) the field boost, set by calling field.setBoost() before adding the field to a document; and (3) lengthNorm(field), computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score (lengthNorm is computed by the Similarity class in effect at indexing). When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

norm(t, d) = doc.getBoost() × lengthNorm(field) × ∏_{field f in d named as t} f.getBoost()    (5)

When a learner searches for lectures using a specific query q, the cosine similarity measure is used to retrieve the most similar documents that contain the terms in the query.
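For illustration, the individual factors above can be computed by hand. The functions below mirror the tf, idf, queryNorm, and sumOfSquaredWeights formulas as given; this is an illustrative sketch, not Lucene's actual implementation (which, for instance, quantizes norms at index time):

```python
import math

def tf(raw_freq):
    # tf(t in d) = frequency^(1/2)
    return math.sqrt(raw_freq)

def idf(num_docs, doc_freq):
    # idf(t) = 1 + log(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def query_norm(sum_of_sq_weights):
    # queryNorm(q) = 1 / sumOfSquaredWeights^(1/2), equation (3)
    return 1.0 / math.sqrt(sum_of_sq_weights)

def sum_of_squared_weights(query_terms, num_docs, doc_freqs, query_boost=1.0):
    # sumOfSquaredWeights = q.getBoost()^2 * sum_t (idf(t) * t.getBoost())^2, equation (4)
    return query_boost ** 2 * sum(
        (idf(num_docs, doc_freqs[t]) * boost) ** 2 for t, boost in query_terms)

# Hypothetical single-term query over a 1000-document index;
# the term appears in 9 documents, with a neutral boost of 1.0
terms = [("database", 1.0)]
ssw = sum_of_squared_weights(terms, 1000, {"database": 9})
qn = query_norm(ssw)
```

For a single unboosted term, queryNorm reduces to 1/idf(t), so the normalized weight of the term is exactly idf(t), as expected from equations (3) and (4).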
In our approach, these results are re-ranked based on two main factors: (1) the semantic relation between these documents and the learner's semantic profile, and (2) the cluster most similar to the learner's semantic profile (the recommended cluster). Algorithm 1 maps the ranked documents to the learner's semantic profile (Category 1): each document d_i belonging to the learner's semantic profile is assigned a priority ranking (α = 5.0), each document d_i belonging to the recommended cluster (Category 2) is assigned a priority ranking (β = 3.0), and the rest of the documents (Category 3) have the lowest priority (γ = 1.0). The threshold for each parameter was decided heuristically after several trials (α = 5.0, β = 3.0, γ = 1.0). The documents in each category are then re-ranked based on cosine similarity to the query q. Our search engine (based on Nutch) uses optional boosting scores to determine the importance of each term in an indexed document when adding up the document-to-query term matches in the cosine similarity; thus a higher boosting factor for a term forces a larger contribution from that term in the sum. We modified the boosting score as follows: field.setBoost() = α in the case of Category 1, field.setBoost() = β in the case of Category 2, and field.setBoost() = γ in the case of Category 3. Accordingly, all documents are boosted and re-ranked based on two factors; we introduce the first factor here and the second in the following section. Algorithm 1 maps the ranked documents to the learner's semantic profile (the learner's previously visited lectures) as Category 1, where each document d_i belonging to the learner's semantic profile is assigned a priority ranking (α = 5.0). This boosting score is implemented using field.setBoost(); the weight is added only to the documents that the learner is interested in, based on his/her previous activities (sessions).
Since we used the ontology to generate the user profile, we named this type of recommendation Ontology Content-based Recommendations.

B. Cluster-based Recommendations

The total corpus, consisting of around 7,424 documents (lectures), was divided into 4,888 English documents and 2,536 Spanish documents. In both cases, we experimented with partitional algorithms, direct K-way clustering (similar to K-means), and repeated bisection (Bisecting K-Means) with all criterion functions. We also experimented with graph-partitioning-based clustering algorithms [15]. First, for clustering the English documents, we compared different hierarchical algorithms on the English corpus of 4,888 documents using the clustering package Cluto [15].

8 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

The best clustering method