NTERNATIONAL JOURNAL OF COMPUTERS ssue 3. Volume 5. 2011 Semantic Search Itinerary recommender System LIVIU ADRIAN COTFAS ANDREEA DIOSTEANU STEFAN DANIEL DUMITRESCU ALEXANDRU SMEUREANU Abstract- In this paner we present a novel aproach based on users to schedule th:时 Natural Language Processing and hybrid multi-objective genetic it during the trip algorithms for developing mobile tourism itinerary recommender Furthermore, another important research trend in the systems. The proposed semantic matching technique allows users to find Points of Interest- POls that match highly specific preferences integration of semantic technologies in many fields as they Moreover, it can also be used to further filter results from traditional enhance the process of knowledge retrieval recommender techniques, such as collaborative filtering, and only structuring. Paper [l] presents for example a Fuzzy-Neural requires a minimal initial input from the user to generate relevant Network (FNN) model to construct Q&a knowledge base recommendations. A hybrid multi-objective genetic algorithm has automatically. In this paper we use an approach based on been developed in order to allow the tourists to easily choose document indexing and semantic search in order to allow users between several Pareto optimal itineraries time. Furthermore, the proposed system is easy to use, thus it can be puted in near real- to easily find interesting tourist attractions ated that our solution is both complex and at the same time user Recommender systems also represent a trend of great interest for the current research. They consist of software tools and techniques providing suggestions for items to be of use to Keywords- semantic similarity, free text document indexing, a user[2]. In order to determine the most suitable suggestions, multi-objective genetic algorithm, itinerary recommender system most implementations rely on collaborative filtering and content-based methods. However, these solutions individually INTRODUCTION fail to provide good solutions in many situations. In order to Nowadays,designing flexible, efficient and user-friendly overcome these failures, 3] proposes an alternative method to mobile tour guide applications is of a great interest both content-based filtering based on neural networks. The article from a commercial and research point of view. Such systems presents a framework used for designing recommender are useful for tourists visiting a location in a limited period of systems for combining neural networks and collaborative time. Without any support from a system, a user manually filtering. Recommender Systems also have building an itinerary, should spend a lot of time prior to the applications in a wide range of domains from education and trip, searching for information regarding the tourist attractions, training to economics [4], or tourism as it is the case of the reading facts about them and designing possible itineraries by application presented in this paper taking into account factors such as opening hours, visiting One of the first attempts of applying recommender systems durations, distances and available means of public transport. in tourism is represented by the Cyberguide [5] project, in which the authors develop a mobile application targeted to model the tour guide activity. The paper also acknowledges ire the resuit of the importance of context, but limited only to location novation triangle" This project is co funded by European Social Fund awareness. The authors present the overall architecture and the through The Sectorial Operational Program for Human Resources development procedures for multiple different hand-held Development 2007-2013, coordinated by The Bucharest Academy of platforms. Moreover, the general research issues that have the education research and innovation triangle, DOC-ECr"). Parts of this emerged in context-aware applications development in a article are the result of posdru. financing contract POSDRU/6/1. 5/S/19 mobile environment are discusse L A. Cotfas is with the Economic Informatics Department, Bucharest cademy of Economic Studies, Bucharest, Romania(phone: 40-772-402-305 ides poi recommendations based on the user's profile and context. The A. Diosteanu is with the Economic Informatics Department, authors also present an evaluation of the visitor's experience with the system. A drawback of the approach is represented by S D. Dumitrescu is with the Computers and Information the amount of information required from the user in order to Department, Politehnica University of Bucharest, Bucharest, Romania (e- customize the tour. Guide offers users the possibility of ail: dumitrescu. stefan gmail. com). creating a tour by selecting objectives, presenting their the economic Academy of Economic Studies domains of interest and preferences. The restrictions tha alexandru. smeureanu a gmail. com). system takes into account refer to the schedule of objectives
Abstract— In this paper we present a novel approach based on Natural Language Processing and hybrid multi-objective genetic algorithms for developing mobile tourism itinerary recommender systems. The proposed semantic matching technique allows users to find Points of Interest – POIs that match highly specific preferences. Moreover, it can also be used to further filter results from traditional recommender techniques, such as collaborative filtering, and only requires a minimal initial input from the user to generate relevant recommendations. A hybrid multi-objective genetic algorithm has been developed in order to allow the tourists to easily choose between several Pareto optimal itineraries computed in near realtime. Furthermore, the proposed system is easy to use, thus it can be stated that our solution is both complex and at the same time useroriented. Keywords— semantic similarity, free text document indexing, multi-objective genetic algorithm, itinerary recommender system. INTRODUCTION owadays, designing flexible, efficient and user-friendly mobile tour guide applications is of a great interest both from a commercial and research point of view. Such systems are useful for tourists visiting a location in a limited period of time. Without any support from a system, a user manually building an itinerary, should spend a lot of time prior to the trip, searching for information regarding the tourist attractions, reading facts about them and designing possible itineraries by taking into account factors such as opening hours, visiting durations, distances and available means of public transport. Manuscript received April 12, 2011. Parts of this article are the result of the project „Doctoral Program and PhD Students in the education research and innovation triangle”. This project is co funded by European Social Fund through The Sectorial Operational Program for Human Resources Development 2007-2013, coordinated by The Bucharest Academy of Economic Studies (Project no. 7832, “Doctoral Program and PhD Students in the education research and innovation triangle, DOC-ECI”). Parts of this article are the result of POSDRU, financing contract POSDRU/6/1.5/S/19. L.A. Cotfas is with the Economic Informatics Department, Bucharest Academy of Economic Studies, Bucharest, Romania (phone: 40-772-402-305 e-mail: liviu.cotfas@gmail.com). A. Dioșteanu is with the Economic Informatics Department, Bucharest Academy of Economic Studies, Bucharest, Romania (e-mail: andreea.diosteanu@gmail.com). S.D. Dumitrescu is with the Computers and Information Technology Department, Politehnica University of Bucharest, Bucharest, Romania (email: dumitrescu.stefan@gmail.com). A. Smeureanu is with the Economic Informatics Department, Bucharest Academy of Economic Studies, Bucharest, Romania (e-mail: alexandru.smeureanu@gmail.com). Conversely, itinerary recommender systems not only assist users to schedule the initial itinerary, but can also easily revise it during the trip. Furthermore, another important research trend in the integration of semantic technologies in many fields as they enhance the process of knowledge retrieval, processing and structuring. Paper [1] presents for example a Fuzzy-Neural Network (FNN) model to construct Q&A knowledge base automatically. In this paper we use an approach based on document indexing and semantic search in order to allow users to easily find interesting tourist attractions. Recommender systems also represent a trend of great interest for the current research. They consist of software tools and techniques providing suggestions for items to be of use to a user [2]. In order to determine the most suitable suggestions, most implementations rely on collaborative filtering and content-based methods. However, these solutions individually fail to provide good solutions in many situations. In order to overcome these failures, [3] proposes an alternative method to content-based filtering based on neural networks. The article presents a framework used for designing recommender systems for combining neural networks and collaborative filtering. Recommender Systems also have a lot of applications in a wide range of domains from education and training to economics [4], or tourism as it is the case of the application presented in this paper. One of the first attempts of applying recommender systems in tourism is represented by the Cyberguide [5] project, in which the authors develop a mobile application targeted to model the tour guide activity. The paper also acknowledges the importance of context, but limited only to location awareness. The authors present the overall architecture and the development procedures for multiple different hand‐held platforms. Moreover, the general research issues that have emerged in context‐aware applications development in a mobile environment are discussed. The Guide system presented in [6] provides POI recommendations based on the user’s profile and context. The authors also present an evaluation of the visitor’s experience with the system. A drawback of the approach is represented by the amount of information required from the user in order to customize the tour. Guide offers users the possibility of creating a tour by selecting objectives, presenting their domains of interest and preferences. The restrictions that the system takes into account refer to the schedule of objectives, Semantic Search Itinerary Recommender System LIVIU ADRIAN COTFAS, ANDREEA DIOSTEANU, STEFAN DANIEL DUMITRESCU, ALEXANDRU SMEUREANU N INTERNATIONAL JOURNAL OF COMPUTERS Issue 3, Volume 5, 2011 370
NTERNATIONAL JOURNAL OF COMPUTERS ssue 3. Volume 5. 2011 ntervals. The to be change 1.r dynamically, based on the actual time the tourist already spent attractio at some attractions or based on the weather conditions For example, if the weather is inappropriate for visiting open-air points of interest, than the system automatically suggests indoors activities). The user profile is also continuously 2. Build the itinerary updated, using information from previous trips. [7 presents an artificial intelligence and metaheuristic approach for dealing with the tourist planning issue on mobile devices. The solution presented offers a decision support and a modeling mechanism 3. Guidance during the tour starting from the orienteering problem. The problem involves Fig. I The main functionalities of itinerary recommender systems set of candidate objectives for which a score is being associated. The problem is thus reduced to maximizing the Creating the list of POl: In the simpler approaches like total score of the visited places, while complying with the the one presented in P-Tour [101, the user is requested to constraints related to the total amount of time or the total select the desired PoI from a list without receiving any distance. The score of a point of Interest - Poi is defined as recommendations from the system. On the opposite side, the interest of a tourist in that place. Scores are calculated other systems [11 opt for a completely automatic Ising the vector space model. The trip planning problem is approach in which the user has no control over the solved using a guided local search metaheuristic. In order to selected Points of Interest -POl. mTripAssistent compare the performance of this approach with an algorithm combines the manual and automatic approaches for that appeared in the literature, both are applied to a real data building the list of Pol, by both allowing the user to set from the city of Ghent manually select locations and using an algorithm to Our approach is also based on semantic search as [7] choose the best locations for the remaining time uses neuronal network based document indexing mechanisms. Building the itinerary: Building tourist itineraries is an Furthermore, we present the user a list of possible points of extension of the Traveling Salesmen Problem which is interest as [6], but we dynamically create the tour and we known to be already NP-hard. Existing approaches either don't require data introduction effort from the user. Moreover only try to maximize the sum of scores associated to the our approach also generates public transport routes to link visited POls [Il], or if multiple criteria are taken into objectives and also offers information regarding the pollution onsideration, they are combined using a simple level from the areas in which pollution sensors are available weighted sum. [4] presents an approach based on using The trip planner presented in this paper is tested in Bucharest weighted sum that takes into consideration the preference expressed by the user and the distance between the This paper is organized as follows: in the second section we Points of Interest. mTripAssistent uses a novel multi make a short overall presentation of the main characteristics of objective algorithm that allows finding several Pareto- the proposed solution and we highlight the general features of optimal solutions in near real-time he recommender systems, in the third part we present the Guidance during the tour: As it was shown in the user semantic technologies and algorithm applied for semantic studies performed by [ 12], tourists typically modify the search and document indexing by using key words, in the list of Pol during the itinerary. Therefore mTripAssistent fourth section we formalize the model used for generating allows the user to easily change the itinerary at any itineraries, while the fifth section enlarges upon the overall moment during the tour. distributed architecture of our web service based platform, and As it can be seen from the features presented above the in the sixth section we present the entire itinerary generation novelty of our approach resides in the fact that it requires flow and the scenario, in addition we also introduce the mobile less input and effort from the user, is flexible and dynamic itinerary recommender application. The last section will by applying the multi-objective algorithm a wider set of conclude the article and present the future research guidelines. constraints and rules can be applied. Furthermore the proposed solution also recommends public transport ITINERARY RECOMMENDER SYSTEMS routes In order to offer users the possibility of flexibly selecting the itinerary based on their preferences, we propose an DOCUMENT INDEXING FOR EXTRACTING CANDIDATE POIS algorithm for indexing documents and a flexible semantic In order to determine the possible candidate points of search mechanism interest, we developed a semantic search algorithm that has Several reviews of existing itinerary recommender solutions two main steps: the semantic classification of the description can be found in [8] and [9]. In most approaches, the process of files of POi and also a document indexing step. Afterwards we generating itineraries and guiding users can be divided in three perform the semantic matching between the indexed words distinct steps, which also represent the main functionalities of and the search phrases introduced by the user. An approach for itinerary recommender systems(Fig 1.) key words extraction based on WordNet is also presented in
the best visiting intervals. The tour can be changed dynamically, based on the actual time the tourist already spent at some attractions or based on the weather conditions. For example, if the weather is inappropriate for visiting open-air points of interest, than the system automatically suggests indoors activities). The user profile is also continuously updated, using information from previous trips. [7] presents an artificial intelligence and metaheuristic approach for dealing with the tourist planning issue on mobile devices. The solution presented offers a decision support and a modeling mechanism starting from the orienteering problem. The problem involves a set of candidate objectives for which a score is being associated. The problem is thus reduced to maximizing the total score of the visited places, while complying with the constraints related to the total amount of time or the total distance. The score of a Point of Interest - POI is defined as the interest of a tourist in that place. Scores are calculated using the vector space model. The trip planning problem is solved using a guided local search metaheuristic. In order to compare the performance of this approach with an algorithm that appeared in the literature, both are applied to a real data set from the city of Ghent. Our approach is also based on semantic search as [7] but uses neuronal network based document indexing mechanisms. Furthermore, we present the user a list of possible points of interest as [6], but we dynamically create the tour and we don’t require data introduction effort from the user. Moreover, our approach also generates public transport routes to link objectives and also offers information regarding the pollution level from the areas in which pollution sensors are available. The trip planner presented in this paper is tested in Bucharest city. This paper is organized as follows: in the second section we make a short overall presentation of the main characteristics of the proposed solution and we highlight the general features of the recommender systems, in the third part we present the semantic technologies and algorithm applied for semantic search and document indexing by using keywords, in the fourth section we formalize the model used for generating itineraries, while the fifth section enlarges upon the overall distributed architecture of our web service based platform, and in the sixth section we present the entire itinerary generation flow and the scenario, in addition we also introduce the mobile itinerary recommender application. The last section will conclude the article and present the future research guidelines. ITINERARY RECOMMENDER SYSTEMS In order to offer users the possibility of flexibly selecting the itinerary based on their preferences, we propose an algorithm for indexing documents and a flexible semantic search mechanism. Several reviews of existing itinerary recommender solutions can be found in [8] and [9]. In most approaches, the process of generating itineraries and guiding users can be divided in three distinct steps, which also represent the main functionalities of itinerary recommender systems (Fig 1.): Fig. 1 The main functionalities of itinerary recommender systems Creating the list of POI: In the simpler approaches like the one presented in P-Tour [10], the user is requested to select the desired POI from a list without receiving any recommendations from the system. On the opposite side, other systems [11] opt for a completely automatic approach in which the user has no control over the selected Points of Interest – POI. mTripAssistent combines the manual and automatic approaches for building the list of POI, by both allowing the user to manually select locations and using an algorithm to choose the best locations for the remaining time. Building the itinerary: Building tourist itineraries is an extension of the Traveling Salesmen Problem which is known to be already NP-hard. Existing approaches either only try to maximize the sum of scores associated to the visited POIs [11], or if multiple criteria are taken into consideration, they are combined using a simple weighted sum. [4] presents an approach based on using weighted sum that takes into consideration the preference expressed by the user and the distance between the Points of Interest. mTripAssistent uses a novel multiobjective algorithm that allows finding several Paretooptimal solutions in near real-time. Guidance during the tour: As it was shown in the user studies performed by [12], tourists typically modify the list of POI during the itinerary. Therefore mTripAssistent allows the user to easily change the itinerary at any moment during the tour. As it can be seen from the features presented above the novelty of our approach resides in the fact that it requires less input and effort from the user, is flexible and dynamicby applying the multi-objective algorithm a wider set of constraints and rules can be applied. Furthermore the proposed solution also recommends public transport routes. DOCUMENT INDEXING FOR EXTRACTING CANDIDATE POIS In order to determine the possible candidate points of interest, we developed a semantic search algorithm that has two main steps: the semantic classification of the description files of POI and also a document indexing step. Afterwards we perform the semantic matching between the indexed words and the search phrases introduced by the user. An approach for key words extraction based on WordNet is also presented in INTERNATIONAL JOURNAL OF COMPUTERS Issue 3, Volume 5, 2011 371
NTERNATIONAL JOURNAL OF COMPUTERS ssue 3. Volume 5. 2011 [13]. The article uses key sentences for automatically The artificial network model can be seen in Fig. 2 summarizing by applying statistical methods. Our In order to train the neuronal network we used the back paper uses a neuronal network approach for identifying propagation algorithm implemented in Weka[20] on a set of manually classified instances. For the training set we have Document indexing is a useful in a lot of Natural Language chosen documents from different sources, with a focus on ocessing application such as text mining, knowledge newspaper articles. We created a java application that retrieval, keywords extraction, etc [14]. Indexing defines the automatically extracts words in the document by following the process of converting a document into a list of words included steps presented above and computes the values for them. We in it. The result of indexing is represented by a list of words then manually classify the extracted potential keywords into called index language [151 two gro eywords and non-keywords. We use the The selection process of document indexing has the resulting set of classified instances to train the ANN following steps Step 1: Eliminate stop words. Stop words are words which have grammatical functionality, but are not related to the ontent of the document. Some possible stop words are conjunction, prepositions, auxiliary verbs, articles, pronouns Step 2: For the remaining of the document words in the list perform word stemming. This means that we extract the root of the words. In order to achieve this we use porter stemmer for English language &.Step 3: This step consists in performing a neuronal network ased filtering so that to extract only those words that are relevant to the context The input nodes for the neural network are represented by the following features that can be grouped into statistic features(TF, IDF, ITF)[15], position features (T, Fpos)[16] grammatical syntax(subject, object, predicate(SPO)) term frequency (TF)- the frequency of each term in the Fig 2 Artificial neuronal network for extracting relevant words document inverted document frequency(DF)the number of We tested the artificial neuronal network on a set of 50 documents that contain the word from the initial English documents. From the set of 3850 identified possible document sample keywords, only 10% were selected in the manual classification IDF=-Log2 Tol where Tot-total occurrence [17] ohase. In order to choose the best configuration for the aNN several choices for the number of neurons in the hidden layer inverted total frequency(ITF)-total frequency of the word have been tested in the initial document sample Below, we present the comparison results for 3 ANN T-Boolean value indicating whether the word appears in configurations the title or not (no attributes+no classes) with the result pos=f where i-the number of the sentence in which the word appears for the first time Confusion Matrix -(43 342)a 293436b SPO-Boolean value indicating whether the word is a where ibject, predicate or an object in the sentences from the first column(a-is keyword first paragraph For determining the grammatical category second(b)-non-keyword of a word we used Gate and we integrated the Stanford Correctly Classified Instance 3479 90.3636% Therefore, the neuronal network will have six input nodes Other characteristics of the ANn are: one hidden layer Incorrectly Classified Instances.6364% usually it is not indicated to use more than one hidden layer he number of hidden nodes is given by the formula Nn= no attributes no classes, with the results (no attributes + no classes) Confusion Matrix =(44 341 a 453420丿b no. attributes-represents the number of features (in our case noclasses -represents the number of classes in which words Correctly Classified Instance 346489.974% can be classified (in our case 2: keyword and non-keyword)
[13]. The article uses key sentences for automatically summarizing documents by applying statistical methods. Our paper uses a neuronal network approach for identifying keywords. Document indexing is a useful in a lot of Natural Language Processing application such as text mining, knowledge retrieval, keywords extraction, etc [14]. Indexing defines the process of converting a document into a list of words included in it. The result of indexing is represented by a list of words called index language [15]. The selection process of document indexing has the following steps: Step 1: Eliminate stop words. Stop words are words which have grammatical functionality, but are not related to the content of the document. Some possible stop words are: conjunction, prepositions, auxiliary verbs, articles, pronouns, etc. Step 2: For the remaining of the document words in the list perform word stemming. This means that we extract the root of the words. In order to achieve this, we use Porter stemmer for English language. Step 3: This step consists in performing a neuronal network based filtering so that to extract only those words that are relevant to the context. The input nodes for the neural network are represented by the following features that can be grouped into statistical features (TF, IDF, ITF) [15], position features (T, Fpos) [16], grammatical syntax (subject, object, predicate (SPO)): term frequency(TF)- the frequency of each term in the document; inverted document frequency(IDF)-the number of documents that contain the word from the initial document sample; IDF= , where - total occurrence [17] inverted total frequency(ITF)-total frequency of the word in the initial document sample; T-Boolean value indicating whether the word appears in the title or not; Fpos= √ where i-the number of the sentence in which the word appears for the first time SPO-Boolean value indicating whether the word is a subject, predicate or an object in the sentences from the first paragraph. For determining the grammatical category of a word we used Gate and we integrated the Stanford Parser [18][19]. Therefore, the neuronal network will have six input nodes. Other characteristics of the ANN are: one hidden layer, usually it is not indicated to use more than one hidden layer; the number of hidden nodes is given by the formula: where: - represents the number of features (in our case equals six) - represents the number of classes in which words can be classified (in our case 2: keyword and non-keyword). The artificial network model can be seen in Fig. 2. In order to train the neuronal network we used the back propagation algorithm implemented in Weka [20] on a set of manually classified instances. For the training set we have chosen documents from different sources, with a focus on newspaper articles. We created a java application that automatically extracts words in the document by following the steps presented above and computes the values for them. We then manually classify the extracted potential keywords into two groups: keywords and non-keywords. We use the resulting set of classified instances to train the ANN. Fig. 2 Artificial neuronal network for extracting relevant words We tested the artificial neuronal network on a set of 50 English documents. From the set of 3850 identified possible keywords, only 10% were selected in the manual classification phase. In order to choose the best configuration for the ANN, several choices for the number of neurons in the hidden layer have been tested. Below, we present the comparison results for 3 ANN configurations: - with the results: Confusion Matrix =( ) where: first column(a) - is keyword second(b) - non-keyword Correctly Classified Instance 3479 90.3636 % Incorrectly Classified Instances 371 9.6364 % - , with the results: Confusion Matrix =( ) Correctly Classified Instance 3464 89.974% INTERNATIONAL JOURNAL OF COMPUTERS Issue 3, Volume 5, 2011 372
NTERNATIONAL JOURNAL OF COMPUTERS ssue 3. Volume 5. 2011 incorrectly Classified Instances 38610.026% where SimLin(X1, X2)-the level of similarity between concepts 1 =10. with the results and 2 computed by Lin formula Confusion MatrIx< /45 340 S(X1, X2)the degree of shared information for X, and X2 493416 Correctly Classified Instances 346189.8961% IC(X)-information content (IC)of X concept Incorrectly Classified Instances.1039% The semantic matching evaluation will performed gainst the user key search phrases and the key words extracted We also compared the results with the Naive Bayesian from the do For instance. if a tourist introduces"modern art' as search Confusion matrix phrase, and in a document about a military museum it is 3437 specified that there is a painting created by a modern art ainter"Paul Gauguin", then the POI will be suggested to the Correctly Classified Instances 3472 90. 1818% Incorrectly Classified Instances37898182% MULTI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR GENERATING ITINERARIES We conclude that the ANN configuration initially presented Besides the semantic search approach described in the ives us better results previous section, users can also choose POl from a suggestion First in order to understand the sense of the words as it is importance, by changing the order g the selected POIs are After the document indexing phase is completed, we can list built using collaborative filtering. All displayed in a common list, allowin change their used in the documents we perform word disambiguation. For We frame the problem of building tourist itineraries as an this, we query an external corpus, YAGO [21], based on extension of the classic Orienteering Problem -OP, also wikipedia and WordNet that contains both word definitions known as the Time Dependent Orienteering Problem with and the semantic relations with other words in the same Time Windows-TDOPTW. As the Orienteering Problem is synonymic class. For word sense disambiguation we identify known to be NP-hard, finding an exact solution for the all possible senses of the words that were previously POS- TDOPTW in near real-time is considered to require a high Parts of Speech tagged. For each meaning of the word we computational power[22]. The proposed system uses a hybrid identify all possible solutions multi-objective genetic algorithm in order to generate Its own definition that includes example texts that itineraries in a short amount of time. Genetic algorithms are a WordNet or other corpus provides to the glosses particular class of evolutionary algorithms that use techniques The dictionary definition of the synonymic groups that are inspired by evolutionary biology such as inheritance connected to it through the "has a'"relations. If there is mutation, selection, and crossover. In order to decrease the more than one relation for a word sense, then the glosses number of required generations, we use the following for each relation are concatenated into a single gloss heuristic: each time we add a new POl to an itinerary we choose the one with the highest ratio between the associated The dictionary definition of the synonymic group that are score, t and the time required to visit the POl connected to it through the"is a"relations The dictionary definition of the synonymic group that are n Is-the start time connected to it through the"part of relations T-the finish time After these steps are taken, the semantic similarity of the synonymic group the words belong to is measured. If a word LF-the finish location has more than one sense, it will appear in multiple synonymic POlselected the set of manually selected tourist attractions that will be included in all groups at various locations in the classification. WordNet defines relations between synonymic groups and relations between word senses. To measure the semantic similarity n POlFavorites -the set of attractions that were marked as favorites by the user between two synonymic groups, the most appropriate relations POl -the global set of tourist attractions that are to be used are the semantic hierarchic ones. Several NSoL-the number of requested solutions types of semantic similarity formulas exist: Conrath, Lin, The output of the algorithm consists in a number of Resnik. In our approach, Lin formula was chosen as it fully Nsol s N:ol Pareto-optimal itineraries including both th takes into account the information content tourist attractions and the traveling directions between them 2*S(X1,X2) The considered optimization criteria are (C(X1)+lC(X2)) Cl-represents the useful visiting time calculated as shown in(2)
Incorrectly Classified Instances 386 10.026% - , with the results: Confusion Matrix =( ) Correctly Classified Instances 3461 89.8961 % Incorrectly Classified Instances 389 10.1039 % - We also compared the results with the Naive Bayesian Classifier: Confusion Matrix =( ) Correctly Classified Instances 3472 90.1818 % Incorrectly Classified Instances 378 9.8182 % We conclude that the ANN configuration initially presented gives us better results. After the document indexing phase is completed, we can perform the semantic matching. First in order to understand the sense of the words as it is used in the documents we perform word disambiguation. For this, we query an external corpus, YAGO [21], based on Wikipedia and WordNet that contains both word definitions and the semantic relations with other words in the same synonymic class. For word sense disambiguation we identify all possible senses of the words that were previously POS – Parts of Speech tagged. For each meaning of the word we identify all possible solutions: Its own definition that includes example texts that WordNet or other corpus provides to the glosses. The dictionary definition of the synonymic groups that are connected to it through the “has a” relations. If there is more than one relation for a word sense, then the glosses for each relation are concatenated into a single gloss string. The dictionary definition of the synonymic group that are connected to it through the “is a” relations. The dictionary definition of the synonymic group that are connected to it through the “part of” relations. After these steps are taken, the semantic similarity of the synonymic group the words belong to is measured. If a word has more than one sense, it will appear in multiple synonymic groups at various locations in the classification. WordNet defines relations between synonymic groups and relations between word senses. To measure the semantic similarity between two synonymic groups, the most appropriate relations that are to be used are the semantic hierarchic ones. Several types of semantic similarity formulas exist: Conrath, Lin, Resnik. In our approach, Lin formula was chosen as it fully takes into account the information content: (1) where: the level of similarity between concepts 1 and 2 computed by Lin formula - the degree of shared information for and concepts X1, X2 - concepts 1 and 2 IC(X) - information content (IC) of X concept. The semantic matching evaluation will be performed against the user key search phrases and the keywords extracted from the documents. For instance, if a tourist introduces “modern art” as search phrase, and in a document about a military museum it is specified that there is a painting created by a modern art painter “Paul Gauguin”, then the POI will be suggested to the user. MULTI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR GENERATING ITINERARIES Besides the semantic search approach described in the previous section, users can also choose POI from a suggestion list built using collaborative filtering. All selected POIs are displayed in a common list, allowing the user to change their importance, by changing the order. We frame the problem of building tourist itineraries as an extension of the classic Orienteering Problem - OP, also known as the Time Dependent Orienteering Problem with Time Windows - TDOPTW. As the Orienteering Problem is known to be NP-hard, finding an exact solution for the TDOPTW in near real-time is considered to require a high computational power [22]. The proposed system uses a hybrid multi-objective genetic algorithm in order to generate itineraries in a short amount of time. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover. In order to decrease the number of required generations, we use the following heuristic: each time we add a new POI to an itinerary we choose the one with the highest ratio between the associated score, and the time required to visit the POI. The itinerary building algorithm has as inputs: - the start time; - the finish time; - the start location; - the finish location; - the set of manually selected tourist attractions that will be included in all generated itineraries; - the set of attractions that were marked as favorites by the user; - the global set of tourist attractions; - the number of requested solutions. The output of the algorithm consists in a number of Pareto-optimal itineraries including both the tourist attractions and the traveling directions between them. The considered optimization criteria are: C1 - represents the useful visiting time calculated as shown in (2); INTERNATIONAL JOURNAL OF COMPUTERS Issue 3, Volume 5, 2011 373
NTERNATIONAL JOURNAL OF COMPUTERS ssue 3. Volume 5. 2011 C2- represents the average score of the visited valid. First, the user selected tourist attractions, POlselected are cations calculated as shown in(3) dded to the new itinerary in a random manner C3-represents the diversity of the itinerary calculated as shown in(4) S1: Initialization Other criteria can also easily be taken into consideration The criteria C3 can also take into consideration the structure f the itinerary 52: Determine pareto fronts Visitt S3: Selection MR=②S)/n Generate the new population S4: Mutation Crossover DV=nj The start and finish locations, Ls and LF, represent the GPS Best itineraries coordinates associated to the locations selected by the user either from the map or from the list of tourist attractions Fig. 3 Itinerary building algorithm Each Point of Interest i= 1.NpoI is characterized by Oi-opening time, Afterwards, the remaining time is completed with tourist tractions selected from the set of favorites, POlFavorites and Si-associated score, from the normal set, POl, A tourist attraction i can only be Feel-entrance fee inserted if the arrival time, Ai, satisfies the conditions in(6) VisitDi -medium visit duration and(7) Loci-GPS coordinates For each user u and tourist attraction i, visitDi represents Ai< Ci-visitDi the real visiting duration and VisitDi the predicted one. For a tourist attraction that has not yet been visited, the estimated Ai+VisitD+ TravelDiLe< TE visiting duration is computed as shown in(5), where Npol represents the number of tourist attractions in the category k to The time needed to visit a tourist attraction can be which the tourist attraction i belongs that have already been computed as shown in( 8)by summing the travel time from visited by the user u the previous location, the waiting time and the visit duration. VisitTDi= travelDili+ waitDi+ visitDi VisitD VisitDl= visitDi VisitOr (5) The waiting duration, WaitD i can be computed as shown in (9), where Ai represents the arrival time The algorithm is run until a maximum number of WaitDi= max(0, Oi-AD (9) generations, GMG is reached or until the best found solution doesn't change for Gcc consecutive generations. Depending Evaluation: All Np itineraries are evaluated based on the on the purpose GMG can be adjusted either for accuracy or three selected criteria: C1. C2. C3. In order to determine the speed. In order to increase the performance with compute in Pareto-optimal fronts we use an approach similar to the one in advance the minimum-Traveldin and maximum time- the NSGA-II algorithm[23] TravelDiar required to travel between locations i and j. The Selection: As we have chosen an elitist approach in which values can be calculated for different hours or week days we copy all non-dominated solutions from Pg in Eg. We We have chosen an Elitist approach [13] in which we use remove all other solutions from Pg and we automatically two populations, the normal one Pg and the population of non- include all solutions from Egin Pa+1. Using mutation and dominated solutions in generation g GMG, Eg crossover we generate Np-EgIsolutions that are added to Pg Initialization: For g =0, we create the initial population in order to have Pa+1 Np. If Egl> Np/2 we use a Po,with a number of Np itineraries called chromosomes. The clustering approach to select Np/2 solutions as different population of non-dominated solutions is also initialized Eo=0. In order to limit the number of required algorithm Mutation: Adds random variation to the evolving iterations,all itineraries in the initial population are generated population. Two links are randomly chosen from the itinerary
C2 - represents the average score of the visited locations calculated as shown in (3); C3 - represents the diversity of the itinerary calculated as shown in (4). Other criteria can also easily be taken into consideration. The criteria C3 can also take into consideration the structure of the itinerary. ∑ (2) ∑ (3) (4) The start and finish locations, and , represent the GPS coordinates associated to the locations selected by the user either from the map or from the list of tourist attractions. Each Point of Interest is characterized by: - opening time; - closing time; - associated score; - entrance fee; - medium visit duration; - GPS coordinates. For each user and tourist attraction , represents the real visiting duration and the predicted one. For a tourist attraction that has not yet been visited, the estimated visiting duration is computed as shown in (5), where represents the number of tourist attractions in the category to which the tourist attraction belongs that have already been visited by the user √∏ (5) The algorithm is run until a maximum number of generations, is reached or until the best found solution doesn’t change for consecutive generations. Depending on the purpose can be adjusted either for accuracy or speed. In order to increase the performance with compute in advance the minimum - and maximum time - required to travel between locations i and j. The values can be calculated for different hours or week days. We have chosen an Elitist approach [13] in which we use two populations, the normal one and the population of nondominated solutions in generation , . Initialization: For , we create the initial population , with a number of itineraries called chromosomes. The population of non-dominated solutions is also initialized . In order to limit the number of required algorithm iterations, all itineraries in the initial population are generated valid. First, the user selected tourist attractions, are added to the new itinerary in a random manner. S1: Initialization S2: Determine Pareto fronts S3: Selection S4: Mutation S5: Crossover Loop condition Best itineraries Generate the new population Fig. 3 Itinerary building algorithm Afterwards, the remaining time is completed with tourist attractions selected from the set of favorites, and from the normal set, . A tourist attraction can only be inserted if the arrival time, , satisfies the conditions in (6) and (7). (6) (7) The time needed to visit a tourist attraction can be computed as shown in (8) by summing the travel time from the previous location, the waiting time and the visit duration. (8) The waiting duration, can be computed as shown in (9), where represents the arrival time. (9) Evaluation: All itineraries are evaluated based on the three selected criteria: C1, C2, C3. In order to determine the Pareto-optimal fronts we use an approach similar to the one in the NSGA-II algorithm [23]. Selection: As we have chosen an elitist approach in which we copy all non-dominated solutions from in . We remove all other solutions from and we automatically include all solutions from in . Using mutation and crossover we generate | |solutions that are added to in order to have | | . If | | we use a clustering approach to select solutions as different as possible. Mutation: Adds random variation to the evolving population. Two links are randomly chosen from the itinerary INTERNATIONAL JOURNAL OF COMPUTERS Issue 3, Volume 5, 2011 374