E-business course reading: A Two-Level Learning Hierarchy of Concept Based Keyword Extraction for Tag Recommendations.pdf_P21-P25

In this work, we test our approach with a dataset obtained from the BibSonomy system, whose bookmarks have, among others, the attributes shown in Table 1.

Table 1. Meta-information available in the BibSonomy system about two different bookmarks: a web page and a scientific publication.

Web page
  URL:         http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html
  Description: Folksonomies - Cooperative Classification and Communication Through Shared Metadata
  Extended:    General overview of tagging and folksonomies. Difference between controlled vocabularies, author and user tagging. Advantages and shortcomings of folksonomies

Scientific publication
  Title:     Semantic Modelling of User Interests Based on Cross-Folksonomy Analysis
  Author:    M. Szomszor and H. Alani and I. Cantador and K. O'hara and N. Shadbolt
  Booktitle: Proceedings of the 7th International Semantic Web Conference (ISWC 2008)
  Journal:   The Semantic Web - ISWC 2008
  Pages:     632-648
  URL:       http://dx.doi.org/10.1007/978-3-540-88564-1_40
  Year:      2008
  Month:     October
  Location:  Karlsruhe, Germany
  Abstract:  The continued increase in Web usage, in particular participation in folksonomies, reveals a trend towards a more dynamic and interactive Web where individuals can organise and share resources. Tagging has emerged as the de-facto standard for the organisation of such resources, providing a versatile and reactive knowledge management mechanism that users find easy to use and understand. It is common nowadays for users to have multiple profiles in various folksonomies, thus distributing their tagging activities. In this paper, we present a method for the automatic consolidation of user profiles across two popular social networking sites, and subsequent semantic modelling of their interests utilising Wikipedia as a multi-domain model. We evaluate how much can be learned from such sites, and in which domains the knowledge acquired is focussed. Results show that far richer interest profiles can be generated for users when multiple tag-clouds are combined.
In our approach, for each bookmark, the text attributes (title, URL, abstract, description, and extended description) are processed with a set of NLP tools [2] and transformed into a weighted list of keywords. These simplified bookmark representations are then stored in an index, which allows fast searches for bookmarks that satisfy keyword- and tag-based queries. In our implementation, we used Lucene⁸, which allowed us to apply keyword stemming, stop word removal, and TF-IDF term weighting.

⁸ Apache Lucene: open-source information retrieval library, http://lucene.apache.org/
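The indexing idea can be illustrated with a minimal in-memory sketch. It is not the Lucene-based implementation used in this work: the function names, the toy bookmarks, and the textbook weighting tf × log(N/df) are assumptions made for illustration only, and Lucene's own scoring differs in detail.

```python
import math
from collections import Counter, defaultdict


def build_index(bookmarks):
    """Build a toy inverted index and TF-IDF vectors.

    bookmarks: dict mapping a bookmark id to its list of keywords.
    Returns (postings, tfidf), where postings maps a keyword to the set of
    bookmark ids containing it, and tfidf maps a bookmark id to a dict of
    keyword weights. Hypothetical helper, not part of Lucene.
    """
    postings = defaultdict(set)
    for bookmark_id, keywords in bookmarks.items():
        for keyword in keywords:
            postings[keyword].add(bookmark_id)

    n_bookmarks = len(bookmarks)
    tfidf = {}
    for bookmark_id, keywords in bookmarks.items():
        counts = Counter(keywords)
        # Textbook weighting: term frequency times log(N / document frequency).
        tfidf[bookmark_id] = {
            keyword: count * math.log(n_bookmarks / len(postings[keyword]))
            for keyword, count in counts.items()
        }
    return postings, tfidf


def search(postings, keywords):
    """Return the ids of all bookmarks that contain at least one keyword."""
    hits = set()
    for keyword in keywords:
        hits |= postings.get(keyword, set())
    return hits


# Toy data, invented for this example.
toy_bookmarks = {
    "b1": ["ontology", "recommendation"],
    "b2": ["recommendation", "tagging"],
    "b3": ["folksonomy"],
}
toy_postings, toy_tfidf = build_index(toy_bookmarks)
toy_hits = search(toy_postings, ["ontology", "tagging"])
```

A keyword that occurs in fewer bookmarks receives a higher weight, which is the property the index relies on when ranking candidate bookmarks.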

4 Social tag recommendation

In this section, we describe our approach to recommending social tags for a bookmark, which does not need to be already tagged. The recommendation process is divided into five stages, depicted in Figure 1. Each of these stages is explained in detail in the next subsections. For a better understanding, the explanations follow a common illustrative example.

[Figure 1. Tag recommendation process. Diagram labels: input bookmark, bookmark text fields processing, keywords, bookmark retrieval, index and search engine, similar bookmarks, tag retrieval, related tags, tag co-occurrence graph, graph vertex centrality, tag selection, recommended tags, tag recommendation; the stages are numbered 1 to 5.]

4.1 Extracting bookmark keywords

The first stage of our tag recommendation approach (identified by label 1 in Figure 1) is the extraction of keywords from some of the textual contents of the input bookmark. According to the document model explained in Section 2, we extract such keywords from the title, URL, abstract, description and extended description of the bookmark. We experimented with processing other attributes such as authors, user comments, and book and journal titles, but obtained worse recommendation results. The noise (in the case of personal comments) and the generality (in the case of authors and book/journal titles) led to the suggestion of social tags unrelated to the content topics of the web page or scientific publication associated with the bookmark.

For plain text fields of the bookmark, such as title, abstract and descriptions, we filter out numeric characters and discard stop words from English, Spanish, French, German and Italian, which were identified as the predominant languages of the bookmarks available in our experimental datasets. We also apply transformations to LaTeX expressions. Finally, we remove punctuation symbols, parentheses, and exclamation and question marks, and discard special terms such as paper, work, section and chapter, among others.

For the URL field, we first remove the network protocol (HTTP, FTP, etc.), the web domain (com, org, edu, etc.), the file extension (html, pdf, doc, etc.), and any GET arguments for CGI scripts. Next, we tokenise the remaining text, removing the dots (.) and slashes (/). Finally, we discard numeric words and several special words such as index, main, default and home, among others.

In both cases, a natural language processing tool [2] is used to singularise the resulting keywords and to filter out those that are not nouns.
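The URL processing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `url_keywords`, the rule for recognising a file extension, and the inclusion of "www" among the special words are hypothetical choices, and the singularisation and noun filtering performed by the NLP tool [2] are left out.

```python
import re
from urllib.parse import urlsplit

# Special words discarded from URLs. "index", "main", "default" and "home"
# come from the text; "www" is an assumption made for this sketch.
SPECIAL_URL_WORDS = {"www", "index", "main", "default", "home"}


def url_keywords(url):
    """Extract candidate keywords from a URL (hypothetical helper)."""
    parts = urlsplit(url)  # separates protocol, host, path and GET arguments
    host_labels = parts.hostname.split(".") if parts.hostname else []
    host_labels = host_labels[:-1]  # drop the web domain (com, org, edu, ...)
    # Drop a trailing file extension such as .html, .pdf or .doc (assumed rule).
    path = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", parts.path)
    # Tokenise on dots and slashes; the query string is never included.
    tokens = host_labels + re.split(r"[./]+", path)
    return [
        token.lower()
        for token in tokens
        if token and not token.isdigit() and token.lower() not in SPECIAL_URL_WORDS
    ]


example_url_keywords = url_keywords("http://www.configworks.com/AICOM/")
```

For the URL used in this example the sketch yields `configworks` and `aicom`; a singularisation step such as the one described above would then be needed to obtain singular forms.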

Table 2 shows the content of an example bookmark whose tag recommendations are explained in the rest of this section. It also lists the keywords extracted from the bookmark in the first stage of our approach. The bookmarked document is a scientific publication. Its main research fields are recommender systems and semantic web technologies. It describes a content-based collaborative recommendation model that exploits semantic (ontology-based) descriptions of user and item profiles.

Table 2. Example of bookmark for which the tag recommendation is performed, and the set of keywords extracted from it.

  Title:              A Multilayer Ontology-based Hybrid Recommendation Model
  Authors:            Iván Cantador, Alejandro Bellogín, Pablo Castells
  URL:                http://www.configworks.com/AICOM/
  Journal title:      AI Communications
  Abstract:           We propose a novel hybrid recommendation model in which user preferences and item features are described in terms of semantic concepts defined in domain ontologies. The concept, item and user spaces are clustered in a coordinated way, and the resulting clusters are used to find similarities among individuals at multiple semantic layers. Such layers correspond to implicit Communities of Interest, and enable enhanced recommendation.
  Extracted keywords: multilayer, ontology, hybrid, recommendation, configwork, aicom, ai, communication, user, preference, semantic, concept, domain, ontology, item, space, way, cluster, similarity, individual, layer, community, interest

In this stage, we applied a simple mechanism to obtain a keyword-based description of the contents of the bookmarked document (web page or scientific publication). Note that more complex approaches could be used. For example, instead of being limited to the bookmark attributes, we could also extract additional keywords from the bookmarked document itself. Moreover, external knowledge bases could be exploited to infer new keywords related to the ones extracted from the bookmark.
These are issues to be investigated in future work.

4.2 Searching for similar bookmarks

The second stage (label 2 in Figure 1) consists of searching for bookmarks that contain some of the keywords obtained in the previous stage.

The keywords extracted from the input bookmark are weighted according to their frequency of appearance in the bookmark attributes, and are included in a weighted keyword-based query. This query represents an initial description of the input bookmark. More specifically, in the query $q_n$ for bookmark $b_n$, the weight $q_{n,k} \in [0,1]$ assigned to each keyword $k$ is computed as the number of times the keyword appears in the bookmark attributes divided by the total number of keywords extracted from the bookmark:

$$q_n = \{q_{n,1}, \ldots, q_{n,k}, \ldots, q_{n,K}\}, \quad \text{where} \quad q_{n,k} = \frac{f_{n,k}}{\sum_{k'=1}^{K} f_{n,k'}},$$

$f_{n,k}$ being the number of times keyword $k$ appears in the fields of bookmark $b_n$, and $K$ the number of distinct keywords extracted from it.

The query is then launched against the index described in Section 2. Thus, we not only take into account the relevance of the keywords for the input bookmark, but also rank the list of retrieved similar bookmarks.

The search result is a set of bookmarks that are similar to the input bookmark, assuming that "similar" bookmarks have common keywords. Using the cosine similarity measure for the vector space model [14], the retrieved bookmarks are assigned scores $s_{n,m} \in [0,1]$ that measure the similarity between the query $q_n$ (i.e., the input bookmark $b_n$) and the retrieved bookmarks $b_m$:

$$s_{n,m} = \cos(\mathbf{q}_n, \mathbf{b}_m) = \frac{\mathbf{q}_n \cdot \mathbf{b}_m}{\lVert \mathbf{q}_n \rVert \, \lVert \mathbf{b}_m \rVert}$$

For the example input bookmark, Table 3 shows the keywords, query, and some similar bookmarks obtained in the second stage of our tag recommendation model.

Table 3. Extracted keywords, generated query, and retrieved similar bookmarks for the example input bookmark.

  Input bookmark: A Multilayer Ontology-based Hybrid Recommendation Model
  Keywords: multilayer, ontology, hybrid, recommendation, configwork, aicom, ai, communication, user, preference, semantic, concept, domain, ontology, item, space, way, cluster, similarity, individual, layer, community, interest
  Query: recommendation^0.125, ontology^0.09375, concept^0.0625, hybrid^0.0625, item^0.0625, layer^0.0625, multilayer^0.0625, semantic^0.0625, user^0.0625, aicom^0.03125, cluster^0.03125, configwork^0.03125, individual^0.03125, interest^0.03125, communication^0.03125, community^0.03125, preference^0.03125, similarity^0.03125, space^0.03125, way^0.03125
  Similar bookmarks:
    • Improving Recommendation Lists Through Topic Diversification
    • Item-Based Collaborative Filtering Recommendation Algorithms
    • Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments
    • Automatic Tag Recommendation for the Web 2.0 Blogosphere using Collaborative Tagging and Hybrid ANN semantic structures
    • PIMO - a Framework for Representing Personal Information Models
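The query construction and the cosine scoring of this stage can be sketched in a few lines. This is a minimal illustration, not the Lucene-based retrieval used in this work: the helper names, the short keyword list and the bookmark vector are invented for the example, and `term^weight` is used only because it matches the query notation of Table 3.

```python
import math
from collections import Counter


def build_query(keywords):
    """Weight each keyword by its count divided by the total keyword count."""
    counts = Counter(keywords)
    total = sum(counts.values())
    return {keyword: count / total for keyword, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def query_string(query):
    """Render the query as 'term^weight' pairs, highest weight first."""
    ordered = sorted(query.items(), key=lambda item: (-item[1], item[0]))
    return " ".join(f"{term}^{weight:g}" for term, weight in ordered)


# Hypothetical keyword occurrences and bookmark vector.
example_query = build_query(["recommendation", "recommendation", "ontology", "hybrid"])
example_text = query_string(example_query)
example_score = cosine(example_query, {"recommendation": 1.0, "ontology": 1.0})
```

Because the weights are relative frequencies, they always sum to 1, which can also be checked against the query weights listed in Table 3.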

In this stage, we attempted to define and contextualise the vocabulary that is likely to describe the contents of the bookmarked document. For that purpose, the initial set of keywords extracted from the input bookmark was used to find related bookmarks, assuming that the keywords and social tags of the latter are useful to describe the content topics of the former.

4.3 Obtaining related social tags

Once the set of similar bookmarks has been retrieved, in the third stage (label 3 in Figure 1), we collect and weight all their social tags. The weight assigned to each tag represents how much it contributes to the definition of the vocabulary that describes the input bookmark. Based on the scores $s_{n,m}$ of the bookmarks $b_m$ retrieved in the previous stage, the weight $w_n(t)$ of a tag $t$ for the input bookmark $b_n$ is given by:

$$w_n(t) = \sum_{m \,:\, t \in tags(b_m)} s_{n,m}$$

At this point, we could finish the recommendation process by suggesting the social tags with the highest weights $w_n(t)$. However, by doing this, we would not take into account tag popularities and tag correlations, which are very important features of any collaborative tagging system. In fact, we conducted experiments evaluating recommendations based on the highest weighted tags, and we obtained worse results than the ones provided by the whole approach presented herein.

Table 4 shows a subset of the tags retrieved from the bookmarks that were retrieved in Stage 2 for the example input bookmark. The weights $w_n(t)$ for each tag are also given in the table.

Table 4. Weighted subset of tags retrieved from the list of bookmarks that are similar to the example input bookmark.
Input bookmark: A Multilayer Ontology-based Hybrid Recommendation Model

Related tag              Weight    Related tag              Weight    Related tag              Weight
recommender              10.538    clustering               2.013     dataset                  0.871
recommendation           6.562     recommendersystems       1.669     evaluation               0.786
collaborative            5.142     web                      1.669     suggestion               0.786
filtering                5.142     information              1.539     semantics                0.786
collaborativefiltering   3.585     ir                       1.378     tag                      0.786
ecommerce                3.138     retrieval                1.378     tagging                  0.786
personalization          3.138     contentbasedfiltering    1.006     knowledgemanagement      0.290
cf                       2.757     ontologies               1.006     network                  0.290
semantic                 2.745     ontology                 1.006     neural                   0.290
semanticweb              2.259     userprofileservices      1.006     neuralnetwork            0.290
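The tag weighting of this stage amounts to summing, for each tag, the similarity scores of the retrieved bookmarks that carry it. The sketch below illustrates this under stated assumptions: the function name `tag_weights`, the bookmark identifiers, the scores and the tag assignments are all hypothetical and are not taken from the experiments.

```python
from collections import defaultdict


def tag_weights(scores, bookmark_tags):
    """Sum the similarity scores of the retrieved bookmarks that carry each tag.

    scores: dict mapping a retrieved bookmark id to its similarity score.
    bookmark_tags: dict mapping a bookmark id to its list of social tags.
    """
    weights = defaultdict(float)
    for bookmark_id, score in scores.items():
        # A tag is counted once per bookmark, even if it is listed twice.
        for tag in set(bookmark_tags.get(bookmark_id, ())):
            weights[tag] += score
    return dict(weights)


# Hypothetical retrieved bookmarks with their scores and tags.
example_scores = {"b1": 0.5, "b2": 0.25}
example_tags = {"b1": ["recommender", "cf"], "b2": ["recommender"]}
example_weights = tag_weights(example_scores, example_tags)
ranked_tags = sorted(example_weights.items(), key=lambda item: -item[1])
```

Since each weight is a sum over several bookmarks, it is not bounded by 1, which is consistent with the values above 1 reported in Table 4.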