当前位置：和泉文库 > 计算机 > 浏览文档

《电子商务 E-business》阅读文献：Keyword clustering for user interest profiling refinement within paper recommender systems

文件格式：PDF，文件大小：1.88MB，售价：4.45元

文档详细内容（约15页）

The Journal of Systems and Software 85(2012)87-101 Contents lists available at Science Direct The Journal of Systems and Software ELSEVIER journalhomepagewww.elsevier.com/locate/iss Keyword clustering for user interest profiling refinement within paper recommender systems Xiaoyu Tang, Qingtian Zeng College of information Science and Engineering. Shandong University of Science and Technology, No 579 Qianwangang Road, Qingdao 266510, PR China ARTICLE INFO A BSTRACT To refine user interest profiling, this paper focuses on extending scientific subject ontology via key- word clustering and on improving the accuracy and effectiveness of recommendation of the electronic in online services. A ne 23 July 2011 the purpose of the subject ontology extension. Based on the keyword clusters, the construction of ser interest profiles is presented on a rather fine granularity level. In the construction of user inter- est profiles, we apply two types of interest profiles: explicit profiles and implicit profiles. The explicit profiles are obtained by relating users' interest-topic relevance factors to users'interest measurements ser interest profiles on the basis of the correlative relationships among the topic nodes in topic network graphs. Three recommender systems experiments are conducted which reveal that the uses of the subject ontology extension approach as Ontology extension vell as the two types of interest profiles satisfyingly contribute to an improvement in the accuracy of Crown Copyright o 2011 Published by Elsevier Inc. All rights reserved. 1. Introduction will probably agree There are also case-based approache like CASPER (Smyth et al., 2002). Entree system(Burke et al, 1997) Recommender systems form or work from a specific type of and some other profiling methods based on artificial neutral net formation filtering system technique that attempts to recom- work, such as Tan et al. (1998), Shepherd et al. (2002)and Kim et al end information items that are likely to be of interest to the (2002). user. Typically, a recommender system compares a user profile to he utilization of ontologies in user profiling techniques has certain reference characteristics and seeks to predict the 'rating gained much attention since it allows inference to be employed that a user would give to an item they had not yet considered enabling interests to be discovered that were not directly observed (Wikipedia, 2010). The current generation of recommendation in the user's behavior (Middleton et al, 2004: Zeng et al, 2009: methods is usually classified into the following three main cate- Wu et al, 2009). Additionally, once profiles are represented using gories: content-based, collaborative, and hybrid recommendation an ontology, they can communicate with other ontologies which pproaches(Adomavicius and tuzhilin, 2005 ). Among the ma share similar concepts, which contributes to knowledge reuse and techniques of recommender systems, user profiling forms the basis abates the effect of the cold-start problem(Felden and Linden, of such systems. At the same time, much work has been done in 2007: Middleton et al., 2004). In the Foxtrot system(Middleton academia and industry on creating user profiles since the emer- et al., 2004). researchers improved recommendations based on gence of recommender systems. For example, VSM-based profiling the ontological profiling method by visualizing the profiles them pproaches are a basic way of conducting user interest modeling selves and presenting them to users. blanco-Fernandez et al. (2008 (Balabanovic and Shoham, 1997). In GroupLens, profiles of all users used a domain ontology in which the semantic descriptions of the are aggregated in a user-item rating matrix, in which each line cor- available items(e.g TV programs, books, CDs, etc )are formalized responds to a user, and each column to an item(Resnick et al. Their reasoning-based recommendation strategy discovers seman- 1994). Collaborative recommender systems produce recommen- tic relationships between the users' preferences and the items dations based on the heuristic that people who agreed in the past available in the domain ontology. These relationships provide the system with extra knowledge about the users'interests, thus favor In this paper, we propose a refined ontological profiling metho es:lonewar@163.com(xTang).qtzeng@163.com. based on our extended subject ontology. Two kinds of interest 1.cn,qingtianzeng@hotmail.com(QZeng). profiles are introduced: explicit profiles and implicit profiles. The 0164-1212/s-see front matter. Crown Copyright o 2011 Published by Elsevier Inc. All rights reserved. i:10.1016Js201107029

The Journal of Systems and Software 85 (2012) 87–101 Contents lists available at ScienceDirect The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss Keyword clustering for user interest profiling refinement within paper recommender systems Xiaoyu Tang, Qingtian Zeng∗ College of Information Science and Engineering, Shandong University of Science and Technology, No. 579 Qianwangang Road, Qingdao 266510, PR China article info Article history: Received 5 October 2010 Received in revised form 8 July 2011 Accepted 13 July 2011 Available online 23 July 2011 Keywords: Weighted keyword graph Keyword clustering User interest profiles Recommender systems Ontology extension abstract To refine user interest profiling, this paper focuses on extending scientific subject ontology via keyword clustering and on improving the accuracy and effectiveness of recommendation of the electronic academic publications in online services. A clustering approach is proposed for domain keywords for the purpose of the subject ontology extension. Based on the keyword clusters, the construction of user interest profiles is presented on a rather fine granularity level. In the construction of user interest profiles, we apply two types of interest profiles: explicit profiles and implicit profiles. The explicit profiles are obtained by relating users’ interest-topic relevance factors to users’ interest measurements of these topics computed by a conventional ontology-based method, and the implicit profiles are acquired on the basis of the correlative relationships among the topic nodes in topic network graphs. Three experiments are conducted which reveal that the uses of the subject ontology extension approach as well as the two types of interest profiles satisfyingly contribute to an improvement in the accuracy of recommendation. Crown Copyright © 2011 Published by Elsevier Inc. All rights reserved. 1. Introduction Recommender systems form or work from a specific type of information filtering system technique that attempts to recommend information items that are likely to be of interest to the user. Typically, a recommender system compares a user profile to certain reference characteristics and seeks to predict the ‘rating’ that a user would give to an item they had not yet considered (Wikipedia, 2010). The current generation of recommendation methods is usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches (Adomavicius and Tuzhilin, 2005). Among the main techniques of recommender systems, user profiling forms the basis of such systems. At the same time, much work has been done in academia and industry on creating user profiles since the emergence of recommender systems. For example, VSM-based profiling approaches are a basic way of conducting user interest modeling (Balabanovic and Shoham, 1997 ´ ). In GroupLens, profiles of all users are aggregated in a user-item rating matrix, in which each line corresponds to a user, and each column to an item (Resnick et al., 1994). Collaborative recommender systems produce recommendations based on the heuristic that people who agreed in the past ∗ Corresponding author. E-mail addresses: lonewar@163.com (X. Tang), qtzeng@163.com, qtzeng@sdust.edu.cn, qingtianzeng@hotmail.com (Q. Zeng). will probably agree again. There are also case-based approaches like CASPER (Smyth et al., 2002), Entrée system (Burke et al., 1997), and some other profiling methods based on artificial neutral network, such as Tan et al. (1998), Shepherd et al. (2002) and Kim et al. (2002). The utilization of ontologies in user profiling techniques has gained much attention since it allows inference to be employed, enabling interests to be discovered that were not directly observed in the user’s behavior (Middleton et al., 2004; Zeng et al., 2009; Wu et al., 2009). Additionally, once profiles are represented using an ontology, they can communicate with other ontologies which share similar concepts, which contributes to knowledge reuse and abates the effect of the cold-start problem (Felden and Linden, 2007; Middleton et al., 2004). In the Foxtrot system (Middleton et al., 2004), researchers improved recommendations based on the ontological profiling method by visualizing the profiles themselves and presenting them to users. Blanco-Fernández et al. (2008) used a domain ontology in which the semantic descriptions of the available items (e.g., TV programs, books, CDs, etc.) are formalized. Their reasoning-based recommendation strategy discovers semantic relationships between the users’ preferences and the items available in the domain ontology. These relationships provide the system with extra knowledge about the users’ interests, thus favoring more accurate personalization processes. In this paper, we propose a refined ontological profiling method based on our extended subject ontology. Two kinds of interest profiles are introduced: explicit profiles and implicit profiles. The 0164-1212/$ – see front matter. Crown Copyright © 2011 Published by Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2011.07.029

88 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 construction algorithms for profiling explicit interests and for pro- filing implicit interests are also proposed. Based on our extended subject ontology, we take into account the pertinence of keywords and the classifications of the ontology in the calculation of explicit interest profiles. The method for inferring users’ potential interests (implicit interests) proposed employs a quantified evaluation of relationships between classifications in the ontology. The paper is organized into seven sections. Section 2 discusses the related work. Section 3 presents a brief description of our recommender system, based on which we evaluate our approach. In Section 4, we propose a way of clustering keywords for an existing ontology, in order to create a more detailed and accurate structure of the existing ontology. In Section5, the approaches to user interest profiles, which contain both explicit and implicit interest profiles, are presented. Section 6 introduces our prototype system called SPRS, and using this system we conducted three experiments. We evaluate the subject ontology extension method and refinement of user profiling approach. Section 7 concludes the paper and discusses future work. 2. Related work Ontology is a conceptualization of a domain into a humanunderstandable butmachine-readable format consisting of entities, attributes, relationships, and axioms (Guarino and Giaretta, 1995). It is used to alleviate the communication problems between systems due to ambiguous usage of different terms. For building user profiles, ontologies are used to address the so-called “cold-start problem” (Middleton et al., 2002; Susan and Gauch, 2004).Cantador et al. (2008) mapped social tagging information from multiple sources to their ontological structures that described the domains of interest covered by the tags, in order to build user profiles. Moreover, Middleton et al. (2004) used a term ontology to refer to the classification structure and instances within a knowledge base, representing the profiles in terms of a research paper topic ontology, which allows other interests to be inferred that go beyond those only seen in directly observed behavior. They also employ pro- file visualization to acquire profile feedback from users to improve profiling accuracy. A number of strategies have been implemented to facilitate the construction of ontology. For example, Mika (2007) extended the traditional bipartite model of ontologies with the social dimension, leading to a tripartite model of actors, concepts and instances. He also demonstrated the application of this representation by showing how community-based semantics emerges from this model through a process of graph transformation. In addition, Zhang et al. (2010) proposed a suite of ontology metrics, at both the ontologylevel and the class-level, to measure the design complexity of ontologies. Finally, in recent years complex network and other forms of network models such as bipartite network have been considered to be important aspects of recommender systems by many researchers (Zanin et al., 2008; Zhou et al., 2007, 2010). When focusing on the problem of recommending items to a user, the underlying transaction data can be seen as a bipartite network, in which users and items are represented as two groups of nodes, connected to each other by certain links (Zanin et al., 2008). In order to utilize the bipartite network, a one-mode projecting method is usually implemented as an alternative to using the bipartite network directly. Zhou et al. (2007) raised a novel one-mode projecting method to compress the bipartite network and better preserve the original information. Zhou et al. (2010) introduced and used evaluation criteria for the diversification of recommendation, aside from accuracy. They believe the next generation of information filtering methods should focus on not only precision but also diversification, and that a balance between them should be sought. There are some problems in the existing ontologies (Adomavicius and Tuzhilin, 2005; Correa da Silva et al., 2002) and the user interest profiling methods based on them (Middleton et al., 2001, 2003, 2004; Cantador et al., 2008; Susan and Gauch, 2004; Felden and Linden, 2007). (1) Most of the ontologies in use are framed in coarse granularity, making the classification of items obscure and undetermined. Therefore, the effectiveness of user profiling techniques based on such ontologies, no matter how sophisticated the interest profiling algorithms are, would deteriorate because of the coarse classification. (2) Typically, ontologies are usually predefined manually and remain fixed during a certain period of time, which makes them insensitive to new changes. When new research subjects emerge or other changes happen, the modifications must be done by their creators manually. This reaction is slow and tardy and needs human involvement. Moreover, typical ontology-based user interest calculation algorithms compute the interest value on each category for every user; however, these methods are not always effective. For instance, assume that two users, user A and user B have both viewed a certain number of papers in the subject “artificial intelligence” and we obtain the same interest value for their interests in this subject; therefore we believe their interests in “artificial intelligence” are no different. However, in fact some papers viewed by user A are about “support vector machine”, and the other papers viewed by this user are about “genetic algorithm”, whereas the papers viewed by user B are all related to “natural language processing”. In this case, the relatively subtle differences between the users’ interests in “artificial intelligence” are not discovered, leading to the description of user interests and the subsequent recommendation being inaccurate. (3) It is difficult for conventional approaches to differentiate the items within the same class (or subject). That is, in the case of research paper recommendation, different papers in the same subject contribute essentially equally to the construction of user profiles and the differences in textual information between papers is neglected, which is evidently unreasonable. (4) The inference of topics of interest via ontological relations between topics that have not been browsed explicitly by users is not precise enough. For instance, assume that there are three subclasses A, B and C under their immediate super class A, and a user has an interest in subclass A with an interest value of 0.4. With the aforementioned conventional profiling method, we get 0.2 for the user’s interest value of the super class A. In this scenario, the user will receive equivalent recommendations from subclass B and C, because the conventional profiling method does not take into account the differences between subclass B and C, leading to inaccuracy of the inference about a user’s implicit interest because these differences may be relevant to the user’s preferences. 3. Framework for generating user interest profiles In this section, we first present the framework for generating user interest profiles within online paper recommender systems, which is shown in Fig. 1. The main components in the framework include: (1) Subject ontology. The subject ontology is predefined by domain experts with suitable granularity and scale, which presents the organization structure of domain knowledge and serves as the taxonomy for research papers. Moreover, it is the basis of the user profile. To refine the subject ontology, we will discuss our method of extending the subject ontology through automatically clustering weighted keyword graph in Section 4. (2) Paper management module. Users can upload, browse, download and comment on any research papers through the user interface of the paper management module. All of the research papers are

X Tang Q Zeng/ The Journal of Systems and Software 85(2012)87-101 日 per Management Module Pa User 1 te rface 日 Subject User Behavior Monitoring E Subject2.1.1 Module User n User ProfilIng Module Behavi Datab ser Profiles Fig. 1. Framework for generating user interest profiles. stored in the paper database. Each paper in the paper database In the paper database storing the research paper da is classified according to the subject ontology and can readily each papers information contains its corresponding category be retrieved by users. The paper management module plays the mark based on the subject ontology. In this paper, vector role of fundamental component in the whole framework. space model is used to represent the research papers. Key (3)User behavior monitoring module This module is responsible for words are provided by the papers authors, representing the the background collection of the behavior data of each user. key content of each corresponding paper. In the vector space The user behavior data include searching keywords, browsing, model, a paper is represented by a keyword vector, i.e., downloading and commenting on papers, etc. The monitoring paper=(keywordl,., keywor keyword)(1≤i≤n).The nd collecting processes are totally unobtrusive. weight of the ith keyword keyword in paper is (4)User profiling module. The user profiling module makes use of as WKP(keyword, paper), which is computed by the tE the user behavior data recorded by the user behavior moni- oring module the paper database and the subject to create user profiles. The user profiles obtained ca to recommend papers to these users. We will elaborate our COMPUTER SICENCI profiling approach in Section 6. 4. Automatic ontology extension through clustering Communications and Theory of computation Algorithms and data structures Databases In order to solve the problems in the user profiles based on the traditional ontologies, we pi ontology extension algorithm Programming languages and to refine the user profiles. Before presenting it, we introduce an Artificial intelligence com pliers riginal subject ontology, which is defined by the Science Paper Concurrent, paralleL, and Online website(Sciencepaper Online, 2010). It is a taxonomy of distributed systems Computer graphics research subjects and has been in use on the Internet for many ears. This simple ontology consists of two levels of classification, Software engineering Scientific computing primary subjects and secondary subjects, and it holds is-a relation- ships between the subjects in different levels. In the first level, there Computer architecture Collaborative Networks are 43 primary subjects. Each primary subject has secondary sub- jects as its subordinate classifications. Fig. 2 shows the section of Fig. 2. The classification of"computer science"in the research paper subject onto- the primary subject"computer science "in this subject ontology

X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 89 Fig. 1. Framework for generating user interest profiles. stored in the paper database. Each paper in the paper database is classified according to the subject ontology and can readily be retrieved by users. The paper management module plays the role of fundamental component in the whole framework. (3) User behavior monitoring module. This module is responsible for the background collection of the behavior data of each user. The user behavior data include searching keywords, browsing, downloading and commenting on papers, etc. The monitoring and collecting processes are totally unobtrusive. (4) User profiling module. The user profiling module makes use of the user behavior data recorded by the user behavior monitoring module, the paper database and the subject ontology, to create user profiles. The user profiles obtained can be used to recommend papers to these users. We will elaborate our profiling approach in Section 6. 4. Automatic ontology extension through clustering weighted keyword graphs In order to solve the problems in the user profiles based on the traditional ontologies, we propose an ontology extension algorithm to refine the user profiles. Before presenting it, we introduce an original subject ontology, which is defined by the Science Paper Online website (Sciencepaper Online, 2010). It is a taxonomy of research subjects and has been in use on the Internet for many years. This simple ontology consists of two levels of classification, primary subjects and secondary subjects, and it holds is–a relationships between the subjects in different levels. In the first level, there are 43 primary subjects. Each primary subject has secondary subjects as its subordinate classifications. Fig. 2 shows the section of the primary subject “computer science” in this subject ontology. In the paper database storing the research paper data, each paper’s information contains its corresponding category mark based on the subject ontology. In this paper, vector space model is used to represent the research papers. Keywords are provided by the paper’s authors, representing the key content of each corresponding paper. In the vector space model, a paper is represented by a keyword vector, i.e., paper = (keyword1,..., keywordi,..., keywordn) (1 ≤ i ≤ n). The weight of the ith keyword keywordi in paperp is denoted as WKP(keywordi, paperp), which is computed by the TF-IDF Fig. 2. The classification of “computer science” in the research paper subject ontology

X Tang, Q Zeng/The Joumal of Systems and Software 85(2012)87-101 algorithm (Term Frequency-Inverse Document Frequency)(Salton and Buckley, 1988): here ps(keyword) is the set of papers containing keyword WKP(keyword, paper)is the weight of keyword in PKV (2)The set of edges CRC(KNS x KNS) represents the relations between different keyword nodes in KNS 1)WKP(keyword, papers)is the weight of the keyword weight of R where Ri is the edge linking keyword;and erp)is the keyword: Ps(keyword) is the paper set containing keyword, keyword in paper: (3)N indicates the the papers; (4)nk is the frequency of the keyword the function count() counts the number of papers of which the keyword vectors contain both keyword; and keyword. Le, PS(keyword)n PS(keyword) In the following discussions, we use PKVi (WKP(keyword, paperi),., WKP(keyword, paper i),..., WKP In the paper-keyword relation model (PKRM), if any two key (keywords, papern )) to represent paperi in which the quantitative quantitative word nodes in KNS are connected to one or more paper nodes in value of keyword is computed by the TF-IDF algorithm( Salton and PNS, then there is an edge connecting the two keyword nodes in uckley, 1988). By computing the TF-IDF values of keywords we CR of WKG WGT(Rii)indicates the number of co-occurrences of the obtain the keyword vectors of all papers. two keywords in papers. During the subsequent keyword clusterin The subject ontology we have discussed so far is still coarse process WGT(Rij )are changed granular and thus not precise enough to use. In order to produce For example, a weighted keyword graph of the subjectartifi a more finely granular ontology, we now introduce an automatic cial intelligence", which is created with Pajek( Batagelj and Mrvar, clustering algorithm for weighted keyword graph, which includes 1998), is illustrated in Fig 4. The keywords engaged are from 200 hree main steps, as follows papers in the subject of"artificial intelligence". The construction First, we define the paper-keyword relation model represented of the weighted keyword graph depends on the paper set. As the by a bipartite graph. A bipartite graph is a graph whose vertices paper set is changed, so the result may be different. can be divided into two disjoint sets U and V such that every edge nally, we cluster the weighted keyword graph to extend the connects a vertex in U to one in V: that is, U and V are independent subject ontology The keyword clustering is virtually the process of sets(West, 2000).Usually the authors specify a few keywords for increasing the edge weight threshold and finding a practical one. In their papers in order to demonstrate the contents or topic of the Fig 4, we can clearly see that keywords related to similar themes tend to cluster together. These keyword bundles naturally emerge papers to create the bipartite graphs between papers and keywords. and signify the tendency towards a concentration of the contents Here, we introduce the definition of the paper-keyword relation among similar papers. Through the keyword clustering process. these concentrated keyword bundles will be decomposed into sep Definition 1(Paper-keyword relation model). A paper-keyword arate ones. Now we present the keyword clustering procedure relation model is a bipartite graph PKRM=(PNS, KNS, RS), where (1)According to Definition 2, the weighted keyword graph WKG of (1)The set of nodes PNS= paper1, paper, papern represents a set of papers. (2)Define a threshold S2 for the weights of edges in WKG. If (2)The set of nodes KNS=(keywordl, keyword,. keywordmI WGT(Rij)<s2, the edge linking keyword and keyword; would t of keywords. be removed from WKG, where keyword keyword; e KNs and ()The set of edges RSS(PNS x KNS)U(KNS x PNS) indicates the binary relations between PNS and KNS. Suppose paper i e PNs (3)A set of connected components are produced as a result of Step keyword; e KNS and keyword; exists in PKVi, then ri E RS, where 2), with each connected component indicating one keyword rij is an edge connecting the node keyword and the node paper cluster. These keyword clusters are named"topics".We define a topic graph set TGS={TG1,……,TG,……,TGn} in which TGi rep- resents a topic, to represent all the topics generated An example of a paper-keyword relation model is shown in Fig. 3, in which the circle nodes represent research papers and the rectangle nodes represent keywords. The edges stand for the Here, we make three rules to facilitate the application of key relations between papers and keywords. word clustering. Secondly using the one-mode projection method(zanin et al 2008: Zhouet al,2007).we project paper-keyword relation models (1)Only when there are at least four keyword nodes in a cluster onto the keywords set KNS to generate weighted keyword graphs. is this keyword cluster considered to be a topic: the nodes in The one-mode projection onto keywords means a netw the clusters with keywords less than four, together with the taining only keyword nodes, in which two keyword nodes are keyword nodes with no connections, are collected into a special connected when they have at least one common neighboring paper topic, called"heterogeneous"topic. The "heterogeneous"topic node(zhou et al, 2007). In the following, we give the definition of is not included in user profiles, and therefore in the rest of this weighted keyword graphs paper the term "topic"only refers to usual keyword clusters that are generated by the keyword clustering process. Definition 2(Weighted keyword graph). A weighted keyword (2)We name the topics with the keywords which possess the graph is a graph WKG=(KNS, CR), where largest value of the sum of the weights of edges connected to them. If there is more than one keyword node with the highest (1)KNS is the node set representing keywords. For each connectivity, we choose one from them randomly keyword; E KNS. the weight of keyword; is calculated by (3)In the case that names of the clusters (topics) are the same s their immediate upper subjects' names, we would not con- Weight(keywords) WKP(keyword, paper) sider these kinds of clusters to be topics; that is, the keyword node with the highest connectivity in each cluster and the edges paperpE PS(keyword) linking to it are virtually neglected

90 X. Tang, Q. Zeng / The Journal of Systems and Software 85 (2012) 87–101 algorithm (Term Frequency-Inverse Document Frequency) (Salton and Buckley, 1988): WKP(keywordi, paperp) = tf(keywordi, paperp) · log((N/ni) + 0.01) n k=1 (tf(keywordk, paperp) · log((N/nk) + 0.01))2 , where (1) WKP(keywordi, paperp) is the weight of the keyword keywordi in paperp; (2) tf(keywordi, paperp) is the frequency of the keyword keywordi in paperp; (3) N indicates the total number of the papers; (4) nk is the frequency of the keyword keywordk in all papers. In the following discussions, we use PKVi = (WKP(keyword1, paperi), . . . ,WKP(keywordj, paperi), . . . ,WKP (keywordn, paperi)) to represent paperi, in which the quantitative value of keywordj is computed by the TF-IDF algorithm (Salton and Buckley, 1988). By computing the TF-IDF values of keywords we obtain the keyword vectors of all papers. The subject ontology we have discussed so far is still coarsely granular and thus not precise enough to use. In order to produce a more finely granular ontology, we now introduce an automatic clustering algorithm for weighted keyword graph, which includes three main steps, as follows. First, we define the paper–keyword relation model represented by a bipartite graph. A bipartite graph is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V; that is, U and V are independent sets (West, 2000). Usually the authors specify a few keywords for their papers in order to demonstrate the contents or topic of the papers. We collect the keywords and record their relations with papers to create the bipartite graphs between papers and keywords. Here, we introduce the definition of the paper–keyword relation model: Definition 1 (Paper–keyword relation model). A paper–keyword relation model is a bipartite graph PKRM = (PNS, KNS, RS), where (1) The set of nodes PNS = {paper1, paper2, ..., papern} represents a set of papers. (2) The set of nodes KNS = {keyword1, keyword2, ..., keywordm} represents a set of keywords. (3) The set of edges RS ⊆ (PNS × KNS)∪(KNS × PNS) indicates the binary relations between PNS and KNS. Suppose paperi ∈ PNS, keywordj ∈ KNS and keywordj exists in PKVi, then ri,j ∈ RS, where ri,j is an edge connecting the node keywordj and the node paperi. An example of a paper–keyword relation model is shown in Fig. 3, in which the circle nodes represent research papers and the rectangle nodes represent keywords. The edges stand for the relations between papers and keywords. Secondly, using the one-mode projection method (Zanin et al., 2008; Zhou et al., 2007), we project paper–keyword relationmodels onto the keywords set KNS to generate weighted keyword graphs. The one-mode projection onto keywords means a network containing only keyword nodes, in which two keyword nodes are connected when they have at least one common neighboring paper node (Zhou et al., 2007). In the following, we give the definition of weighted keyword graphs. Definition 2 (Weighted keyword graph). A weighted keyword graph is a graph WKG= (KNS, CR), where (1) KNS is the node set representing keywords. For each keywordi ∈ KNS, the weight of keywordi is calculated by Weight(keywordi) = paperp ∈ PS(keywordi) WKP(keywordi, paperp), where PS(keywordi) is the set of papers containing keywordi; WKP(keywordi, paperp) is the weight of keywordi in PKVp. (2) The set of edges CR ⊆ (KNS × KNS) represents the relations between different keyword nodes in KNS. (3) WGT(Ri,j) = count(PS(keywordi) ∩ PS(keywordj)) denotes the weight of Ri,j, where Ri,j is the edge linking keywordi and keywordj; PS(keywordi) is the paper set containing keywordi; the function count( ) counts the number of papers of which the keyword vectors contain both keywordi and keywordj, i.e., PS(keywordi) ∩ PS(keywordj). In the paper–keyword relation model (PKRM), if any two keyword nodes in KNS are connected to one or more paper nodes in PNS, then there is an edge connecting the two keyword nodes in CR of WKG. WGT(Ri,j) indicates the number of co-occurrences of the two keywords in papers. During the subsequent keyword clustering process WGT(Ri,j) are changed. For example, a weighted keyword graph of the subject “artifi- cial intelligence”, which is created with Pajek (Batagelj and Mrvar, 1998), is illustrated in Fig. 4. The keywords engaged are from 200 papers in the subject of “artificial intelligence”. The construction of the weighted keyword graph depends on the paper set. As the paper set is changed, so the result may be different. Finally, we cluster the weighted keyword graph to extend the subject ontology. The keyword clustering is virtually the process of increasing the edge weight threshold and finding a practical one. In Fig. 4, we can clearly see that keywords related to similar themes tend to cluster together. These keyword bundles naturally emerge and signify the tendency towards a concentration of the contents among similar papers. Through the keyword clustering process, these concentrated keyword bundles will be decomposed into separate ones. Now we present the keyword clustering procedure: (1) According to Definition 2, the weighted keyword graph WKG of a subject is generated. (2) Define a threshold ˝ for the weights of edges in WKG. If WGT(Ri,j) < ˝, the edge linking keywordi and keywordj would be removed from WKG, where keywordi, keywordj ∈ KNS and i =/ j. (3) A set of connected components are produced as a result of Step (2), with each connected component indicating one keyword cluster. These keyword clusters are named “topics”. We define a topic graph set TGS = {TG1, ..., TGi, ..., TGn} in which TGi represents a topic, to represent all the topics generated. Here, we make three rules to facilitate the application of keyword clustering. (1) Only when there are at least four keyword nodes in a cluster is this keyword cluster considered to be a topic; the nodes in the clusters with keywords less than four, together with the keyword nodes with no connections, are collected into a special topic, called “heterogeneous” topic. The “heterogeneous” topic is not included in user profiles, and therefore in the rest of this paper the term “topic” only refers to usual keyword clusters that are generated by the keyword clustering process. (2) We name the topics with the keywords which possess the largest value of the sum of the weights of edges connected to them. If there is more than one keyword node with the highest connectivity, we choose one from them randomly. (3) In the case that names of the clusters (topics) are the same as their immediate upper subjects’ names, we would not consider these kinds of clusters to be topics; that is, the keyword node with the highest connectivity in each cluster and the edges linking to it are virtually neglected.

点击进入文档下载页（PDF格式）

共15页，试读已结束，阅读完整版请下载

您可能感兴趣的文档

点击购买下载（PDF）

下载及服务说明

购买前请先查看本文档预览页，确认内容后再进行支付；
如遇文件无法下载、无法访问或其它任何问题，可发送电子邮件反馈，核实后将进行文件补发或退款等其它相关操作；
邮箱：

文档浏览记录