E-Business course reading: A Paper Recommendation Mechanism for the Research Support System Papits.pdf

A Paper Recommendation Mechanism for the Research Support System Papits

Satoshi Watanabe, Takayuki Ito, Tadachika Ozono and Toramatsu Shintani
Graduate School of Engineering, Nagoya Institute of Technology
Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555 Japan
{watanabe, itota, ozono, tora}@ics.nitech.ac.jp

Abstract

We have developed Papits, a research support system that shares research information, such as PDF files of research papers, among computers on a network and classifies the information into research types. Papits users can share various research information and survey the corpora of their particular fields. To develop Papits, we need to design a mechanism to identify a user’s interest. Also, when constructing an effective paper recommendation system, it is important to carefully create the user’s model. We propose a method to construct the user’s model using a scale-free network. A scale-free network has vertices and edges, and grows by ‘preferential attachment’. Our method applies a paper viewing history to construct a scale-free network based on word co-occurrence. The constructed network consists of vertices that represent words and edges that represent word co-occurrences. In our method, a paper is added to the network as indicated by a user’s paper viewing history. Additionally, we define the ‘topic weight’, which we calculate from two elements: the topic frequency and the topic recency. We measure the topic frequency by using the word co-occurrence in a database, and the topic recency by using the Jaccard coefficient. Our results indicate that our method can effectively recommend documents for Papits users.

1. Introduction

As information technology becomes an indispensable part of our daily life, a huge amount of information is shared throughout the world. The speed and volume of this sharing have accelerated with the advent of the Internet, and users are becoming overloaded with information.
With such a mix of valid and noisy information, we need tools to identify useful information or knowledge that meets the demands of the individual user. So, we have developed a research support system called Papits¹ [8][18][13]. Papits has several functions that allow it to manage research information: paper sharing, paper recommending, paper retrieving, paper classifying and a research diary. The paper sharing function facilitates the sharing of research information, such as the PDF files of research papers, and the collection of papers from Web sites. The recommendation function constructs a user’s model to determine a user’s research interests and specialties. This model is constructed by analyzing the research papers that a user has viewed, which enables Papits to recommend papers based on the user’s interests. The recommendation in Papits gradually improves in accuracy as the users’ paper viewing history accumulates.

This paper focuses on paper recommendation. One of the main problems associated with recommendation is how to reduce information overload and realize a precise and accurate recommendation. In conventional research on natural language processing, the TF-IDF method [21] and similar methods have been used to weight words for searching or summarization. The TF-IDF method is also often used to calculate the importance of documents and the similarities between them. The calculation, however, does not take the differences between users’ interests into account. As each user’s interest is different, the weight of a word is different for every user. We proposed and applied to Papits a recommendation mechanism that uses the user’s paper viewing history to reflect the user’s interests or specialties.

By using a recommendation mechanism, we can discover several papers in various databases, but each paper can be classified according to the following characteristics.

(1) Does the paper have an important fact or not?
(2) Does the paper have a novel fact or a known fact?
(3) Does the paper have a fact that is interesting to the user at the present moment?

When constructing a user’s model, ideally we would like to discover papers that are important, novel, and of interest to the user. Conventional recommendation mechanisms mainly deal with characteristic (1), the importance of papers, for example by using a statistical approach.

¹ PAPer Information Tailor System
Proceedings of the 2005 International Workshop on Data Engineering Issues in E-Commerce (DEEC’05) 0-7695-2401-X/05 $20.00 © 2005 IEEE

Some mechanisms rank papers by using the precision and recall
value of each rule. However, it is not easy to deal with the other characteristics, as the novelty and significance of a paper to the user may change over time.

To deal with characteristics (2) and (3), we utilized the user’s paper viewing history. This allowed us to check whether or not a paper is novel. Moreover, this monitoring also enabled us to determine a user’s preferences and interests and to check whether or not a paper is of interest at the present moment. Additionally, we define the ‘topic model’, which has two elements: the topic frequency and the topic recency. We measure the topic frequency by using the word co-occurrence in a database, and the topic recency by using the Jaccard coefficient.

First, we describe the recommendation algorithm for managing research papers. Second, we outline the Papits research support system. Third, we discuss the results obtained with our algorithm and show its usefulness. Fourth, we compare our work with similar research. Finally, we conclude with a brief summary.

2. Paper recommendation mechanism

When constructing the user’s model, we used the scale-free network to measure the frequency and recency of words. The scale-free network has the characteristics of ‘growth’ and ‘preferential attachment’. We also use the ‘fitness’ of vertices in the network. The fitness of a vertex affects the probability that a vertex newly added to the network attaches to it. So, when constructing a network, we regard the papers that a user possesses as reflecting the user’s interests and specialities. We represented the words contained in those papers as the vertices of the network, and word co-occurrences as its edges.

2.1. Scale-free network

The scale-free network results from two generic mechanisms: (i) networks continuously expand with the addition of new vertices, and (ii) new vertices preferentially attach to vertices that are already well connected.
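The two mechanisms named above, growth and preferential attachment, can be illustrated with a small simulation. The following is a minimal sketch and not the authors' implementation: the function name `grow_network`, the parameters `m0` (number of initial vertices), `m` (edges added per new vertex) and `steps`, and the uniform fallback used while no edge exists yet are all assumptions made for illustration.

```python
import random


def grow_network(m0, m, steps, seed=0):
    """Grow a network by preferential attachment and return the degree list.

    Hypothetical helper for illustration: starts from m0 vertices with no
    edges and adds one vertex per step, linking it to m distinct existing
    vertices chosen with probability proportional to their degree.
    """
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    degree = [0] * m0
    # 'stubs' lists every vertex once per edge end, so a uniform draw from it
    # picks vertex i with probability k_i / sum_j k_j.
    stubs = []
    for _ in range(steps):
        new = len(degree)
        existing = list(range(new))
        targets = set()
        while len(targets) < m:
            # Assumption: while all degrees are zero, choose uniformly.
            pool = stubs if stubs else existing
            targets.add(rng.choice(pool))
        degree.append(0)
        for t in targets:
            degree[t] += 1
            degree[new] += 1
            stubs.extend((t, new))
    return degree


if __name__ == "__main__":
    degrees = grow_network(m0=3, m=2, steps=1000)
    # A few early vertices typically collect far more edges than the rest.
    print("max degree:", max(degrees), "median degree:", sorted(degrees)[len(degrees) // 2])
```

Keeping one list entry per edge end is a common way to draw a vertex in proportion to its degree without recomputing the normalising sum at every step.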
The scale-free network concept is as follows:

(i) Initially, the network has $m_0$ vertices and no edges (Figure 1 A). In Figure 1, ◦ denotes an added vertex and • denotes an existing vertex. The network grows sequentially, from A to B and from B to C in Figure 1, at every step $\tau$. Each added vertex attaches to $m$ existing vertices, and this process is called ‘growth’ in the scale-free network.

(ii) As the network adds a new vertex ($i = \tau + m_0$), the new vertex preferentially selects vertices that are already well connected from among the existing vertices ($i = 0, 1, \ldots, \tau - 1 + m_0$). The probability that a vertex attaches to another vertex $i$ is proportional to the number of edges $k_i$ that vertex $i$ has [3][1]:

$$\Pi_i = \Pi(k_i) \equiv \frac{k_i}{\sum_{j=0}^{\tau-1+m_0} k_j} \qquad (0 \le i < \tau + m_0) \tag{1}$$

Figure 1. A conception of scale-free network

The probability $P(k)$ that a vertex in the network interacts with $k$ other vertices decays as a power law, following $P(k) \sim k^{-\gamma}$.

The collaboration graph of movie actors represents a well-documented example of a social network. Each actor is represented by a vertex, two actors being connected if they were cast together in the same movie. The probability that an actor has $k$ links (characterizing his or her popularity) has a power-law tail for large $k$, following $P(k) \sim k^{-\gamma_{actor}}$, where $\gamma_{actor} = 2.3 \pm 0.1$. A more complex network with over 800 million vertices is the WWW, where a vertex is a document and the edges are the links pointing from one document to another. The topology of this graph determines the Web’s connectivity and, consequently, our effectiveness in locating information on the WWW. Information about $P(k)$ can be obtained using robots, indicating that the probability that $k$ documents point to a certain Web page follows a power law, with $\gamma_{www} = 2.1 \pm 0.1$ [3].

Real networks have a competitive aspect, as each node has an intrinsic ability to compete for edges at the expense of other vertices [10].
The authors of [10] propose a model in which each node is assigned a fitness parameter $\eta_i$ which does not change in time. Thus, at every time step a new node $j$ with a fitness $\eta_j$ is added to the system, where $\eta_j$ is chosen from a
distribution $\rho(\eta)$. Each new vertex connects with $m$ edges to the vertices already in the network, and the probability of connecting to a vertex $i$ is proportional to the degree and the fitness of vertex $i$:

$$\Pi_i = \frac{\eta_i k_i}{\sum_j \eta_j k_j} \tag{2}$$

It is well known that the more frequent a word, the more available it is for production and comprehension processes. This phenomenon is known as the frequency (referring to the individual’s whole experience) or recency (referring to the individual’s recent experience) effect. This phenomenon shows that preferential attachment is very likely to shape the scale-free distribution of degrees [7].

To deal with characteristics (2) and (3), we regarded the words in the papers that a user possesses as the vertices of the scale-free network, and word co-occurrences as the edges. We also calculated the frequency and the fitness of words from Equation 2. Tracking the user’s interests or specialities over time, we considered that they are determined by the words that frequently appear in the paper viewing history and the words that appeared most recently. Namely, we represent the frequency in the network as the user’s longer-term interest and the fitness in the network as the user’s shorter-term interest.

2.2. Construction of user’s model

This section outlines the construction of the user’s model based on the user’s paper viewing history. Our method uses the papers that a user possesses to construct a network based on word frequency and word co-occurrences. The process of our method is as follows:

Step 1 Papers are written in natural language and require modification before processing. The most frequent terms, such as ‘a’ and ‘it’, are considered to be common and meaningless [14]. For this reason, we first remove the stopwords used in the SMART system [21].

Step 2 Based on the assumption that terms with a common stem usually have similar meanings, various suffixes such as -ED, -ING, -ION and -IONS are removed to produce the stem word.
For example, PLAY, PLAYS, PLAYED and PLAYING are translated into PLAY. Our method employed Porter’s suffix stripping algorithm [19].

Step 3 Our method continuously adds words and word co-occurrences to the network. As previously mentioned, words are the network vertices and word co-occurrences are the network edges. If the words or the word co-occurrences have already been added to the network, they are not repeated.

Figure 2. A creation procedure of user’s model

The construction of the user’s model [22] can be seen in Figure 2. The user model generation mechanism takes the papers that a user possesses, eliminates stopwords, performs stemming as preprocessing, and constructs or adjusts the user’s model. The comparison and selection mechanism can also be seen in Figure 2: the constructed user’s model is compared to the papers in the Papits database. By comparing the user’s paper viewing history to the user’s model, Papits can recommend papers which are of interest to the user.

Figure 3 represents the user’s model made from a paper [7] as the preprocessed network. Each vertex in Figure 3 is drawn as a square containing the word, and each edge is drawn as a line between two vertices. The square at the core is a frequent word, which represents the user’s core word.

For example, we measured the frequency and fitness of words using a paper [7] (hereinafter called “paper A”), a paper [6] (hereinafter called “paper B”), and a paper [4] (hereinafter called “paper C”). Both paper A and paper B describe language and networks, and paper C describes networks. Table 1 shows a list of the top ten most frequent words, with the frequency and fitness of each word. The fitness I values in Table 1 are calculated in the order paper A, paper B, paper C. The fitness II values in Table 1 are calculated in the order paper A, paper C, paper B.
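Steps 1 to 3 of Section 2.2 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the Papits implementation: the tiny `STOPWORDS` set stands in for the SMART stopword list, the `stem` function is a crude suffix stripper standing in for Porter's algorithm, and treating two words as co-occurring when they share a sentence is an assumption, since the text does not define the co-occurrence window. All function and variable names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a tiny illustrative list; the paper uses the SMART stopword list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}
# Assumption: a crude stand-in for Porter's suffix stripping algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Step 1 (drop stopwords) and Step 2 (reduce words to stems)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def add_paper(frequency, edges, text):
    """Step 3: add a paper's words (vertices) and co-occurrences (edges)."""
    for sentence in re.split(r"[.!?]", text):
        stems = preprocess(sentence)
        frequency.update(stems)  # word frequency per vertex
        # Edges live in a set, so a repeated co-occurrence is not added twice.
        for u, v in combinations(sorted(set(stems)), 2):
            edges.add((u, v))


if __name__ == "__main__":
    frequency, edges = Counter(), set()
    add_paper(frequency, edges, "Words are connected in networks. The network is growing.")
    print(frequency.most_common(3))
    print(sorted(edges))
```

Calling `add_paper` once per paper, in the order of the viewing history, grows the same `frequency` and `edges` structures incrementally, which mirrors how the text describes papers being added to the network.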
From the point of view of the fitness values in Table 1, if different users read the same papers at different times, and thus in a different order, different values are calculated. Thus, the latest paper that a user reads changes the fitness.

An interface for paper recommendation using Papits can be seen in Figure 4. Inside the bold line in Figure 4 is the paper recommendation with title, authors, and paper rele-