当前位置：和泉文库 > 计算机 > 浏览文档

《电子商务 E-business》阅读文献：Recommending Related Articles in Wikipedia via a Topic-Based Model

文件格式：PDF，文件大小：161.62KB，售价：3元

文档详细内容（约10页）

Some previous works have identified this problem as the missing link problem and also proposed some methods for automatically generating links to related articles. J. Voss [Vo05] presented an analysis of wikipedia snapshot on March 2005. The study showed that Wikipedia links form a scale-free network and the distribution of in-degree and out degree of wikipedia pages follows a power law. S. Fissaha Adafre and M. de rijke [FROS] presented an automated approach in finding related pages by exploring potential links in a wiki page. They proposed a method of discovering missing links in Wikipedia pages via a clustering approach. The clustering process is performed by grouping topically related pages using LTRank and then performing identification of link candidates by matching the anchor texts. Cosley et al. [Co07] presented Suggest Bot, software that performs intelligent task routing (matching people with tasks)in Wikipedia. Suggest Bot uses broadly applicable strategies of text analysis, collaborative filtering, and hyperlink following to recommend tasks In this paper, we propose a method for recommending related articles in wikipedia based on the Latent Dirichlet Allocation(LDA)algorithm. We adopt the dot product omputation for calculating the similarity between two topic distributions which represent two different articles. Using the proposed approach, we can find the relation between two articles and use this relation to recommend links for each article. The rest of paper is organized as follows. In Section 2, we describe the topic-based mode for article recommendation. Section 3 presents experiments and discussion. Finally, we conclude our work and put forward the directions of our future work in Section 4 2 The Topic-Based Model for Article Recommendation There have been many studies on discovering latent topics from text collections [SG06 Latent Semantic Analysis(LSA)uses singular value decomposition(SVD) to map high- dimensional term-by-document matrix to a lower dimensional representation called latent semantic space ] However, SVD is actually designed for normally distributed data. Such a distribution is inappropriate for count data which is what a term- by-document matrix consists of. LSA has been applied to a wide variety of learning tasks, such as search and retrieval [De90] and classification [Bi08]. Although LSa have achieved important success but Lsa have some drawbacks such as overfitting and inappropriate generative semantics [BNJ031

Some previous works have identified this problem as the missing link problem and also proposed some methods for automatically generating links to related articles. J. Voss [Vo05] presented an analysis of Wikipedia snapshot on March 2005. The study showed that Wikipedia links form a scale-free network and the distribution of in-degree and outdegree of Wikipedia pages follows a power law. S. Fissaha Adafre and M. de Rijke [FR05] presented an automated approach in finding related pages by exploring potential links in a wiki page. They proposed a method of discovering missing links in Wikipedia pages via a clustering approach.The clustering process is performed by grouping topically related pages using LTRank and then performing identification of link candidates by matching the anchor texts. Cosley et al. [Co07] presented SuggestBot, software that performs intelligent task routing (matching people with tasks) in Wikipedia. SuggestBot uses broadly applicable strategies of text analysis, collaborative filtering, and hyperlink following to recommend tasks. In this paper, we propose a method for recommending related articles in Wikipedia based on the Latent Dirichlet Allocation (LDA) algorithm. We adopt the dot product computation for calculating the similarity between two topic distributions which represent two different articles. Using the proposed approach, we can find the relation between two articles and use this relation to recommend links for each article. The rest of paper is organized as follows. In Section 2, we describe the topic-based mode for article recommendation. Section 3 presents experiments and discussion. Finally, we conclude our work and put forward the directions of our future work in Section 4. 2 The Topic-Based Model for Article Recommendation There have been many studies on discovering latent topics from text collections [SG06]. Latent Semantic Analysis (LSA) uses singular value decomposition (SVD) to map highdimensional term-by-document matrix to a lower dimensional representation called latent semantic space [De90]. However, SVD is actually designed for normallydistributed data. Such a distribution is inappropriate for count data which is what a termby-document matrix consists of. LSA has been applied to a wide variety of learning tasks, such as search and retrieval [De90] and classification [Bi08]. Although LSA have achieved important success but LSA have some drawbacks such as overfitting and inappropriate generative semantics [BNJ03]. 196

Due to the drawbacks of the LSA, the Latent Dirichlet Allocation (LDA)has been introduced as a generative probabilistic model for a set of documents [BNJ03]. The basic idea behind this approach is that documents are represented as random mixtures over latent topics. Each topic is represented by a probability distribution over the terms. Each article is represented by a probability distribution over the topics. LDA has also been applied for identification of topics in a number of different areas. For example, LDa has been used to find scientific topics from abstracts of papers published in the proceedings of the national academy of sciences [GS04]. McCallum et al. [MCO5] proposed an LDA- based approach to extract topics from social networks and applied it to a collection of 250,000 Enron emails. Newman et al.(2006)applied LDa to derive 400 topics such as Basketball, Harry Potter and Holidays from a corpus of 330,000 New York Times news articles and represent each news article as a mixture of these topics [Ne06] Haruechaiyasak and Damrongrat [HDo8] applied the LDa algorithm for recommending elated articles in Wikipedia Selection for Schools, however, without providing any comparative evaluation 6 Figure 1: The Latent Dirichlet Allocation(LDA)model Generally, an LDA model can be represented as a probabilistic graphical model as shown in Figure 2 [BNJ03]. There are three levels to the LDA representation. The variables a and P are the corpus-level parameters, which are assumed to be sampled during the process of generating a corpus. u is the parameter of the uniform Dirichlet prior on the per-document topic distributions. b is the parameter of the uniform Dirichlet prior on the per-topic word distribution. 8 is a document-level variable, sampled once per document. Finally, the variables z and w are word-level variables and are sampled once for each word in each document. The variable N is the number of word tokens in a document and variable m is the number of documents The LDA model [BNJ03] introduces a set of K latent variables, called topics. Each word in the document is assumed to be generated by one of the topics. The generative process for each document w can be described as follows

Due to the drawbacks of the LSA, the Latent Dirichlet Allocation (LDA) has been introduced as a generative probabilistic model for a set of documents [BNJ03]. The basic idea behind this approach is that documents are represented as random mixtures over latent topics. Each topic is represented by a probability distribution over the terms. Each article is represented by a probability distribution over the topics. LDA has also been applied for identification of topics in a number of different areas. For example, LDA has been used to find scientific topics from abstracts of papers published in the proceedings of the national academy of sciences [GS04]. McCallum et al. [MC05] proposed an LDAbased approach to extract topics from social networks and applied it to a collection of 250,000 Enron emails. Newman et al. (2006) applied LDA to derive 400 topics such as Basketball, Harry Potter and Holidays from a corpus of 330,000 New York Times news articles and represent each news article as a mixture of these topics [Ne06]. Haruechaiyasak and Damrongrat [HD08] applied the LDA algorithm for recommending related articles in Wikipedia Selection for Schools, however, without providing any comparative evaluation. Figure 1: The Latent Dirichlet Allocation (LDA) model Generally, an LDA model can be represented as a probabilistic graphical model as shown in Figure 2 [BNJ03]. There are three levels to the LDA representation. The variables α and β are the corpus-level parameters, which are assumed to be sampled during the process of generating a corpus. α is the parameter of the uniform Dirichlet prior on the per-document topic distributions. β is the parameter of the uniform Dirichlet prior on the per-topic word distribution. θ is a document-level variable, sampled once per document. Finally, the variables z and w are word-level variables and are sampled once for each word in each document. The variable N is the number of word tokens in a document and variable M is the number of documents. The LDA model [BNJ03] introduces a set of K latent variables, called topics. Each word in the document is assumed to be generated by one of the topics. The generative process for each document w can be described as follows: 197

点击进入文档下载页（PDF格式）

共10页，试读已结束，阅读完整版请下载

您可能感兴趣的文档

点击购买下载（PDF）

下载及服务说明

购买前请先查看本文档预览页，确认内容后再进行支付；
如遇文件无法下载、无法访问或其它任何问题，可发送电子邮件反馈，核实后将进行文件补发或退款等其它相关操作；
邮箱：

文档浏览记录