generalize to new data. In our preliminary experiments, we found that the main factor influencing generalization was the average length of the documents in the training set, compared to the testing set. In a real-world application, it would be reasonable to have two different learned models, one for short documents and one for long documents. As Table 3 shows, we selected part of the Journal Article corpus to train the learning algorithms to handle long documents and part of the Email Message corpus to train the learning algorithms to handle short documents. During testing, we used the training corpus that was most similar to the given testing corpus, with respect to document lengths.

Table 3: The correspondence between testing and training data.

    Testing Corpus                       Number of    Corresponding Training Corpus        Number of
                                         Documents                                         Documents
    Journal Articles (Testing Subset)        20       Journal Article (Training Subset)        55
    Email Messages (Testing Subset)          76       Email Messages (Training Subset)        235
    Aliweb Web Pages                         90       Email Messages (Training Subset)        235
    NASA Web Pages                          141       Email Messages (Training Subset)        235
    FIPS Web Pages                           35       Journal Article (Training Subset)        55

The method for applying C4.5 to keyphrase extraction (Section 5) and the GenEx algorithm (Section 7) were developed using only the training subsets of the Journal Article and Email Message corpora (Table 3). The other three corpora were acquired only after the design of the method for applying C4.5 and the design of the GenEx algorithm were complete. This practice ensures that there is no risk that C4.5 and GenEx have been tuned to the testing data.
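To make the corpus-matching rule concrete, the following Python sketch selects between a long-document model and a short-document model by comparing average document lengths. All names and the example average lengths here are hypothetical illustrations, not the paper's code; the paper simply pairs each testing corpus with the training corpus whose documents are most similar in average length, as recorded in Table 3.

    # A minimal sketch of the corpus-matching rule described above.
    # The model names and average lengths are hypothetical.

    def average_length(corpus):
        """Mean document length in words."""
        return sum(len(doc.split()) for doc in corpus) / len(corpus)

    def choose_model(testing_corpus, models_by_training_length):
        """Pick the learned model whose training corpus has the
        average document length closest to the testing corpus."""
        test_length = average_length(testing_corpus)
        closest = min(models_by_training_length,
                      key=lambda train_len: abs(train_len - test_length))
        return models_by_training_length[closest]

    # Hypothetical average lengths: long journal articles versus
    # short email messages.
    models = {8000: "journal_article_model", 300: "email_message_model"}
    web_pages = ["a short web page ...", "another short page ..."]
    print(choose_model(web_pages, models))  # -> email_message_model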
5. Applying C4.5 to Keyphrase Extraction

In the first set of experiments, we used the C4.5 decision tree induction algorithm (Quinlan, 1993) to classify phrases as positive or negative examples of keyphrases. In this section, we describe the feature vectors, the settings we used for C4.5's parameters, the bagging procedure, and the method for sampling the training data.

The task of supervised learning is to learn how to assign cases (or examples) to classes. For keyphrase extraction, a case is a candidate phrase, which we wish to classify as a positive or negative example of a keyphrase. We classify a case by examining its features. A feature can be any property of a case that is relevant for determining the class of the case. C4.5 can handle real-valued features, integer-valued features, and features with values that range over an arbitrary, fixed set of symbols. C4.5 takes as input a set of training data, in which cases are represented as feature vectors. In the training data, a teacher must assign a class to each feature vector (hence supervised learning). C4.5 generates as output a decision tree that models the relationships among the features and the classes (Quinlan, 1993).

A decision tree is a rooted tree in which the internal vertices are labelled with tests on feature values and the leaf vertices are labelled with classes. The edges that leave an internal vertex are labelled with the possible outcomes of the test associated with that vertex. For example, a feature might be “the number of words in the given phrase”, and a test on a feature value might be “the number of words in the given phrase is less than two”, which can have the outcomes “true” or “false”. A case is classified by beginning at the root of the tree and following a path to a leaf, based on the values of the features of the case. The label on the leaf is the predicted class for the given case.

We converted the documents into sets of feature vectors by first making a list of all phrases of one, two, or three consecutive non-stop words that appear in a given document with no intervening punctuation. We used the Iterated Lovins stemmer to find the stemmed form of each of these phrases. For each unique stemmed phrase, we generated a feature vector, as described in Table 4 (this candidate-generation step is sketched in code below).

C4.5 has access to nine features (features 3 to 11) when building a decision tree. The leaves of the tree predict class (feature 12). When a decision tree predicts that the class of a vector is 1, then the phrase whole_phrase is a keyphrase, according to the tree. This
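The candidate-generation step described above can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the paper's actual code: the stop word list is a tiny stand-in for a full one, and stem is a crude placeholder for the Iterated Lovins stemmer, which performs far more thorough, iterated suffix stripping.

    import re

    # Tiny stand-ins for the paper's resources: the real system uses a
    # full stop word list and the Iterated Lovins stemmer.
    STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is",
                  "for", "as", "such"}

    def stem(word):
        """Crude placeholder; NOT the Iterated Lovins stemmer."""
        return re.sub(r"(ing|ed|es|s)$", "", word.lower())

    def candidate_phrases(text, max_words=3):
        """Return one surface form per unique stemmed phrase: all runs
        of one to max_words consecutive non-stop words that contain no
        intervening punctuation."""
        # Splitting at punctuation guarantees that no candidate phrase
        # spans a punctuation mark.
        fragments = re.split(r"[^\w\s]+", text)
        candidates = {}
        for fragment in fragments:
            words = fragment.split()
            for i in range(len(words)):
                for n in range(1, max_words + 1):
                    phrase = words[i:i + n]
                    if len(phrase) < n:
                        break
                    # A stop word anywhere in the window disqualifies it:
                    # candidates are runs of consecutive non-stop words.
                    if any(w.lower() in STOP_WORDS for w in phrase):
                        continue
                    stemmed = " ".join(stem(w) for w in phrase)
                    # Keep one surface form per unique stemmed phrase;
                    # each entry would become one feature vector.
                    candidates.setdefault(stemmed, " ".join(phrase))
        return candidates

    print(candidate_phrases(
        "Decision trees classify candidate phrases, such as keyphrases."))

In this sketch, each key of the returned dictionary corresponds to one feature vector in the training or testing data, matching the one-vector-per-unique-stemmed-phrase scheme described above.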