generalize to new data. In our preliminary experiments, we found that the main factor influencing generalization was the average length of the documents in the training set, compared to the testing set. In a real-world application, it would be reasonable to have two different learned models, one for short documents and one for long documents. As Table 3 shows, we selected part of the Journal Article corpus to train the learning algorithms to handle long documents and part of the Email Message corpus to train the learning algorithms to handle short documents. During testing, we used the training corpus that was most similar to the given testing corpus, with respect to document lengths.

Table 3: The correspondence between testing and training data.

    Testing Corpus                       Number of    Corresponding Training Corpus        Number of
                                         Documents                                         Documents
    Journal Articles (Testing Subset)        20       Journal Article (Training Subset)        55
    Email Messages (Testing Subset)          76       Email Messages (Training Subset)        235
    Aliweb Web Pages                         90       Email Messages (Training Subset)        235
    NASA Web Pages                          141       Email Messages (Training Subset)        235
    FIPS Web Pages                           35       Journal Article (Training Subset)        55

The method for applying C4.5 to keyphrase extraction (Section 5) and the GenEx algorithm (Section 7) were developed using only the training subsets of the Journal Article and Email Message corpora (Table 3). The other three corpora were acquired only after the design of the method for applying C4.5 and the design of the GenEx algorithm were complete. This practice ensures that there is no risk that C4.5 and GenEx have been tuned to the testing data.
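To make the corpus-matching rule concrete, the following Python sketch selects between a long-document model and a short-document model by comparing average document lengths. All names and the example average lengths here are hypothetical illustrations, not the paper's code; the paper simply pairs each testing corpus with the training corpus whose documents are most similar in average length, as recorded in Table 3.

    # A minimal sketch of the corpus-matching rule described above.
    # The model names and average lengths are hypothetical.

    def average_length(corpus):
        """Mean document length in words."""
        return sum(len(doc.split()) for doc in corpus) / len(corpus)

    def choose_model(testing_corpus, models_by_training_length):
        """Pick the learned model whose training corpus has the
        average document length closest to the testing corpus."""
        test_length = average_length(testing_corpus)
        closest = min(models_by_training_length,
                      key=lambda train_len: abs(train_len - test_length))
        return models_by_training_length[closest]

    # Hypothetical average lengths: long journal articles versus
    # short email messages.
    models = {8000: "journal_article_model", 300: "email_message_model"}
    web_pages = ["a short web page ...", "another short page ..."]
    print(choose_model(web_pages, models))  # -> email_message_model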
5. Applying C4.5 to Keyphrase Extraction

In the first set of experiments, we used the C4.5 decision tree induction algorithm (Quinlan, 1993) to classify phrases as positive or negative examples of keyphrases. In this section, we describe the feature vectors, the settings we used for C4.5's parameters, the bagging procedure, and the method for sampling the training data.

The task of supervised learning is to learn how to assign cases (or examples) to classes. For keyphrase extraction, a case is a candidate phrase, which we wish to classify as a positive or negative example of a keyphrase. We classify a case by examining its features. A feature can be any property of a case that is relevant for determining the class of the case. C4.5 can handle real-valued features, integer-valued features, and features with values that range over an arbitrary, fixed set of symbols. C4.5 takes as input a set of training data, in which cases are represented as feature vectors. In the training data, a teacher must assign a class to each feature vector (hence supervised learning). C4.5 generates as output a decision tree that models the relationships among the features and the classes (Quinlan, 1993).

A decision tree is a rooted tree in which the internal vertices are labelled with tests on feature values and the leaf vertices are labelled with classes. The edges that leave an internal vertex are labelled with the possible outcomes of the test associated with that vertex. For example, a feature might be “the number of words in the given phrase”, and a test on a feature value might be “the number of words in the given phrase is less than two”, which can have the outcomes “true” or “false”. A case is classified by beginning at the root of the tree and following a path to a leaf, based on the values of the features of the case. The label on the leaf is the predicted class for the given case.

We converted the documents into sets of feature vectors by first making a list of all phrases of one, two, or three consecutive non-stop words that appear in a given document with no intervening punctuation. We used the Iterated Lovins stemmer to find the stemmed form of each of these phrases. For each unique stemmed phrase, we generated a feature vector, as described in Table 4 (this candidate-generation step is sketched in code below).

C4.5 has access to nine features (features 3 to 11) when building a decision tree. The leaves of the tree predict class (feature 12). When a decision tree predicts that the class of a vector is 1, then the phrase whole_phrase is a keyphrase, according to the tree. This
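The candidate-generation step described above can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the paper's actual code: the stop word list is a tiny stand-in for a full one, and stem is a crude placeholder for the Iterated Lovins stemmer, which performs far more thorough, iterated suffix stripping.

    import re

    # Tiny stand-ins for the paper's resources: the real system uses a
    # full stop word list and the Iterated Lovins stemmer.
    STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is",
                  "for", "as", "such"}

    def stem(word):
        """Crude placeholder; NOT the Iterated Lovins stemmer."""
        return re.sub(r"(ing|ed|es|s)$", "", word.lower())

    def candidate_phrases(text, max_words=3):
        """Return one surface form per unique stemmed phrase: all runs
        of one to max_words consecutive non-stop words that contain no
        intervening punctuation."""
        # Splitting at punctuation guarantees that no candidate phrase
        # spans a punctuation mark.
        fragments = re.split(r"[^\w\s]+", text)
        candidates = {}
        for fragment in fragments:
            words = fragment.split()
            for i in range(len(words)):
                for n in range(1, max_words + 1):
                    phrase = words[i:i + n]
                    if len(phrase) < n:
                        break
                    # A stop word anywhere in the window disqualifies it:
                    # candidates are runs of consecutive non-stop words.
                    if any(w.lower() in STOP_WORDS for w in phrase):
                        continue
                    stemmed = " ".join(stem(w) for w in phrase)
                    # Keep one surface form per unique stemmed phrase;
                    # each entry would become one feature vector.
                    candidates.setdefault(stemmed, " ".join(phrase))
        return candidates

    print(candidate_phrases(
        "Decision trees classify candidate phrases, such as keyphrases."))

In this sketch, each key of the returned dictionary corresponds to one feature vector in the training or testing data, matching the one-vector-per-unique-stemmed-phrase scheme described above.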