word keyphrases. This approach has the same limitations as Muñoz (1996), when considered as a keyphrase extraction algorithm: it produces a low-precision list of two-word phrases.

In the time since this paper was submitted for publication, Frank et al. (1999) have implemented a system, Kea, which builds on our work (Turney, 1997, 1999). It treats keyphrase extraction as a supervised learning problem, but it uses a Bayesian approach instead of a genetic algorithm approach. Their experiments indicate that Kea and GenEx have statistically equivalent levels of performance. The same group (Gutwin et al., 1999) has evaluated Kea as a component in a new kind of search engine, Keyphind, designed specially to support browsing. Their experiments suggest that certain kinds of tasks are much easier with Keyphind than with conventional search engines. The Keyphind interface is somewhat similar to the interface of Tetranet's Wisebot.
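As a concrete illustration of the Bayesian approach, the sketch below scores candidate phrases with Naive Bayes over discretized features; Frank et al. (1999) report using a phrase's TF×IDF and the relative position of its first occurrence. The discretization cutpoints, smoothing, and all names here are our own illustrative assumptions, not Kea's actual implementation.

```python
# A minimal sketch of Kea-style Naive Bayes keyphrase scoring.
# The feature set (TF x IDF, relative first occurrence) follows
# Frank et al. (1999); everything else is an illustrative assumption.

from collections import Counter

def discretize(value, cutpoints):
    """Map a continuous feature value to a discrete bin index."""
    for i, cut in enumerate(cutpoints):
        if value < cut:
            return i
    return len(cutpoints)

class NaiveBayesKeyphraseScorer:
    def __init__(self, cutpoints):
        self.cutpoints = cutpoints              # one cutpoint list per feature
        self.counts = {True: Counter(), False: Counter()}
        self.class_totals = {True: 0, False: 0}

    def train(self, examples):
        """examples: iterable of (feature_vector, is_keyphrase) pairs,
        where feature_vector is e.g. (tfidf, relative_first_position)."""
        for features, is_key in examples:
            for j, (f, cuts) in enumerate(zip(features, self.cutpoints)):
                self.counts[is_key][(j, discretize(f, cuts))] += 1
            self.class_totals[is_key] += 1

    def score(self, features):
        """Return P(keyphrase | features) with Laplace smoothing."""
        likelihood = {}
        total = sum(self.class_totals.values())
        for cls in (True, False):
            p = self.class_totals[cls] / total   # class prior
            for j, (f, cuts) in enumerate(zip(features, self.cutpoints)):
                n_bins = len(cuts) + 1
                count = self.counts[cls][(j, discretize(f, cuts))]
                p *= (count + 1) / (self.class_totals[cls] + n_bins)
            likelihood[cls] = p
        return likelihood[True] / (likelihood[True] + likelihood[False])

scorer = NaiveBayesKeyphraseScorer(cutpoints=[[0.01, 0.05], [0.1, 0.5]])
scorer.train([((0.06, 0.05), True), ((0.002, 0.7), False), ((0.003, 0.9), False)])
print(scorer.score((0.07, 0.08)))   # closer to 1.0 for keyphrase-like features
```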
Several papers explore the task of producing a summary of a document by extracting key sentences from the document (Luhn, 1958; Edmundson, 1969; Marsh et al., 1984; Paice, 1990; Paice and Jones, 1993; Johnson et al., 1993; Salton et al., 1994; Kupiec et al., 1995; Brandow et al., 1995; Jang and Myaeng, 1997). This task is similar to the task of keyphrase extraction, but it is more difficult. The extracted sentences often lack cohesion because anaphoric references are not resolved (Johnson et al., 1993; Brandow et al., 1995). Anaphors are pronouns (e.g., "it", "they"), definite noun phrases (e.g., "the car"), and demonstratives (e.g., "this", "these") that refer to previously discussed concepts. When a sentence is extracted out of the context of its neighbouring sentences, it may be impossible or very difficult for the reader of the summary to determine the referents of the anaphors. Johnson et al. (1993) attempt to automatically resolve anaphors, but their system tends to produce overly long summaries. Keyphrase extraction avoids this problem because anaphors (by their nature) are not keyphrases. Also, a list of keyphrases has no structure; unlike a list of sentences, a list of keyphrases can be randomly permuted without significant consequences.

Most of these papers on summarization by sentence extraction describe algorithms that are based on manually derived heuristics. The heuristics tend to be effective for the intended domain, but they often do not generalize well to a new domain. Extending the heuristics to a new domain involves a significant amount of manual work. A few of the papers describe learning algorithms, which can be trained by supplying documents with associated target summaries (Kupiec et al., 1995; Jang and Myaeng, 1997). Learning algorithms can be extended to new domains with less work than algorithms that use manually derived heuristics. However, there is still some manual work involved, because the training summaries must be composed of sentences that appear in the document, which means that standard author-supplied abstracts are not suitable. An advantage of keyphrase extraction is that standard author-supplied keyphrases are suitable for training a learning algorithm, because the majority of such keyphrases appear in the bodies of the corresponding documents.
Another body of related work addresses the task of information extraction. An information extraction system seeks specific information in a document, according to a predefined template. The template is specific to a given topic area. For example, if the topic area is news reports of terrorist attacks, the template might specify that the information extraction system should identify (i) the terrorist organization involved in the attack, (ii) the victims of the attack, (iii) the type of attack (kidnapping, murder, etc.), and other information of this type. ARPA has sponsored a series of Message Understanding Conferences (MUC-3, 1991; MUC-4, 1992; MUC-5, 1993; MUC-6, 1995), where information extraction systems are evaluated with corpora in various topic areas, including terrorist attacks and corporate mergers.
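As a concrete (and purely hypothetical) rendering of such a template, the record type below encodes the three slots listed above; the slot names paraphrase the text, and the real MUC templates define many more slots and controlled fill values.

```python
# A hypothetical record type for the terrorist-attack template
# sketched above; not an official MUC template definition.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TerroristAttackTemplate:
    organization: Optional[str] = None                 # (i) terrorist organization involved
    victims: List[str] = field(default_factory=list)   # (ii) victims of the attack
    attack_type: Optional[str] = None                  # (iii) kidnapping, murder, etc.

# An information extraction system fills one such template per relevant report:
report = TerroristAttackTemplate(
    organization="(name extracted from the report)",
    victims=["(victim extracted from the report)"],
    attack_type="kidnapping",
)
print(report)
```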
Most information extraction systems are manually built for a single topic area, which requires a large amount of expert labour. The highest performance at the Fifth Message Understanding Conference (MUC-5, 1993) was achieved at the cost of two years of intense programming effort. However, recent work has demonstrated that a learning algorithm can perform as well as a manually constructed system (Soderland and Lehnert, 1994). Soderland and Lehnert (1994) use decision tree induction as the learning component in their information extraction system. In Soderland and Lehnert's (1994) system, each slot in a template is handled by a group of decision trees that have been trained specially for that slot. The decision trees are based on syntactical features of the text, such as the presence of certain words.

Information extraction and keyphrase extraction are at opposite ends of a continuum that ranges from detailed, specific, and domain-dependent (information extraction) to condensed, general, and domain-independent (keyphrase extraction). The different ends of this continuum require substantially different algorithms. However, there are intermediate points on this continuum. An example is the task of identifying corporate names in business news. This task was introduced in the Sixth Message Understanding Conference (MUC-6, 1995), where it was called the Named Entity Recognition task. The best system used hand-crafted linguistic rules to recognize named entities (Krupka, 1995).
Other related work addresses the problem of automatically creating an index (Sparck Jones, 1973; Field, 1975; Fagan, 1987; Salton, 1988; Croft et al., 1991; Ginsberg, 1993; Nakagawa, 1997; Leung and Kan, 1997). Leung and Kan (1997) provide a good survey of this work. There are two general classes of indexes: indexes that are intended for human readers to browse (often called back-of-book indexes) and indexes that are intended for use with information retrieval software (search engine indexes). Search engine indexes are not suitable for human browsing, since they usually index every occurrence of every word (excluding stop words) in the document collection. Back-of-book indexes tend to be much smaller, since they index only important occurrences of interesting words and phrases. The older work on automatically creating an index (Sparck Jones, 1973; Field, 1975; Fagan, 1987) is concerned with building search engine indexes, not with building back-of-book indexes; the latter are the focus of the more recent work (Salton, 1988; Nakagawa, 1997; Leung and Kan, 1997).
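The contrast can be made concrete with a minimal sketch of a search engine (inverted) index, which records every occurrence of every non-stop word; the tokenizer and stop list below are illustrative assumptions.

```python
# A minimal sketch of a search-engine (inverted) index: every
# occurrence of every word, excluding stop words, is recorded.

import re
from collections import defaultdict

# An illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "from"}

def build_inverted_index(documents):
    """documents: dict mapping doc_id -> text.
    Returns: word -> list of (doc_id, token_position) postings,
    one posting for every occurrence of every non-stop word."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            if word not in STOP_WORDS:
                index[word].append((doc_id, pos))
    return index

docs = {"d1": "The keyphrase is extracted from the document.",
        "d2": "An index of every word in the collection."}
print(build_inverted_index(docs)["word"])   # [('d2', 4)]
# A back-of-book index, by contrast, would keep only the important
# occurrences of selected words and phrases.
```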
Since we are interested in keyphrases for human browsing, back-of-book indexes are more relevant than search engine indexes. Leung and Kan (1997) address the problem of learning to assign index terms from a controlled vocabulary. This involves building a statistical model for each index term in the controlled vocabulary. The statistical model attempts to capture the syntactic properties that distinguish documents for which the given index term is appropriate from documents for which it is inappropriate. Their results are interesting, but the use of a controlled vocabulary makes it difficult to compare their work with the algorithms we examine here. It is also worth noting that a list of controlled index terms must grow every year, as the body of literature grows, so Leung and Kan's (1997) software would need to be continuously trained.
Nakagawa (1997) automatically extracts simple and compound nouns from technical manuals, to create back-of-book indexes. Each compound noun is scored using a formula that is based on the frequency of its component nouns in the given document. In his experiments, Nakagawa (1997) evaluates his algorithm by comparing human-generated indexes to machine-generated indexes. However, Nakagawa's (1997) human-generated indexes were generated with the assistance of his algorithm, which tends to bias the results.
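The exact formula is not reproduced here; the sketch below scores a compound noun by the geometric mean of its component nouns' frequencies, which is one plausible instance of a frequency-based formula of this kind, not necessarily Nakagawa's own.

```python
# A hypothetical scorer in the spirit of Nakagawa (1997): a compound
# noun is scored from the document frequencies of its component nouns.
# The geometric mean is our illustrative choice of formula.

from collections import Counter

def score_compound_noun(components, noun_frequencies):
    """components: the component nouns of a compound noun, in order.
    noun_frequencies: Counter of noun occurrences in the document.
    Returns the geometric mean of (frequency + 1) over the components;
    the +1 keeps one unseen component from zeroing out the score."""
    product = 1.0
    for noun in components:
        product *= noun_frequencies[noun] + 1
    return product ** (1.0 / len(components))

freqs = Counter({"information": 12, "extraction": 9, "system": 4})
print(score_compound_noun(["information", "extraction", "system"], freqs))
# ~8.66: frequent components suggest an index-worthy compound noun
```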
One feature that distinguishes a back-of-book index from a keyphrase list is length. As Nakagawa (1997) observes, a document is typically assigned 10^0 to 10^1 keyphrases, but a back-of-book index typically contains 10^2 to 10^3 index terms. Also, keyphrases are usually intended to cover the whole document, but index terms are intended to cover only a small part of a document. Another distinguishing feature is that a sophisticated back-of-book index is not simply an alphabetical list of terms. There is often a hierarchical structure, where a major index term is followed by an indented list of related minor index terms.

4. The Corpora

The experiments in this paper are based on five different document collections. For each document, there is a target set of keyphrases, generated by hand. Some basic statistics for the five corpora are presented in Table 2.

Table 2: Some statistics for each of the five collections. The middle three columns report the average ± the standard deviation.

Corpus Name        Number of    Keyphrases      Words per     Words per        Percentage of Keyphrases
                   Documents    per Document    Keyphrase     Document         in Full Text
Journal Articles   75           7.5 ± 2.8       1.6 ± 0.7     10,781 ± 7,807   81.6%
Email Messages     311          4.9 ± 4.3       1.8 ± 1.0     376 ± 561        97.9%
Aliweb Web Pages   90           6.0 ± 3.0       1.2 ± 0.5     949 ± 2,603      69.0%
NASA Web Pages     141          4.7 ± 2.0       1.9 ± 0.9     466 ± 102        65.3%
FIPS Web Pages     35           9.0 ± 3.5       2.0 ± 1.1     7,025 ± 6,385    78.2%
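The last column of Table 2 can be computed by checking whether each target keyphrase occurs in the body of its document, as in the sketch below; the case-folding and exact substring matching are our own simplifying assumptions (the matching could instead use stemming, for example).

```python
# A sketch of the "Percentage of Keyphrases in Full Text" statistic.
# Exact lowercase substring matching is a simplifying assumption.

def percent_keyphrases_in_full_text(corpus):
    """corpus: iterable of (document_text, keyphrase_list) pairs.
    Returns the percentage of target keyphrases that occur verbatim
    (after case-folding) in the body of their document."""
    found = total = 0
    for text, keyphrases in corpus:
        lowered = text.lower()
        for phrase in keyphrases:
            total += 1
            if phrase.lower() in lowered:
                found += 1
    return 100.0 * found / total

toy_corpus = [("Neural networks for keyphrase extraction ...",
               ["neural networks", "keyphrase extraction", "genetic algorithms"])]
print(percent_keyphrases_in_full_text(toy_corpus))   # ~66.7 for this toy example
```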
For the Journal Articles corpus, we selected 75 journal articles from five different journals. Three of the journals are about cognition (Psycoloquy, The Neuroscientist, Behavioral & Brain Sciences Preprint Archive), one is about the hotel industry (Journal of the International Academy of Hospitality Research), and one is about chemistry (Journal of Computer-Aided Molecular Design). The full text of every article is available on the web. The authors supplied keyphrases for their articles. For the Email Messages corpus, we collected 311 email messages from six NRC employees. A university student created the keyphrases for the messages. For the Aliweb corpus, we collected 90 web pages using the Aliweb search engine, a public search engine provided by NEXOR.6 Aliweb has a web-based fill-in form, with a field for keyphrases, where people may enter URLs to add to Aliweb. The keyphrases are stored in the Aliweb index, along with the URLs. For the NASA corpus, we collected 141 web pages from NASA's Langley Research Center.7 Each page includes a list of keyphrases. For the FIPS corpus, we gathered 35 web pages from the US government's Federal Information Processing Standards (FIPS).8 Each document includes a list of keyphrases.

We would like our learning algorithms to be able to perform well even when the testing data are significantly different from the training data. In a real-world application, it would be inconvenient if the learning algorithm required re-training for each new type of document. Therefore our experiments do not use a random split of the documents into training and testing sets. Instead, we designed the experiments to test the ability of the learning algorithms to

6. The URL is http://www.nexor.com/public/aliweb/search/doc/form.html.
7. Available at http://tag-www.larc.nasa.gov/tops/tops_text.html.
8. Available at http://www.itl.nist.gov/div897/pubs/