NRC Publications Archive (NPArC)
Archives des publications du CNRC (NPArC)

Learning Algorithms for Keyphrase Extraction
Turney, Peter D.

Web page:
http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=8913713&lang=en
http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=8913713&lang=fr

Access and use of this website and the material on it are subject to the Terms and Conditions set forth at http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/jsp/nparc_cp.jsp?lang=en (French version: http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/jsp/nparc_cp.jsp?lang=fr).
READ THESE TERMS AND CONDITIONS CAREFULLY BEFORE USING THIS WEBSITE.

Contact us: nparc.cisti@nrc-cnrc.gc.ca
National Research Council Canada, Institute for Information Technology
Conseil national de recherches Canada, Institut de Technologie de l’information

Learning Algorithms for Keyphrase Extraction*
P. Turney
July 2000

*Published in J. Information Retrieval, 2(4): 303-336; 2000. NRC 44105

Copyright 2001 by National Research Council of Canada. Permission is granted to quote short excerpts and to reproduce figures and tables from this report, provided that the source of such material is fully acknowledged.
Submitted to Information Retrieval, INRT 34-99, September 8, 1999
© 1999 National Research Council Canada

Learning Algorithms for Keyphrase Extraction

Peter D. Turney
Institute for Information Technology
National Research Council of Canada
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@iit.nrc.ca
Phone: 613-993-8564
Fax: 613-952-7151

Abstract

Many academic journals ask their authors to provide a list of about five to fifteen keywords to appear on the first page of each article. Since these keywords are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by Extractor suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.

Keyphrases: machine learning, summarization, indexing, keywords, keyphrase extraction
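The supervised framing in the abstract (a document as a set of phrases, each a positive or negative example) can be illustrated with a minimal sketch. The candidate generator below, which takes every word n-gram of up to three words, is an assumption made for illustration only and is not the paper's phrase-selection procedure; the function names `candidate_phrases` and `label_phrases` are likewise hypothetical.

```python
import re

def candidate_phrases(text, max_words=3):
    """Return all lowercase word n-grams of length 1..max_words.

    Assumption: plain n-grams stand in for candidate phrases; in this
    sketch they may cross sentence boundaries.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    phrases = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrases.add(" ".join(words[i:i + n]))
    return phrases

def label_phrases(text, author_keyphrases, max_words=3):
    """Map each candidate phrase to True (positive) or False (negative)."""
    targets = {k.lower() for k in author_keyphrases}
    return {p: (p in targets) for p in candidate_phrases(text, max_words)}

doc = "Decision tree induction is a form of machine learning."
labels = label_phrases(doc, ["machine learning", "decision tree induction"])
positives = sorted(p for p, is_key in labels.items() if is_key)
print(positives)
```

In this toy document only two of the candidates are positive, which hints at how skewed the classes are: most phrases in a document are negative examples.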
1. Introduction

Many journals ask their authors to provide a list of keywords for their articles. We call these keyphrases, rather than keywords, because they are often phrases of two or more words, rather than single words. We define a keyphrase list as a short list of phrases (typically five to fifteen noun phrases) that capture the main topics discussed in a given document. This paper is concerned with the automatic extraction of keyphrases from text.

Keyphrases are meant to serve multiple goals. For example, (1) when they are printed on the first page of a journal article, the goal is summarization. They enable the reader to quickly determine whether the given article is in the reader’s fields of interest. (2) When they are printed in the cumulative index for a journal, the goal is indexing. They enable the reader to quickly find a relevant article when the reader has a specific need. (3) When a search engine form has a field labelled keywords, the goal is to enable the reader to make the search more precise. A search for documents that match a given query term in the keyword field will yield a smaller, higher quality list of hits than a search for the same term in the full text of the documents. Keyphrases can serve these diverse goals and others, because the goals share the requirement for a short list of phrases that captures the main topics of the documents.

We define automatic keyphrase extraction as the automatic selection of important, topical phrases from within the body of a document. Automatic keyphrase extraction is a special case of the more general task of automatic keyphrase generation, in which the generated phrases do not necessarily appear in the body of the given document. Section 2 discusses criteria for measuring the performance of automatic keyphrase extraction algorithms.
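The distinction between extraction and generation can be made concrete: an extraction algorithm can only return author keyphrases that actually occur in the body. The sketch below checks which keyphrases are reachable in that sense. The matching rule (case-insensitive, punctuation ignored, whole words only) and the names `normalize` and `extractable` are assumptions for illustration, not the matching criterion used in the paper.

```python
import re

def normalize(text):
    """Lowercase and reduce text to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def extractable(author_keyphrases, body):
    """Return the author keyphrases that occur verbatim in the body."""
    # Pad with spaces so that only whole-word matches count.
    norm_body = f" {normalize(body)} "
    return [k for k in author_keyphrases if f" {normalize(k)} " in norm_body]

body = "We apply decision trees to the problem of keyphrase extraction."
keys = ["keyphrase extraction", "decision trees", "text summarization"]
found = extractable(keys, body)
coverage = len(found) / len(keys)
print(found, coverage)
```

Here "text summarization" never appears in the body, so it could be produced by a generation algorithm but not by an extraction algorithm.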
In the experiments in this paper, we measure the performance by comparing machine-generated keyphrases with human-generated keyphrases. In our document collections, an average of about 75% of the author’s keyphrases appear somewhere in the body of the corresponding document. Thus, an ideal keyphrase extraction algorithm could (in principle) generate