Digital Libraries 98 - Third ACM Conference on Digital Libraries, Edited by I. Witten, R. Akscyn, F. Shipman III, ACM Press, New York, pp. 89–98, 1998. Copyright © 1998 ACM. [Shortlisted for best paper award]

CiteSeer: An Automatic Citation Indexing System

C. Lee Giles, Kurt D. Bollacker, Steve Lawrence
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
{giles,kurt,lawrence}@research.nj.nec.com

ABSTRACT
We present CiteSeer: an autonomous citation indexing system which indexes academic literature in electronic format (e.g. Postscript files on the Web). CiteSeer understands how to parse citations, identify citations to the same paper in different formats, and identify the context of citations in the body of articles. CiteSeer provides most of the advantages of traditional (manually constructed) citation indexes (e.g. the ISI citation indexes), including: literature retrieval by following citation links (e.g. by providing a list of papers that cite a given paper), the evaluation and ranking of papers, authors, journals, etc. based on the number of citations, and the identification of research trends. CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations. Given a particular paper of interest, CiteSeer can display the context of how the paper is cited in subsequent publications. This context may contain a brief summary of the paper, another author's response to the paper, or subsequent work which builds upon the original article. CiteSeer allows the location of papers by keyword search or by citation links. Papers related to a given paper can be located using common citation information or word vector similarity. CiteSeer will soon be available for public use.

KEYWORDS: citation indexing, citation context, literature search, bibliometrics.

INTRODUCTION
A citation index [6] indexes the links between articles that researchers make when they cite other articles. Citation indexes are very useful for a number of purposes, including literature search, evaluation, and analysis of the academic literature. This paper introduces CiteSeer, which is an automatic citation indexing system. CiteSeer provides most of the advantages of traditional (manually constructed) citation indexes (e.g. the ISI citation indexes [10]), including: literature retrieval by following citation links (e.g. by providing a list of papers that cite a given paper), the evaluation and ranking of papers, authors, journals, etc. based on the number of citations, and the identification of research trends. CiteSeer has many advantages over traditional citation indexes, including a more up-to-date database which is not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.

CITATION INDEXING
References contained in academic articles are used to give credit to previous work in the literature and provide a link between the "citing" and "cited" articles. A citation index [6] indexes the citations that an article makes, linking the articles with the cited works.
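As a concrete illustration of this linking, the following is a minimal sketch (not taken from the paper; the class, method names, and example identifiers are illustrative) of the two lookup tables a citation index maintains, supporting navigation from an article to the works it cites and to the later articles that cite it.

    from collections import defaultdict

    class CitationIndex:
        """Toy citation index: maps each article to the works it cites,
        and each cited work back to the articles that cite it."""

        def __init__(self):
            self.cites = defaultdict(set)     # citing article -> cited works
            self.cited_by = defaultdict(set)  # cited work -> citing articles

        def add_citation(self, citing, cited):
            self.cites[citing].add(cited)
            self.cited_by[cited].add(citing)

        def references_of(self, article):
            """Navigate backward in time: the works this article cites."""
            return self.cites[article]

        def citers_of(self, article):
            """Navigate forward in time: later articles citing this one."""
            return self.cited_by[article]

    index = CitationIndex()
    index.add_citation("giles98", "breiman84")
    print(index.citers_of("breiman84"))   # {'giles98'}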
Citation indexes were originally designed mainly for information retrieval [7]. The citation links allow navigating the literature in unique ways. Papers can be located independent of language, and of the words in the title, keywords or document. A citation index allows navigation backward in time (the list of cited articles) and forward in time (which subsequent articles cite the current article?). Citation indexes can be used in many ways, e.g. a) citations can help to find other publications which may be of interest, b) the context of citations in citing publications may be helpful in judging the important contributions of a cited paper and the usefulness of a paper for a given query [7, 14], c) a citation index allows finding out where and how often a particular article is cited in the literature, thus providing an indication of the importance of the article, and d) a citation index can provide detailed analyses of research trends and identify emerging areas of science [8].

Existing Citation Indexes
The Institute for Scientific Information (ISI) [10] produces multidisciplinary citation indexes. One is the Science Citation Index® (SCI), intended to be a practical, cost-effective tool for indexing the significant scientific journals. The ISI databases are valuable and useful tools. A recurrent criticism against the ISI databases is that they are biased because of the management decisions of ISI, with respect to the selection of the items which are indexed [5]. Other ISI services include Keywords Plus® [7], which adds citation information to
the indexing of an article, Research Alert®, which provides weekly listings of citations related to a set of key references [8], and bibliographic coupling, which allows navigation by locating papers which share one or more references [8].

A Universal Citation Database
Cameron proposed a universal bibliographic and citation database which would link every scholarly work ever written [3]. He describes a system in which all published research would be available to and searchable by any scholar with Internet access. The database would include citation links and would be comprehensive and up-to-date. Perhaps the most important difference between Cameron's vision of a universal citation database and CiteSeer is that CiteSeer does not require any extra effort on the part of authors beyond placement of their work on the Web. CiteSeer automatically creates the citation database from downloaded documents whereas Cameron has proposed a system which requires authors or institutions to provide citation information in a specific format. A second relevant difference is that CiteSeer exists whereas Cameron's system is only a proposal that presents significant difficulty for implementation. Additionally, CiteSeer can extract the context of citations, improving literature search and evaluation.

CITESEER
CiteSeer downloads papers that are made available on the World Wide Web, converts the papers to text, parses them to extract the citations and the context in which the citations are made in the body of the paper, and stores the information in a database. CiteSeer includes algorithms for identifying and grouping variant forms of citations to the same paper. CiteSeer also performs full-text indexing of the articles and citations as well as providing support for browsing via citation links. Papers related to a given paper can be located using common citation information or word vector similarity. Given a particular paper of interest, CiteSeer can display the context of how the paper is cited in subsequent publications. This context may contain a brief summary of the paper, another author's response to the paper, or subsequent work which builds upon the original article.

Compared to the current commercial citation indexes, the citation indexing performed by CiteSeer has the following disadvantages:

1. CiteSeer does not cover the significant journals as comprehensively. We expect that this will be less of a disadvantage over time as more journals become available online (agreements with the publishers would be required to index most journals).

2. CiteSeer cannot distinguish subfields as accurately, e.g. CiteSeer will not disambiguate two authors with the same name. We expect that CiteSeer will improve in this regard over time due to the collection of databases and improvement of the algorithms used in CiteSeer.

The advantages of CiteSeer compared to the current commercial citation indexes include:

1. Because CiteSeer can index articles as soon as they are available on the Web, it should be of greater use to researchers for finding recent relevant literature, and for keeping up to date.

2. CiteSeer is autonomous, requiring no manual effort during indexing.

3. CiteSeer can be used to make a more informed estimation of the impact of a given article (citations do not always imply scholarly impact [2, 4, 9, 12, 17], and CiteSeer helps by making the context of citations easily and quickly browsable).
The following sections describe the document acquisition, document parsing, identification of identical citations, and database query/browsing.

Document Acquisition
CiteSeer can be used to create a comprehensive index of literature on the Web, or to create indexes of a user-specified topic. There are many ways in which CiteSeer can locate papers to index, e.g. a Web crawler could be used similar to the Web search engines, the location information could be extracted from the announcements of papers in Usenet message groups or mailing lists, or CiteSeer might index new issues of journals as they are made available (under agreement with the publisher). Currently, CiteSeer uses Web search engines (e.g. AltaVista, HotBot, Excite) and heuristics to locate papers (e.g. CiteSeer can search for pages which contain the words "publications", "papers", "postscript", etc.). CiteSeer locates and downloads Postscript files identified by ".ps", ".ps.Z", or ".ps.gz" extensions. URLs and Postscript files that are duplicates of those already found are detected and skipped.
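The extension filtering and duplicate skipping described above can be sketched as follows. This is an illustration only, not CiteSeer's code: the function names are invented, and content hashing is an assumed mechanism for duplicate detection, which the paper does not specify.

    import hashlib

    # Heuristics from the paper: query Web search engines for pages containing
    # words such as "publications", "papers", "postscript", then keep links to
    # Postscript files and skip anything already downloaded.
    POSTSCRIPT_EXTENSIONS = (".ps", ".ps.Z", ".ps.gz")

    def postscript_links(page_links):
        """Filter a page's outgoing links down to likely Postscript papers."""
        return [url for url in page_links if url.endswith(POSTSCRIPT_EXTENSIONS)]

    def is_duplicate(file_bytes, seen_digests):
        """Detect files whose contents were already fetched from another URL
        (content hashing here is an assumption, not the paper's method)."""
        digest = hashlib.md5(file_bytes).hexdigest()
        if digest in seen_digests:
            return True
        seen_digests.add(digest)
        return False

    links = ["http://example.org/paper.ps.gz", "http://example.org/index.html"]
    print(postscript_links(links))   # ['http://example.org/paper.ps.gz']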
Document Parsing
The downloaded Postscript files are first converted into text. We currently use PreScript from the New Zealand Digital Library project (http://www.nzdl.org/technology/prescript.html). The text file is checked to verify that the document is a valid research document by testing for the existence of a references or bibliography section. CiteSeer detects and reorders Postscript files that have their pages in reverse order. The following information is extracted from the documents:

- URL: The URL of the downloaded Postscript file is stored.
- Header: The title and author block of the paper is extracted.
- Abstract: If it exists, the abstract text is extracted.
- Introduction: If it exists, the introduction section is extracted.
- Citations: The list of references made by the document is extracted and parsed further as described below.
- Citation context: The context in which the document makes the citations is extracted from the body of the document.
- Full text: The full text of the document and citations is indexed.

Due to the wide variation in the formatting of document headers, not all subfields (title, author, author affiliation, addresses, etc.) are reliably detected. We plan to improve the parsing of document headers using either additional heuristics or machine learning techniques.

Once the set of references has been identified, individual citations are extracted. Each citation is parsed using heuristics to extract the following fields: title, author, year of publication, page numbers, and citation tag. The citation tag is the information in the citation that is used to cite that article in the body of the document (e.g. "[6]", "[Giles97]", "Marr 1982"). The citation tags are used to find the locations in the document body where the citations are actually made, allowing us to extract the context of these citations.

The heuristics used to parse the citations were constructed using an "invariants first" philosophy. That is, subfields of a citation which had relatively uniform syntactic indicators as to their position and composition, given all previous parsing, were always parsed next. For example, the label of a citation to mark it in context always exists at the beginning of a citation and the format is uniform across all citations. Once the more regular features of a citation were identified, trends in syntactic relationships between subfields to be identified and those already identified were used to predict where the desired subfield existed (if at all). For example, author information almost always precedes title information, and publisher almost always is after the title. We also use databases of author names, journal names, etc. which are used to help identify the subfields of the citations.

Parsing of natural language citations is difficult [1]. However, we have been able to achieve reasonably good results using heuristics to extract certain subfields. Table 1 shows parsing statistics for a sample CiteSeer database which was created from 5093 documents related to "neural networks". We plan on using learning techniques and additional heuristics in order to extract additional fields of the citations. Note that we have not attempted to exhaustively index all papers found on the Web in this area. We can see that the titles and authors can be found in citations roughly 80% of the time and page numbers roughly 40% of the time. The low number for page numbers detected is probably due (at least partially) to the fact that many citations did not contain page numbers.

Table 1: Citation parsing performance for a sample CiteSeer database.
  Database:                 neural networks
  Documents parsed:         5093
  Citations found:          89614
  Titles identified:        71908/89614 = 80.2%
  Authors identified:       73539/89614 = 82.1%
  Page numbers identified:  39595/89614 = 44.2%
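To make the "invariants first" idea concrete, here is a rough sketch of heuristic citation-field extraction. The regular expressions and the set of fields handled are illustrative assumptions, not the actual rules CiteSeer uses; author and title extraction, which relies on positional cues and name/journal databases, is omitted.

    import re

    def parse_citation(citation):
        """Heuristically extract the most regular ('invariant') fields first.
        Author and title extraction (author almost always precedes title,
        aided by databases of author and journal names) is omitted here."""
        fields = {}
        # Citation tag: appears at the very start, e.g. "[6]" or "[Giles97]".
        tag = re.match(r"\[([^\]]+)\]\s*", citation)
        if tag:
            fields["tag"] = tag.group(1)
            citation = citation[tag.end():]
        # Year of publication: a four-digit number such as 1984.
        year = re.search(r"\b(18|19|20)\d{2}\b", citation)
        if year:
            fields["year"] = year.group(0)
        # Page numbers: patterns such as "pp. 89-98" or "325-339".
        pages = re.search(r"(?:pp\.?\s*)?(\d+)\s*-{1,2}\s*(\d+)", citation)
        if pages:
            fields["pages"] = (pages.group(1), pages.group(2))
        return fields

    example = "[1] L. Breiman et al. Classification and Regression Trees. Wadsworth, 1984."
    print(parse_citation(example))  # {'tag': '1', 'year': '1984'}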
Identifying Citations to the Same Article
Citations to a given article can be made in significantly different ways. For example, the following citations, extracted from neural network publications, are all to the same article:

  [6] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Pacific Grove, California, 1984.

  L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth and Brooks, 1984.

  [1] L. Breiman et al. Classification and Regression Trees. Wadsworth, 1984.

Much of the significance of CiteSeer derives from the ability to recognize that all of these citations refer to the same article. This ability allows:

1. A detailed listing of a cited article to show all instances of the citation across multiple articles.

2. Statistics on citation frequency to be generated, allowing the estimation of the importance of articles.

3. More accurate identification of sub-fields for a citation, e.g. errors or incomplete fields in one instance of a citation may be resolved from analysis of the group of citations to the same article. More accurate identification of sub-fields for a citation leads to more accurate estimation of statistics based on fields, e.g. the citation frequency for authors.

As suggested by the example citations above, the problem is not completely trivial. We have investigated three methods of identifying citations to identical articles and grouping them together, along with a simple baseline method for comparison. For all of these methods, we found that normalizing certain aspects of citations tends to improve the results. We have used the following normalizations:

1. Conversion to lowercase.

2. Removal of hyphens.

3. Removal of citation tags, e.g. [3], [Giles 92] at the beginning of the citation.

4. Expansion of common abbreviations, e.g. conf. → conference, proc. → proceedings. Common abbreviations for the following words are currently expanded: conference, proceedings, international, society, transactions, and technical report.
5. Removal of extraneous words and characters which may occur in some, but usually not all, instances of a citation. The following words are removed: pp., pages, in press, accepted for publication, vol., volume, no., number, et. al, isbn. The following characters are removed: - & : ( ) [ ].

The methods we tested for identifying citations to identical articles are as follows:

1. Baseline Simple. A simple baseline method which iterates through all citations. For each citation, we find the maximum number of words which matches with a previous citation, normalized by the length of the shorter citation. If this number exceeds a threshold then the new citation is considered to be a citation to the same article as the previous citation, and the new citation is grouped with the previous citation. Otherwise, a new group is made for the citation.

2. Word Matching. An algorithm similar to the Baseline Simple algorithm which first sorts the citations by length, from the longest to the shortest.

3. Word and Phrase Matching. The Word Matching algorithm where sequences of two words within each subfield are also considered as terms in the matching process, i.e. this algorithm takes into account some of the ordering of the words, which is ignored by the previous algorithms. Separate thresholds are used for the single word and two word matches.

4. LikeIt. A method based on the LikeIt intelligent string comparison algorithm introduced by Yianilos [19, 18]. LikeIt is a sophisticated form of edit distance which tries to build an optimal weighted matching of the letters and multigraphs (groups of letters). LikeIt provides a distance measure between two citations. We use LikeIt in the following way. The citations are ordered by length, from longest to shortest, and we iterate through the citations. For each citation, we find the group with the smallest LikeIt distance; if this distance is below a threshold then the citation is considered identical and added to the previous group, otherwise a new group is created.

In order to evaluate the algorithms, we created three sets of citations taken from the neural network literature (corresponding to queries for the terms "Giles", "recurrent", and "fuzzy"). These sets contained 233, 377, and 941 citations respectively. We manually grouped the citations such that each group contained citations to the same paper. We ran the above algorithms on these sets and computed an error score for each case. The error score is the percentage of the groups in the correct grouping for which the automated grouping is not 100% correct. The results are shown in Table 2. The average percentage of the correct groups which were not 100% correct in the automated grouping were, from best to worst: Word and Phrase Matching 7.7%, Word Matching 10.0%, Baseline Simple 10.7%, LikeIt 16.7%. Note that an error of 10%, for example, does not imply that 10% of citations are not grouped correctly, e.g. the addition of one incorrect citation to a group marks the entire group as incorrect.

Table 2: Results for grouping identical citations with normalization. Results are the percentage of groups in the correct grouping for which the automated grouping is not 100% correct.
                            Giles   Recurrent   Fuzzy   Average
  Number of citations       233     377         941
  Baseline Simple           12%     9%          11%     10.7%
  Word Matching             13%     8%          9%      10.0%
  Word & Phrase Matching    10%     4%          9%      7.7%
  LikeIt                    22%     11%         17%     16.7%

LikeIt currently performs poorly. One explanation may be that two citations to the same article can be of significantly different lengths. LikeIt does not currently support a containment operation ("is a contained in b?" rather than "is a equal to b?") although there are plans to add such an operation. Alternatively, string distances may not be a good paradigm for determining the semantic similarity of citations, or we may not be using the LikeIt distance in the best manner. LikeIt considers citation strings at the level of letters instead of words, and the order of letters (and thus words). Both of these factors contribute to string differences which may not correspond to semantic differences.
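The normalization and word-overlap grouping that performed best above can be sketched roughly as follows. The rules follow the normalization list given earlier, but the threshold value, the punctuation handling, and the choice of group representative are illustrative assumptions rather than CiteSeer's actual implementation.

    import re

    ABBREVIATIONS = {"conf": "conference", "proc": "proceedings"}
    NOISE_WORDS = {"pp", "pages", "vol", "volume", "no", "number", "isbn"}

    def normalize(citation):
        """Apply the normalizations described above: lowercase, strip hyphens,
        drop a leading citation tag, expand common abbreviations, remove noise."""
        c = citation.lower()
        c = re.sub(r"^\s*\[[^\]]*\]\s*", "", c)     # remove leading tag, e.g. [3]
        c = c.replace("-", "")                       # remove hyphens
        c = re.sub(r"[&:()\[\]]", " ", c)            # remove extraneous characters
        words = []
        for w in c.split():
            w = w.strip(".,;")                       # drop trailing punctuation
            if not w or w in NOISE_WORDS:
                continue
            words.append(ABBREVIATIONS.get(w, w))
        return words

    def group_citations(citations, threshold=0.7):
        """Word-overlap grouping: assign each citation to the best-matching
        existing group if the overlap (normalized by the shorter citation)
        exceeds a threshold, otherwise start a new group."""
        groups = []                                   # each group: list of word lists
        for words in sorted(map(normalize, citations), key=len, reverse=True):
            best, best_score = None, 0.0
            for group in groups:
                rep = group[0]                        # longest member as representative
                overlap = len(set(words) & set(rep))
                score = overlap / (min(len(words), len(rep)) or 1)
                if score > best_score:
                    best, best_score = group, score
            if best is not None and best_score >= threshold:
                best.append(words)
            else:
                groups.append([words])
        return groups

    groups = group_citations([
        "[1] L. Breiman et al. Classification and Regression Trees. Wadsworth, 1984.",
        "L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth and Brooks, 1984.",
    ])
    print(len(groups))   # 1 (both variant forms end up in the same group)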
Query and Browsing
The first query to CiteSeer is a keyword search, which can be used to return a list of citations matching the query, or a list of indexed articles. The literature can then be browsed by following the links between the articles made by citations. Figure 1 shows a sample response for the query "dempster" in a CiteSeer database of neural network literature. A list of citations matching the query is shown, ranked by the number of citations to them. Once an initial keyword search is made, the user can browse the database using citation links. The user can find which papers are cited by a particular publication and which papers cite a particular publication, including the context of those citations. Figure 2 lists the papers which cite the first article in Figure 1, along with the context of the citations (obtained by clicking on the appropriate (Details) link in Figure 1). An example of full text search in the indexed articles is shown in Figure 3. Here the header information is given for documents which contain the keywords "conjugate" and "gradient". Details of a particular document can be found by choosing the link (Details). The details for a sample document are shown in Figure 4. The header, abstract, and list of references made by this document can be seen.

Figure 1: CiteSeer response for the query "dempster" in a sample database of neural network literature.
  Citations matching dempster
  Citations   Article
  76   Dempster, A., Laird, N., Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 39, 1-38. (Details)
  3    Dempster, A.P. (1967) Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statis. 38, 325-339. (Details)
  1    A. Dempster, "Construction and Local Computation Aspects of Network Belief Functions," in Influence Diagrams, Belief Nets and Decision Analysis, editors: R. Oliver and J. Smith, John Wiley & Sons Ltd., pp. 121-141, 1990. (Details)
  1    Dempster, A.P. (1971). Model searching and estimation in the logic of inference. In Foundations of Statistical Inference (eds. V.P. Godambe and D.A. Sprott), Toronto: Holt, Rinehart and Winston of Canada. (Details)
  1    A. P. Dempster, A Generalization of Bayesian Inference (with discussion), J. Royal Statistical Society ser. B 30: 205-247, 1968. (Details)
  [ ...section deleted... ]

FINDING RELATED DOCUMENTS
Given a database of documents, a user may find a document of interest and then want to find other, related documents. He or she may do this manually by using semantic features such as author, research group, or publication venue for the document.
However, CiteSeer can also find related documents using various measures of similarity.

There has been much interest in computing the distance (or the inverse, similarity) between a pair of documents (text strings). Most of the known distance measures between bodies of text rely on models of similarity of groups of letters in the text. One type of text distance measure is the string distance or edit distance, which considers distance as the amount of difference between strings of symbols. For example, the Levenshtein distance [11] is a well known early edit distance where the difference between two text strings is simply the number of insertions, deletions, or substitutions of letters required to transform one string into another. A more recent and sophisticated example is LikeIt, as mentioned earlier.

Another type of text string distance is based on statistics of words which are common to sets of documents, especially as part of a corpus of a large number of documents. One commonly used form of this measure, based on word frequencies, is known as term frequency × inverse document frequency (TFIDF) [16]. Sometimes only the stems of words are considered instead of complete words. An often used stemming heuristic introduced by Porter [13] tries to return the same stem from several forms of the same word (e.g. "walking", "walk", "walked" all become simply "walk"). In a document $d$, the frequency of each word stem $j$ is $f_{dj}$ and the number of documents having stem $j$ is $n_j$. In document $d$ the highest term frequency is called $f_{d,\max}$. In one such TFIDF scheme [15] a word weight $w_{dj}$ is calculated as:

$$
w_{dj} = \frac{\left(0.5 + 0.5\,\frac{f_{dj}}{f_{d,\max}}\right)\log\left(\frac{N_D}{n_j}\right)}
              {\sqrt{\sum_{k \in d}\left(0.5 + 0.5\,\frac{f_{dk}}{f_{d,\max}}\right)^2\left[\log\left(\frac{N_D}{n_k}\right)\right]^2}}
\qquad (1)
$$

where $N_D$ is the total number of documents. In order to find the distance between two documents, a dot product of the two word vectors for those documents is calculated. One limitation of this approach is the inherent noise – uncommon words may be shared by documents by coincidence, thereby giving false evidence that the documents are related. Another limitation of this approach is the ambiguity of words and phrases. For example "arm" could mean a human limb, or a weapon. Simple word frequencies do not differentiate these uses, requiring context analysis for separation.
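To make equation (1) concrete, the following small sketch computes these word-stem weights for a toy corpus. The crude suffix-stripping function is only a stand-in for Porter stemming, and the corpus and function names are illustrative; this is not CiteSeer's implementation.

    import math
    from collections import Counter

    def crude_stem(word):
        """Very rough stand-in for Porter stemming (illustrative only)."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def tfidf_weights(documents):
        """Compute w_dj per equation (1): augmented term frequency times
        log(N_D / n_j), cosine-normalized over the stems of each document."""
        stemmed = [[crude_stem(w) for w in doc.lower().split()] for doc in documents]
        n_docs = len(documents)                                   # N_D
        doc_freq = Counter()                                      # n_j
        for stems in stemmed:
            doc_freq.update(set(stems))
        weights = []
        for stems in stemmed:
            tf = Counter(stems)
            f_max = max(tf.values())                              # f_{d,max}
            raw = {j: (0.5 + 0.5 * f / f_max) * math.log(n_docs / doc_freq[j])
                   for j, f in tf.items()}
            norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
            weights.append({j: v / norm for j, v in raw.items()})
        return weights

    docs = ["neural networks learn representations",
            "citation indexing of neural network papers"]
    w = tfidf_weights(docs)
    # Document similarity is then the dot product of two weight vectors.
    similarity = sum(w[0].get(j, 0.0) * v for j, v in w[1].items())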
A third type of semantic distance measure is one in which knowledge about document components or structure is used. In the case of research publications, for example, the citation information can be used for computing similarity. CiteSeer uses these three methods for computing similarity:

Word Vectors
We have implemented a TFIDF scheme to measure a value of each word stem in each document, where a vector of all of the word stem values represents the "location" of a document in a word vector space. The projection of the word vector of one document on another document (dot product of the vectors) is the distance measure used. Currently, we use only the top 20 components of each document for computational reasons; however, there is evidence