Mining for Proposal Reviewers: Lessons Learned at the National Science Foundation

Seth Hettich, Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, sjh@ics.uci.edu

Michael J. Pazzani, Rutgers University, CoRE Building, Rm 706, 96 Frelinghuysen Rd, Piscataway, NJ 08854-8018, Pazzani@rutgers.edu

ABSTRACT

In this paper, we discuss a prototype application deployed at the U.S. National Science Foundation for assisting program directors in identifying reviewers for proposals. The application helps program directors sort proposals into panels and find reviewers for proposals. To accomplish these tasks, it extracts information from the full text of proposals both to learn about the topics of proposals and the expertise of reviewers. We discuss a variety of alternatives that were explored, the solution that was implemented, and the experience in using the solution within the workflow of NSF.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Data Mining

General Terms

Algorithms, Human Factors, Emerging applications, technology, and issues

Keywords

Keyword extraction, similarity functions, clustering, information retrieval.

1. INTRODUCTION

The National Science Foundation receives over 40,000 proposals a year. Each proposal is reviewed by several external reviewers. It is critical to the mission of the agency and the integrity of the review process that every proposal is reviewed by researchers with the expertise necessary to comment on the merit of the proposal. If there is not a good match between the topic of a proposal and the expertise of the reviewers, then it is possible that a project is funded that will not advance the progress of science or that a very promising proposal is declined. We explore the problem of using data mining technology to assist program directors in the review of proposals. Care is taken to match the technology to the existing workflow of the agency and to use technology to offer suggestions to program directors who ultimately make all decisions. Although this paper reports on reviewing proposals, we argue that the lessons and technology would also apply to the reviewing of papers submitted to conferences and journals.

Many proposals are reviewed in panels, i.e., a group of typically 8-15 reviewers who meet to discuss a set of 20-40 related proposals, with each panelist typically reviewing 6-10 proposals. Most proposals are submitted in response to a particular solicitation (e.g., “Information Technology Research”) or to a specific program (e.g., “Human Computer Interaction”). Individual program directors or, for larger solicitations, teams of program officers perform a number of tasks to insure that proposals are reviewed equitably. These tasks include:

1. Dividing the proposals into “clusters” of 20-40 related proposals to create panels.

2. Finding reviewers:
• Identify potential external reviewers to invite for each panel.
• Assign panelists as reviewers of proposals.
• If there is not adequate expertise on a panel to review a proposal, obtain “ad hoc” reviews from people with that expertise who are not on a panel.

In addition to this lengthy process, reviewers must not have a conflict of interest with proposals they are reviewing (e.g., they may not be from the same department as the proposal’s author), and a diverse group of panelists (both scientifically and demographically) is desirable to insure that multiple perspectives are represented in the review process.
Furthermore, due to scheduling or workload conflicts, not every invited reviewer accepts the invitation, requiring an iterative process of inviting a batch of reviewers and then inviting others to fill in gaps after the initial reviewers respond to the invitation.

A particular consideration at NSF is that many proposals are multidisciplinary, e.g., mining genome data. To determine if such a proposal is meritorious, it is important to consult some experts with backgrounds in data mining (to insure that the methods proposed are likely to work) and in the biological sciences (to insure that the problem addressed is an important open problem). If all reviewers have expertise in one area, it’s possible that an important problem would be addressed by a technique that isn’t
very promising or that very promising technology would be applied to a problem that is already solved.

2. Exploring Potential Solutions

Over the past decade, vendors have proposed various text mining technologies to NSF to help with the reviewing process. The most common technology proposed is automated text clustering to help organize proposals into panels. A variety of alternative approaches (e.g., hierarchical [1] or k-means [2]) have been explored. While these present interesting views of proposal submission data, they do not produce results that fit easily into the workflow of NSF or that have gained universal acceptance by program officers who organize panels and assign reviewers. Automated clustering approaches suffer from a number of flaws that have reduced their utility in dividing proposals into panels.

1. The size of clusters. Most clustering algorithms produce clusters of quite different sizes. Often, there are a few very large clusters and a larger number of very small clusters. In contrast, NSF panels are often approximately the same size due to logistical constraints ranging from the size of rooms to the number of proposals that can be discussed per day.

2. The stability of clusters. Dividing proposals into panels often occurs incrementally. Although most solicitations have deadlines, some proposals that come in before the deadline are misrouted and then found a few weeks later. Occasionally, due to severe weather or natural disasters, a deadline is extended for some regions of the country. Many clustering algorithms, if rerun on a slightly expanded data set, produce drastically different results. Some algorithms are stochastic in nature and produce different clusters when rerun on the same data (e.g., see [3]). It is difficult to convince program officers with different backgrounds and expertise that a computer system has found an ideal organization of a group of proposals if that organization changes drastically.

3. Lack of alignment with the organizational structure of NSF. The clusters produced by clustering algorithms rarely correspond to the scientific and organizational structure of NSF. Each panel has a program officer (or occasionally a team of 2-3 program officers) with specific expertise. When clusters are created automatically without regard to the organization and program officers’ expertise, some clusters do not correspond to established scientific fields, and no program director wants to be responsible for reviewing proposals that don’t fall within their general area of expertise.

4. Lack of alignment with the goals of the solicitation. For example, some solicitations focus on broadening participation in the scientific workforce, and it is useful to group proposals into panels that address issues such as increasing the participation of women and others that focus on increasing the participation of underrepresented ethnic groups. These panels have heterogeneous scientific content. Other solicitations focus on advancing the frontier of science and might divide proposals into panels by scientific subfield. Within a scientific panel, proposals might have heterogeneous broader impacts, such as increasing the participation of underrepresented groups or creating results of interest to industry.

In general, the problem with fully automated text clustering solutions is that they don’t leave room for human input of preferences or constraints. There has been some research that addresses the issues raised.
For example, the simplest k-means clustering algorithm is incremental and would allow for late additions to existing clusters. However, the results of k-means are not stable, so it produces different partitions of the same data on different runs. Several investigators (e.g., [4] and [5]) have looked at adding constraints to the clustering process so that clusters are approximately the same size. However, none of these address the lack of alignment with the organizational structure and workflow. In Section 3, we discuss an approach to “cluster checking” in which algorithms related to text clustering and classification are used to suggest improvements to clusters produced by people and to add new proposals to existing panels.

NSF has also explored and experimented with technology for assigning reviewers to proposals. One approach is to create a database of reviewers with keywords indicating user expertise. These databases are populated by users filling out a form with their expertise. Experience within NSF on prototypes of reviewer databases has found mixed results. Common problems include:

1. It is difficult for a scientific community to agree upon a taxonomy of keywords. One need only examine the ACM Computing Classification Scheme at http://www.acm.org/class/1998 to gain an appreciation for the difficulty. While this classification is adequate for a coarse sorting of papers into topic areas, the topic areas tend to be too coarse to be of much use in bringing expertise into the reviewing process. For example, the most fine-grained term representing the topic area of this conference is “Data Mining.” If this were used as the basis for assigning reviewers, then a system that uses a keyword-based approach would believe that anyone publishing in this conference would be equally qualified to review a proposal or paper on any topic in the conference. The Data Mining field has become sufficiently specialized that one can be an expert in one area (such as association rules) and not have detailed expertise in other areas (such as text classification), and an ideal reviewer for a proposal in one area may not be qualified for another area.

2. It is difficult to maintain such a keyword database over time. New topics arise in rapidly growing fields, requiring the taxonomy and database to be updated frequently. This is particularly important for a funding agency that has the goal of funding work at the frontier of science rather than
concentrating on incremental work in mature fields.

3. If unrestricted text is allowed as a description of expertise, it is rare that potential reviewers, program directors, and proposal authors all select the same free-text terms. Numerous studies of information retrieval systems have found low agreement among individuals assigning keywords to content (e.g., [8]).

4. There is not high compliance with requests for users to enter information into the database. Many researchers are too busy to fill out forms or hesitant to “volunteer” for reviewing. While agreeing to review proposals is a service to the funding agency, being asked to review proposals is as welcome to some as other forms of service such as serving on jury duty.

5. The interface for submitting proposals to NSF, FastLane, does not allow keywords to be entered describing the proposals. While this could be added to the interface, doing so would require consensus that it would facilitate proposal handling, and this has not been demonstrated convincingly.

Due to these limitations, keyword-based database systems, when they are used within NSF, are limited to suggesting a pool of candidates for a panel on a given topic. While the Computer and Information Science and Engineering directorate at NSF has experimented with a keyword system (e.g., in the 2001 ITR competition), it was not used in subsequent years.

Finally, NSF has experimented with systems that allow panelists to indicate preferences for reviewing proposals within a panel. In such systems, panelists indicate their preference for reviewing a proposal on a numeric scale. Many conferences also use similar systems such as Cyberchair [9]. In Cyberchair, a constraint satisfaction algorithm assigns people the proposals they are most interested in. These systems only address part of the reviewer assignment problem: they do not assist with identifying panelists but only with assigning proposals to panelists once they have been identified. There has been an issue with compliance on these systems as well, i.e., not every panelist promptly enters preference data, and a single person not replying can delay the assignments for all others. In addition, it isn’t clear what the preference scores mean or how much thought goes into the assignments. While the intent is to judge how well qualified a reviewer is to review a proposal, we have observed many panelists having a strong preference for proposals by well-known researchers and fewer having a preference for proposals by less established researchers. While NSF typically asks for preferences on 20-30 proposals, some conferences ask for preference data on 200-300 papers. The second author admits that when presented with 300 papers in Cyberchair, not as much time is spent reviewing the abstracts of the last batch of papers as the first to determine preferences. Finally, there is also a problem with multidisciplinary proposals if people from only one discipline have a preference for a paper. It can occur that all computer scientists and no biologists give high preference scores to a bioinformatics proposal, in which case a preference-based system will result in one aspect of the proposal not being reviewed.

3. Revaide

We have deployed a prototype system, Revaide, within NSF that addresses the problems with previous fully autonomous systems. The philosophy behind the system is to assist program directors and not replace their judgment with a black box system. One key design criterion is that Revaide offers suggestions that may be accepted or declined individually.
In this section, we introduce Revaide, its tasks and solutions, and evaluate the utility of using Revaide. We introduce a measure to evaluate how well the expertise of a group of reviewers is suited for a proposal. Following the discussion of the key components of Revaide in this section, we report on the experience of using the algorithm.

3.1 Representing Proposals

Proposals are submitted to NSF in PDF form. Revaide converts the proposals to ASCII and represents proposals in the standard TF-IDF vector space [10] as term vectors in the space of all words in the document collection. The entire proposal is used, including the references and the resume of the investigator. One simple use of Revaide is to annotate spreadsheets of proposals with the 20 terms with the highest TF-IDF weights. These keywords are often more informative than the title in helping program directors determine what a proposal is about. While early versions of Revaide used stemming [11] to convert words to root forms, we found that stemming reduced the human comprehensibility of the resulting term vector representation. Experience showed that using stemming did not increase the quality of the suggestions made by Revaide. Therefore, we no longer use stemming.

One other enhancement also increased the comprehensibility of the resulting term representation. We augmented the stoplist of items that should not be used as keywords. While most stoplists include common words such as articles and prepositions, we augmented the stoplist to include words that appeared in proposals but were not descriptive of the proposal content, including the e-mail addresses of PIs and the name and city of the university. These words frequently occur within a few proposals and not in many others, giving them high TF-IDF weights, but they confused program directors when used as keywords and degraded the quality of Revaide’s suggestions.

An example will illustrate the representation used by Revaide for one proposal. The terms with the highest weights and their weights were image: 0.031, judgments: 0.028, feedback: 0.027, relevance: 0.026, multimodal: 0.020, retrieval: 0.019, and preference: 0.017. To preserve the privacy of the submitter, we cannot provide the title or abstract, but we find that the automatically extracted keywords do indeed provide a compact representation that makes sense to program directors and provides a basis to assist reviewers.
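To make the representation concrete, the following minimal sketch (in Python, using scikit-learn's TfidfVectorizer; the proposal texts, the extra stop words, and all variable names are placeholders rather than Revaide's actual code) builds normalized TF-IDF vectors over an augmented stoplist and reports the top-weighted terms of a proposal as keywords.

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
import numpy as np

# Placeholder proposal texts standing in for the ASCII extracted from the PDFs.
proposal_texts = [
    "relevance feedback for multimodal image retrieval using preference judgments",
    "energy efficient routing protocols for wireless sensor networks",
]

# Augment the standard stoplist with non-descriptive terms such as pieces of the
# PI's e-mail address and the name and city of the university (illustrative values).
extra_stopwords = {"edu", "university", "anytown", "stateu"}
stoplist = list(ENGLISH_STOP_WORDS.union(extra_stopwords))

# No stemming: stemming reduced the comprehensibility of the extracted keywords.
vectorizer = TfidfVectorizer(stop_words=stoplist, norm="l2")
X = vectorizer.fit_transform(proposal_texts)           # one unit-length vector per proposal
terms = np.array(vectorizer.get_feature_names_out())

def top_keywords(row, k=20):
    # Return the k highest-weighted terms of one proposal vector.
    weights = row.toarray().ravel()
    order = weights.argsort()[::-1][:k]
    return [(terms[i], round(float(weights[i]), 3)) for i in order if weights[i] > 0]

print(top_keywords(X[0]))   # keywords used to annotate the proposal spreadsheet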
3.2 Representing Reviewer Expertise

Revaide represents the expertise of a reviewer with the TF-IDF representation of the proposals they have submitted to NSF in the past. While it would be possible to use published papers of authors downloaded from CiteSeer [12] or Google Scholar as measures of expertise, there are advantages in using NSF proposals in a practical system deployed at NSF. First, all proposals are similar in style and length. These conditions are ideal for keyword extraction with TF-IDF. Second, the proposals have a variety of meta-data that is useful in other aspects of the process. This meta-data includes the PI’s name, e-mail address and other contact information, and an NSF ID for the PI’s university. This meta-data simplifies contacting the PI and checking for conflicts of interest between proposals and reviewers. Third, NSF has a strong preference for using people with Ph.D. degrees as reviewers, and one can’t distinguish new graduate students from professors on published papers. By using people who have submitted to NSF as a reviewer pool, this problem is avoided since those eligible to apply to NSF are eligible to review. Fourth, using proposals avoids the problem of disambiguating people with common names. Finally, it automatically creates a large pool of potential reviewers. A disadvantage of this approach is that it does not include people who do not submit to NSF, such as researchers from industry or from outside the US. Of course, program directors may identify such people through the usual means, such as checking the editorial boards of journals and the program committees of conferences.

In practice, we restrict Revaide’s pool of reviewers to the authors of proposals that have been judged as “fundable” by the review process to insure that the reviewers were thought by their peers to have expertise in the area. We also leave out proposals with more than one author so that it is clear who has the expertise in a proposal. When more than one past proposal is available for a given author, all of the proposals are combined by adding and then re-normalizing the term vectors to form a model of the expertise. The example proposal representation in the previous section would also serve as the expertise representation of the author that submitted the proposal.

3.3 Cluster Checking

The first task we consider is assisting groups of program directors to form panels. The most help is needed in large competitions where 500-1500 proposals may be submitted at a time. NSF’s system produces a spreadsheet that includes columns containing information such as the author’s name, institution, the title of the proposal, and links to the abstract and the PDF of the entire proposal. Teams of program directors manually sort these proposals first into general areas and then into panels of 20-30 proposals. Due to the short time and large number of proposals, it is possible that a proposal could be put into a panel with only a loose relationship to the majority of the proposals. Due to the distributed nature of the work, it is also possible that no one claims responsibility for a proposal.

As described earlier, attempts to use automated clustering failed at this task when program directors didn’t accept the results of the clustering system. Instead of automatically clustering, Revaide checks the clusters produced by program directors for coherence and suggests improvements. In addition, Revaide suggests panels for “orphan” proposals that are not assigned to a panel. Furthermore, before program directors form panels, the spreadsheet they use is first augmented with the terms that have the highest TF-IDF weights of each proposal (although the weights are not included, the terms are ordered by weight).

The first step in cluster checking is to form a representation of the important terms of the cluster.
In Revaide, this is done by finding the centroid [10] of the proposals that are in each cluster, essentially creating a term vector for each cluster that is the “average” of the term vectors of the proposals. Next, the cosine similarity [10] is found between each proposal’s term vector and each cluster’s term vector. Revaide produces a summary of the important terms in each cluster. These terms are chosen based on a weighted TF-IDF score. The example below illustrates such a summary. In addition to the TF-IDF weight of each term, Revaide also prints out the number of proposals in the cluster that contain each term.

The top 20 terms of panel ROB are: robot: 0.267 (in 24/28) sensor: 0.203 (in 28/28) vehicl: 0.144 (in 22/28) imag: 0.114 (in 22/28) motion: 0.107 (in 22/28) intellig: 0.104076 (in 25/28) mobil: 0.102 (in 23/28) agent: 0.094 (in 18/28) autom: 0.091 (in 25/28) movement: 0.078 (in 17/28) action: 0.077 (in 23/28) sens: 0.068554 (in 26/28) autonom: 0.068 (in 25/28) self: 0.068 (in 21/28) assembl: 0.064 (in 18/28)

(This example comes from an earlier version of Revaide that used stemming [11], perhaps also illustrating why we turned stemming off in later versions.)

If the most similar cluster to a proposal is not the cluster to which the proposal has been assigned, that is a sign that the proposal is potentially in the wrong cluster. Such discrepancies are pointed out to the program director with a suggestion to move the proposal to another panel. Below, the output of cluster checking is shown, omitting any identifying information.

The top 20 terms of panel CIP-SC are: sensor: 0.355 (in 31/32) vehicl: 0.2493 (in 22/32) wireless: 0.178 (in 29/32) monitor: 0.157 (in 32/32) node: 0.147 (in 27/32) transport: 0.136 (in 29/32) devic: 0.132 (in 30/32) signal: 0.129 (in 30/32) traffic: 0.129 (in 22/32) grid: 0.119 (in 21/32) event: 0.116937 (in 32/32) energi: 0.107 (in 29/32) transmiss: 0.105 (in 25/32) protocol: 0.103 (in 27/32) flow: 0.103 (in 26/32) layer: 0.100317 (in 25/32) mobil: 0.100 (in 26/32) rout: 0.096 (in 23/32) agent: 0.092 (in 17/32) safeti: 0.091 (in 25/32)

Panel DSP is a better match for proposal NSF04XXXX1 than cluster CIP-SC.

In our experience, Revaide recommends a better panel for approximately 5% of the proposals. We have received comments from program directors that include, “Thanks, I don’t know how I overlooked that,” in response to Revaide’s cluster checking. Often, Revaide finds a better panel based on a matter of emphasis within a proposal, e.g., determining that a proposal will make a contribution to computer vision for astronomical applications as opposed to making a contribution to astronomy using existing computer vision techniques.
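The cluster-checking computation itself is straightforward; the sketch below (Python with NumPy, continuing the TF-IDF sketch above; the panels, the tiny four-term vectors, and the function name are hypothetical, not Revaide's code) forms a centroid per panel, scores every proposal against every centroid with cosine similarity, and flags any proposal whose assigned panel is not its most similar one.

import numpy as np

def check_clusters(X, panel_of, panel_names):
    # X: (n_proposals x n_terms) array of L2-normalized TF-IDF vectors.
    # panel_of: assigned panel index of each proposal.
    # Returns (proposal index, assigned panel, better panel) suggestions.
    X = np.asarray(X, dtype=float)
    centroids = []
    for p in range(len(panel_names)):
        members = X[[i for i, a in enumerate(panel_of) if a == p]]
        c = members.mean(axis=0)                   # "average" term vector of the panel
        centroids.append(c / (np.linalg.norm(c) or 1.0))
    sims = X @ np.vstack(centroids).T              # cosine similarity via dot products
    suggestions = []
    for i, assigned in enumerate(panel_of):
        best = int(sims[i].argmax())
        if best != assigned:
            suggestions.append((i, panel_names[assigned], panel_names[best]))
    return suggestions

# Hypothetical data: four proposals over four terms; the last one looks misplaced.
X = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.8, 0.2, 0.0, 0.0],
              [0.0, 0.1, 0.9, 0.3],
              [0.1, 0.0, 0.8, 0.4]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
print(check_clusters(X, panel_of=[0, 0, 1, 0], panel_names=["ROB", "CIP-SC"]))

On this toy data the last proposal is flagged with a suggestion to move it from the first panel to the second, mirroring the kind of message shown above.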
A special case of cluster checking occurs when a proposal has not been put into any panel. This can happen if no member of the distributed team of program directors has identified that the proposal falls within the scope of their panel. In this case, the panel that is most similar to the proposal is found, together with the next three, as determined by cosine similarity between the orphan proposal vector and the centroids of the panels. The output below illustrates this process.

Top terms of NSF04XXXX2 are: sensor, wireless, hierarch, node, channel, energi, signal, rout, alloc, poor, radio, path.
Cluster WON2 is the best match for NSF04XXXX2.
Alternate panels for orphan: Cluster WON3, Cluster WON, Cluster CIP-SC.

This algorithm for assigning an orphan proposal to a panel is related to Rocchio’s algorithm for text classification [13]. The MailCAT system [14] used the idea of displaying a few possible folders for filing e-mail messages, analogous to the way that Revaide finds a few possible panels. In both cases, the idea is to cope with the reality that text classification is not 100% accurate while providing benefit by focusing a person on a few possibilities out of the many that are available.

3.4 Proposal Classification

Revaide has the capability of performing text classification. The algorithm for recommending a panel for orphan proposals is one use of text classification. This section describes another use: performing an initial assignment of proposals to program directors. Recall that teams of program directors sort through proposals to identify the major area before further subdividing them into panels. Revaide can use a text classification algorithm to perform this initial sort. In this case, the training data is the previous year’s proposals and the class is the name of the program officer who organized the review panel the previous year. That is, the goal of the text classification is to find the person who will assume initial ownership of this year’s proposals based upon their responsibilities in the prior year. (Because many program officers are rotators who spend a short time at NSF, the initial assignment may be based upon the program officer’s predecessor’s proposals.) The initial program director either places a proposal into a panel they will organize or passes it to another program officer who is a better match for the proposal. In a study using cross validation of the 2004 proposals submitted to Information and Intelligent Systems, the classification accuracy was 80.9%. This clearly is not good enough for a fully automated system. However, it provides tremendous benefits within the existing workflow. For example, rather than having 10 people each sort through 1000 proposals to find proposals of interest, each person is initially assigned approximately 100 by the text classification algorithm. Each program director then reviews those 100 proposals and on average needs to find a better program director for 20 proposals. This has greatly reduced the amount of effort required to identify the best program officer for each proposal.

Revaide assists with each step of the panel formation process, first by recommending an initial program officer. Once the final program officer is decided upon for each proposal (this overview slightly simplifies the process; two program directors may decide to hold a joint panel, e.g., at the intersection of databases and artificial intelligence), the proposals are manually subdivided into panels and the panels are checked for coherence. A proposal might be “orphaned” if it was initially misrouted or delayed or if no program officer claimed responsibility in the initial sort. It is then assigned to a program director in the panel checking stage. In the next section, we discuss assisting in the assignment of reviewers to proposals.
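Before turning to reviewer assignment, the following sketch illustrates the initial sort just described (the specific classifier is not specified above, so a Rocchio-style nearest-centroid model over TF-IDF vectors is assumed here, and the training texts and officer names are made up): one centroid per program officer is built from the prior year's proposals, and each new proposal is routed to the officer with the most similar centroid.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data: last year's proposal texts, labeled with the program
# officer who organized the panel that reviewed each one.
last_year_texts = [
    "data mining association rules clustering algorithms",
    "machine learning text classification and categorization",
    "wireless sensor networks and routing protocols",
    "network protocols for energy efficient transmission",
]
last_year_officer = ["officer_iis", "officer_iis", "officer_cns", "officer_cns"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(last_year_texts).toarray()

# Rocchio-style model: one normalized centroid of TF-IDF vectors per officer.
officers = sorted(set(last_year_officer))
centroids = {}
for officer in officers:
    rows = X[[i for i, o in enumerate(last_year_officer) if o == officer]]
    c = rows.mean(axis=0)
    centroids[officer] = c / np.linalg.norm(c)

def initial_sort(text):
    # Suggest the program officer whose prior proposals are most similar.
    v = vectorizer.transform([text]).toarray().ravel()
    v = v / (np.linalg.norm(v) or 1.0)
    return max(officers, key=lambda o: float(v @ centroids[o]))

print(initial_sort("a proposal on clustering and mining large text collections"))

An accuracy estimate like the 80.9% figure reported above can be obtained by holding out part of the labeled prior-year proposals and scoring the suggested officer against the actual one.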
3.5 Assigning Reviewers

The most straightforward way to choose N reviewers for a proposal would simply be to select the N authors of the previous proposals that are the most similar to the new proposal to be reviewed. This is the approach that has been used in some past efforts at automatic reviewer assignment (e.g., [15]). This approach does a fair job but has some important drawbacks. The main problem occurs when a proposal has more than one topic (a fairly common occurrence) and one topic dominates the match with other proposals. This leads to a set of reviewers that all have the same expertise, often leaving other topics in the target document uncovered. For example, consider a document about data mining using Gaussian mixture models to predict outcomes in a medical context. Ideally, you would want a mix of reviewer expertise for this document: general data mining, the specific technique being used, as well as the field it is being applied to. Simply selecting reviewers by document similarity would tend to select reviewers who matched most closely to the primary topic of the paper (as determined by the TF-IDF weighting process), possibly failing to select any reviewers at all for an important secondary topic of the document.

To solve this problem, we approach the task slightly differently. Instead of finding the N closest matches for the target proposal, we look for the set of N proposals that together best match the target document. We define a measure that indicates the degree of the overlap between the terms in a proposal vector and a set of expertise vectors. We represent a proposal as a normalized weighted vector of terms, $\vec{P} = \langle p_1, \ldots, p_n \rangle$. Similarly, we represent a reviewer's expertise as a normalized vector, $\vec{E} = \langle e_1, \ldots, e_n \rangle$, where $p_i$ is the weight of term $i$ in the proposal and $e_i$ is the weight of term $i$ in a reviewer's expertise vector. We define a residual term vector to represent the relevant terms in the proposal that are not in the expertise of the reviewer. The weight of each term of the residual vector is the difference between the weights in the proposal and expertise vectors, with a minimum of 0:

$\vec{R} = \langle \max(0,\, p_1 - e_1), \ldots, \max(0,\, p_n - e_n) \rangle$.

More generally, there is typically more than one reviewer, and we define the residual term vector when there are $k$ reviewers to be

$\vec{R} = \langle \max(0,\, p_1 - \sum_{i=1}^{k} \varepsilon\, e_{1,i}), \ldots, \max(0,\, p_n - \sum_{i=1}^{k} \varepsilon\, e_{n,i}) \rangle$,

where $e_{j,i}$ is the weight of term $j$ in the expertise vector of reviewer $i$.
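To make the residual concrete, the sketch below (Python with NumPy; the greedy selection strategy, the choice of $\varepsilon = 1$, and the four-term vectors are illustrative assumptions rather than the exact procedure Revaide uses) computes the residual for a set of chosen reviewers and then picks reviewers one at a time so that each addition covers as much of the remaining residual as possible.

import numpy as np

def residual(p, expertise_vectors, eps=1.0):
    # Proposal weights not yet covered by the chosen reviewers, floored at zero.
    covered = eps * np.sum(expertise_vectors, axis=0) if expertise_vectors else 0.0
    return np.maximum(0.0, p - covered)

def pick_reviewers(p, experts, k=3):
    # Greedy sketch: repeatedly choose the reviewer whose expertise vector most
    # reduces the remaining residual of proposal vector p.
    chosen, chosen_vecs = [], []
    for _ in range(k):
        r = residual(p, chosen_vecs)
        if not r.any():
            break                        # every term of the proposal is covered
        best = max((name for name in experts if name not in chosen),
                   key=lambda name: r.sum() - residual(p, chosen_vecs + [experts[name]]).sum())
        chosen.append(best)
        chosen_vecs.append(experts[best])
    return chosen

# Hypothetical 4-term space: [data mining, mixture models, medicine, sensors].
p = np.array([0.6, 0.5, 0.6, 0.0])                   # proposal spans three topics
experts = {
    "dm_expert":    np.array([0.9, 0.4, 0.1, 0.0]),
    "stats_expert": np.array([0.2, 0.9, 0.0, 0.3]),
    "med_expert":   np.array([0.0, 0.1, 1.0, 0.0]),
}
print(pick_reviewers(p, experts, k=2))               # ['dm_expert', 'med_expert']

Note that the second reviewer chosen is the medical expert rather than a second data-mining expert: the residual rewards covering the proposal's secondary topics, which is exactly the motivation given above.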