Topic Extraction from Scientific Literature for Competency Management

Paul Buitelaar, Thomas Eigner
DFKI GmbH, Language Technology Lab & Competence Center Semantic Web
Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
paulb@dfki.de

Abstract

We describe an approach towards automatic, dynamic and time-critical support for competency management and expertise search through topic extraction from scientific publications. In the use case we present, we focus on the automatic extraction of scientific topics and technologies from publicly available publications using web sites like Google Scholar. We discuss an experiment for our own organization, DFKI, as an example of a knowledge organization. The paper presents evaluation results over a sample of 48 DFKI researchers that responded to our request for a-posteriori evaluation of automatically extracted topics. The results of this evaluation are encouraging and provided us with useful feedback for further improving our methods. The extracted topics can be organized in an association network that can be used further to analyze how competencies are interconnected, thereby also enabling a better exchange of expertise and competence between researchers.

1 Introduction

Competency management, the identification and management of experts on and their knowledge in certain competency areas, is a growing area of research as knowledge has become a central factor in achieving commercial success. It is of fundamental importance for any organization to keep up-to-date with the competencies it covers, in the form of experts among its work force. Identification of experts will be based mostly on recruitment information, but this is not sufficient as competency coverage (competencies of interest to the organization) and structure (interconnections between competencies) change rapidly over time. The automatic identification of competency coverage and structure, e.g.
from publications, is therefore of increasing importance, as this allows for a sustainable, dynamic and time-critical approach to competency management.

In this paper we present a pattern-based approach to the extraction of competencies (scientific topics, technologies) in a knowledge-based research organization from publicly available scientific publications. The core assumption of our approach is that such topics will not occur in random fashion across documents, but instead occur only
in specific scientific discourse contexts that can be precisely defined and used as patterns for topic extraction.

The remainder of the paper is structured as follows. In section 2 we describe related work in competency management and argue for an approach based on natural language processing and ontology modeling. We describe our specific approach to topic extraction for competency management in detail in section 3. The paper then continues with the description of an experiment that we performed on topic extraction for competency management in our own organization, DFKI. Finally, we close with conclusions that can be drawn from our research and ideas for future work that arise from these.

2 Related Work

Competency management is a growing area of knowledge management that is concerned with the “identification of skills, knowledge, behaviors, and capabilities needed to meet current and future personnel selection needs, in alignment with the differentiations in strategies and organizational priorities.” [1] Our particular focus here is on aspects of competency management relating to the identification and management of knowledge about scientific topics and technologies, which lies at the basis of competency management.

Most of the work on competency management has been focused on the development of methods for the identification, modeling, and analysis of skills and skills gaps, and on training solutions to help remedy the latter. An important initial step in this process is the identification of skills and knowledge of interest, which is mostly done through interviews, surveys and manual analysis of existing competency models. Recently, ontology-based approaches have been proposed that aim at modeling the domain model of particular organization types (e.g. computer science, health-care) through formal ontologies, over which matchmaking services can be defined for bringing together skills and organization requirements (e.g. [2], [3]).
The development of formal ontologies for competency management is important, but there is an obvious need for automated methods in the construction and dynamic maintenance of such ontologies. Although some work has been done on developing automated methods for competency management through text and web mining (e.g. [4]), this is mostly restricted to the extraction of associative networks between people according to documents or other data they are associated with. Instead, for the purpose of automated and dynamic support of competency management, a richer analysis of competencies and semantic relations between them is needed, as can be extracted from text through natural language processing.

3 Approach

Our approach towards the automatic construction and dynamic maintenance of ontologies for competency management is based on the extraction of relevant competencies and semantic relations between them through a combination of linguistic patterns, statistical methods as used in information retrieval and machine learning, and background knowledge if available.

Central to the approach as discussed in this paper is the use of domain-specific linguistic patterns for the extraction of potentially relevant competencies, such as scientific topics and technologies, from publicly available scientific publications. In this text type, topics and technologies will occur in the context of cue phrases such as ‘developed a tool for XY’ or ‘worked on methods for YZ’, where XY and YZ are possibly relevant competencies that the authors of the scientific publication are or have been working on. Consider for instance the following excerpts from three scientific articles in chemistry:

…profile refinement method for nuclear and magnetic structures…
…continuum method for modeling surface tension…
…a screening method for the crystallization of macromolecules…

In all three cases a method is discussed for addressing a particular problem that can be interpreted as a competency topic: ‘nuclear and magnetic structures’, ‘modeling surface tension’, ‘crystallization of macromolecules’. The pattern that we can thus establish from these examples is as follows:

method for [TOPIC]

as in:

method for [nuclear and magnetic structures]
method for [modeling surface tension]
method for [(the) crystallization of macromolecules]

Other patterns that we manually identified in this way are:

approach for [TOPIC]
approaches for [TOPIC]
approach to [TOPIC]
approaches to [TOPIC]
methods for [TOPIC]
solutions for [TOPIC]
tools for [TOPIC]

We call these the ‘context patterns’, which as their name suggests provide the lexical context for the topic extraction. The topics themselves can be described by so-called ‘topic patterns’, which describe the linguistic structure of possibly relevant topics that can be found in the right context of the defined context patterns.
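As a minimal illustration of the context-pattern step (a hypothetical sketch, not the authors' implementation; the pattern list is taken from the paper, the function name and window handling are assumptions):

```python
import re

# Context patterns listed in the paper (plus the singular 'method for'
# from the introductory example).
CONTEXT_PATTERNS = [
    "approach for", "approaches for", "approach to", "approaches to",
    "method for", "methods for", "solutions for", "tools for",
]

def extract_topic_texts(text, window=10):
    """Return the 'topic texts': windows of up to `window` words to the
    right of each context-pattern match in `text`."""
    topic_texts = []
    for pattern in CONTEXT_PATTERNS:
        for match in re.finditer(r"\b" + re.escape(pattern) + r"\b",
                                 text, re.IGNORECASE):
            right_context = text[match.end():].split()
            topic_texts.append(" ".join(right_context[:window]))
    return topic_texts

sample = "We developed a screening method for the crystallization of macromolecules in solution."
print(extract_topic_texts(sample))
# → ['the crystallization of macromolecules in solution.']
```

Each extracted window is then handed to the part-of-speech tagger and topic-pattern matcher described next.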
Topic patterns are defined in terms of part-of-speech tags that indicate if a word is for instance a noun, verb, etc. For now, we define only one topic pattern that defines a topic as a noun (optional) followed by a sequence of zero or more adjectives followed by a sequence of one or more nouns. Using the part-of-speech tag set for English of the Penn Treebank [5], this can be defined formally as follows (JJ indicates an adjective, NN a noun, NNS a plural noun):

(.*?)((NN(S)? |JJ )*NN(S)?)

The objective of our approach is to automatically identify the most relevant topics for a given researcher in the organization under consideration. To this end we download all papers by this researcher through Google Scholar, run the context patterns over these papers and extract a window of 10 words to the right of each matching occurrence.

We call these extracted text segments the ‘topic text’, which may or may not contain a potentially relevant topic. To establish this, we first apply a part-of-speech tagger (TnT: [6]) to each text segment and subsequently run the defined topic pattern over the output of this. Consider for instance the following examples of context pattern, extracted topic text in its right context, part-of-speech tagged version 1 and matched topic pattern:

approach to
semantic tagging , using various corpora to derive relevant underspecified lexical
JJ NN , VBG JJ NN TO VB JJ JJ JJ
semantic tagging

solutions for
anaphoric expressions . Accordingly , the system consists of three major modules :
JJ NNS . RB , DT NN VBZ IN CD JJ NNS :
anaphoric expressions

tools for
ontology adaptation and for mapping different ontologies should be an
NN NN CC IN VBG JJ NNS MD VB DT
ontology adaptation

approach for
modeling similarity measures which tries to avoid the mentioned problems
JJ NN NNS WDT VBZ TO VB DT VBN NNS
modeling similarity measures

methods for
domain specific semantic lexicon construction that builds on the reuse
NN JJ JJ NN NN WDT VBZ IN DT NN
domain specific semantic lexicon construction

1 Clarification of the part-of-speech tags used: CC: conjunction; DT, WDT: determiner; IN: preposition; MD: modal verb; RB: adverb; TO: to; VB, VBG, VBP, VBN, VBZ: verb
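Since the tagger output is a sequence of (word, tag) pairs, the topic pattern can be applied as a scan over the tags. The following is a hypothetical re-implementation (not the authors' code) that finds the first maximal run of JJ/NN/NNS tags and trims it so that it ends in a noun, mirroring the regular expression above:

```python
# Sketch of the topic-pattern matcher: equivalent to matching
# (NN(S)? |JJ )*NN(S)? anywhere in the tag sequence.
TOPIC_TAGS = ("JJ", "NN", "NNS")

def match_topic(tagged):
    """tagged: list of (word, tag) pairs for one topic text.
    Returns the first maximal adjective/noun span that ends in a noun,
    or None if no such span exists."""
    i = 0
    while i < len(tagged):
        if tagged[i][1] in TOPIC_TAGS:
            # extend the run of candidate tags
            j = i
            while j < len(tagged) and tagged[j][1] in TOPIC_TAGS:
                j += 1
            # trim trailing adjectives so the span ends in NN/NNS
            k = j
            while k > i and tagged[k - 1][1] == "JJ":
                k -= 1
            if k > i:
                return " ".join(word for word, _ in tagged[i:k])
            i = j
        else:
            i += 1
    return None
```

For the first example above, `match_topic([("semantic", "JJ"), ("tagging", "NN"), (",", ","), ("using", "VBG")])` returns `"semantic tagging"`.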
As can be observed from the examples above, the topic to be extracted will mostly be found directly at the beginning of the topic text. However, in some cases the topic will be found only later on in the topic text, e.g. in the following examples 2:

approach to
be used in a lexical choice system , the model of
VB VBN IN DT JJ NN NN , DT NN IN
lexical choice system

approach for
introducing business process-oriented knowledge management , starting on the …
VBG NN JJ NN NN , VBG IN DT …
business process-oriented knowledge management

The topics that can be extracted in this way now need to be assigned a measure of relevance, for which we use the well-known TF/IDF score that is used in information retrieval to assign a weight to each index term relative to each document in the retrieval data set [7]. For our purposes we apply the same mechanism, but instead of assigning index terms to documents we assign extracted topics (i.e. ‘terms’) to individual researchers (i.e. ‘documents’) for which we downloaded and processed scientific publications. The TF/IDF measure we use for this is defined as follows:

D = {d_1, d_2, …, d_n}
D_topic = {d ∈ D | freq_d(topic) ≥ 1}

tf_topic,d = freq_d(topic)
idf_topic = |D| / |D_topic|
tfidf_topic,d = tf_topic,d · idf_topic

where D is the set of researchers and freq_d(topic) is the frequency of the topic for researcher d.

The outcome of the whole process, after extraction and relevance scoring, is a ranked list of zero or more topics for each researcher for which we have access to publicly available scientific publications through Google Scholar.

2 Observe that ‘lexical choice system’ is a topic of relevance to NLP in natural language generation.
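The relevance scoring can be sketched as follows (an illustrative implementation, assuming raw topic counts for TF and the simple ratio |D| / |D_topic| for IDF, without a logarithm; function and variable names are our own):

```python
from collections import Counter

def topic_tfidf(topics_per_researcher):
    """topics_per_researcher: dict mapping researcher -> list of extracted
    topic strings (one entry per occurrence).
    Returns dict mapping researcher -> {topic: tfidf score}."""
    n = len(topics_per_researcher)  # |D|, the number of researchers
    # |D_topic|: number of researchers for whom a topic occurs at least once
    df = Counter()
    for topics in topics_per_researcher.values():
        df.update(set(topics))
    scores = {}
    for researcher, topics in topics_per_researcher.items():
        tf = Counter(topics)  # freq_d(topic)
        scores[researcher] = {t: tf[t] * (n / df[t]) for t in tf}
    return scores

data = {
    "researcher_a": ["ontology learning", "ontology learning", "semantic tagging"],
    "researcher_b": ["semantic tagging"],
}
print(topic_tfidf(data)["researcher_a"])
# → {'ontology learning': 4.0, 'semantic tagging': 1.0}
```

Sorting each researcher's score dictionary by value then yields the ranked topic list described above.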