Topic Extraction from Scientific Literature for Competency Management

Paul Buitelaar, Thomas Eigner
DFKI GmbH, Language Technology Lab & Competence Center Semantic Web
Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
paulb@dfki.de

Abstract

We describe an approach towards automatic, dynamic and time-critical support for competency management and expertise search through topic extraction from scientific publications. In the use case we present, we focus on the automatic extraction of scientific topics and technologies from publicly available publications using web sites like Google Scholar. We discuss an experiment for our own organization, DFKI, as an example of a knowledge organization. The paper presents evaluation results over a sample of 48 DFKI researchers that responded to our request for a-posteriori evaluation of automatically extracted topics. The results of this evaluation are encouraging and provided us with useful feedback for further improving our methods. The extracted topics can be organized in an association network that can be used further to analyze how competencies are interconnected, thereby also enabling a better exchange of expertise and competence between researchers.

1 Introduction

Competency management, the identification and management of experts on and their knowledge in certain competency areas, is a growing area of research as knowledge has become a central factor in achieving commercial success. It is of fundamental importance for any organization to keep up-to-date with the competencies it covers, in the form of experts among its work force. Identification of experts will be based mostly on recruitment information, but this is not sufficient as competency coverage (competencies of interest to the organization) and structure (interconnections between competencies) change rapidly over time. The automatic identification of competency coverage and structure, e.g.
from publications, is therefore of increasing importance, as this allows for a sustainable, dynamic and time-critical approach to competency management.

In this paper we present a pattern-based approach to the extraction of competencies (scientific topics, technologies) in a knowledge-based research organization from publicly available scientific publications. The core assumption of our approach is that such topics will not occur in random fashion across documents, but instead occur only
in specific scientific discourse contexts that can be precisely defined and used as patterns for topic extraction.

The remainder of the paper is structured as follows. In section 2 we describe related work in competency management and argue for an approach based on natural language processing and ontology modeling. We describe our specific approach to topic extraction for competency management in detail in section 3. The paper then continues with the description of an experiment that we performed on topic extraction for competency management in our own organization, DFKI. Finally, we close with conclusions that can be drawn from our research and ideas for future work that arise from these.

2 Related Work

Competency management is a growing area of knowledge management that is concerned with the “identification of skills, knowledge, behaviors, and capabilities needed to meet current and future personnel selection needs, in alignment with the differentiations in strategies and organizational priorities.” [1] Our particular focus here is on aspects of competency management relating to the identification and management of knowledge about scientific topics and technologies, which lies at the basis of competency management.

Most of the work on competency management has been focused on the development of methods for the identification, modeling, and analysis of skills and skills gaps, and on training solutions to help remedy the latter. An important initial step in this process is the identification of skills and knowledge of interest, which is mostly done through interviews, surveys and manual analysis of existing competency models. Recently, ontology-based approaches have been proposed that aim at modeling the domain model of particular organization types (e.g. computer science, health-care) through formal ontologies, over which matchmaking services can be defined for bringing together skills and organization requirements (e.g. [2], [3]).
The development of formal ontologies for competency management is important, but there is an obvious need for automated methods in the construction and dynamic maintenance of such ontologies. Although some work has been done on developing automated methods for competency management through text and web mining (e.g. [4]), this is mostly restricted to the extraction of associative networks between people according to documents or other data they are associated with. Instead, for the purpose of automated and dynamic support of competency management, a richer analysis of competencies and semantic relations between them is needed, as can be extracted from text through natural language processing.

3 Approach

Our approach towards the automatic construction and dynamic maintenance of ontologies for competency management is based on the extraction of relevant competencies and semantic relations between them through a combination of linguistic patterns, statistical methods as used in information retrieval and machine learning, and background knowledge if available.

Central to the approach as discussed in this paper is the use of domain-specific linguistic patterns for the extraction of potentially relevant competencies, such as scientific topics and technologies, from publicly available scientific publications. In this text type, topics and technologies will occur in the context of cue phrases such as ‘developed a tool for XY’ or ‘worked on methods for YZ’, where XY and YZ are possibly relevant competencies that the authors of the scientific publication are or have been working on. Consider for instance the following excerpts from three scientific articles in chemistry:

…profile refinement method for nuclear and magnetic structures…
…continuum method for modeling surface tension…
…a screening method for the crystallization of macromolecules…

In all three cases a method is discussed for addressing a particular problem that can be interpreted as a competency topic: ‘nuclear and magnetic structures’, ‘modeling surface tension’, ‘crystallization of macromolecules’. The pattern that we can thus establish from these examples is as follows:

method for [TOPIC]

as in:

method for [nuclear and magnetic structures]
method for [modeling surface tension]
method for [(the) crystallization of macromolecules]

Other patterns that we manually identified in this way are:

approach for [TOPIC]
approaches for [TOPIC]
approach to [TOPIC]
approaches to [TOPIC]
methods for [TOPIC]
solutions for [TOPIC]
tools for [TOPIC]

We call these the ‘context patterns’, which as their name suggests provide the lexical context for the topic extraction. The topics themselves can be described by so-called ‘topic patterns’, which describe the linguistic structure of possibly relevant topics that can be found in the right context of the defined context patterns.
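As a minimal illustration of the context-pattern step (a hypothetical sketch, not the authors' implementation; the pattern list is taken from the paper, the function name and window handling are assumptions):

```python
import re

# Context patterns listed in the paper (plus the singular 'method for'
# from the introductory example).
CONTEXT_PATTERNS = [
    "approach for", "approaches for", "approach to", "approaches to",
    "method for", "methods for", "solutions for", "tools for",
]

def extract_topic_texts(text, window=10):
    """Return the 'topic texts': windows of up to `window` words to the
    right of each context-pattern match in `text`."""
    topic_texts = []
    for pattern in CONTEXT_PATTERNS:
        for match in re.finditer(r"\b" + re.escape(pattern) + r"\b",
                                 text, re.IGNORECASE):
            right_context = text[match.end():].split()
            topic_texts.append(" ".join(right_context[:window]))
    return topic_texts

sample = "We developed a screening method for the crystallization of macromolecules in solution."
print(extract_topic_texts(sample))
# → ['the crystallization of macromolecules in solution.']
```

Each extracted window is then handed to the part-of-speech tagger and topic-pattern matcher described next.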
Topic patterns are defined in terms of part-of-speech tags that indicate if a word is for instance a noun, verb, etc. For now, we define only one topic pattern that defines a topic as a noun (optional) followed by a sequence of zero or more adjectives followed by a sequence of one or more nouns. Using the part-of-speech tag set for English of the Penn Treebank [5], this can be defined formally as follows (JJ indicates an adjective, NN a noun, NNS a plural noun):

(.*?)((NN(S)? |JJ )*NN(S)?)

The objective of our approach is to automatically identify the most relevant topics for a given researcher in the organization under consideration. To this end we download all papers by this researcher through Google Scholar, run the context patterns over these papers and extract a window of 10 words to the right of each matching occurrence.

We call these extracted text segments the ‘topic text’, which may or may not contain a potentially relevant topic. To establish this, we first apply a part-of-speech tagger (TnT: [6]) to each text segment and subsequently run the defined topic pattern over the output of this. Consider for instance the following examples of context pattern, extracted topic text in its right context, part-of-speech tagged version 1 and matched topic pattern:

approach to
semantic tagging , using various corpora to derive relevant underspecified lexical
JJ NN , VBG JJ NN TO VB JJ JJ JJ
semantic tagging

solutions for
anaphoric expressions . Accordingly , the system consists of three major modules :
JJ NNS . RB , DT NN VBZ IN CD JJ NNS :
anaphoric expressions

tools for
ontology adaptation and for mapping different ontologies should be an
NN NN CC IN VBG JJ NNS MD VB DT
ontology adaptation

approach for
modeling similarity measures which tries to avoid the mentioned problems
JJ NN NNS WDT VBZ TO VB DT VBN NNS
modeling similarity measures

methods for
domain specific semantic lexicon construction that builds on the reuse
NN JJ JJ NN NN WDT VBZ IN DT NN
domain specific semantic lexicon construction

1 Clarification of the part-of-speech tags used: CC: conjunction; DT, WDT: determiner; IN: preposition; MD: modal verb; RB: adverb; TO: to; VB, VBG, VBP, VBN, VBZ: verb
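Since the tagger output is a sequence of (word, tag) pairs, the topic pattern can be applied as a scan over the tags. The following is a hypothetical re-implementation (not the authors' code) that finds the first maximal run of JJ/NN/NNS tags and trims it so that it ends in a noun, mirroring the regular expression above:

```python
# Sketch of the topic-pattern matcher: equivalent to matching
# (NN(S)? |JJ )*NN(S)? anywhere in the tag sequence.
TOPIC_TAGS = ("JJ", "NN", "NNS")

def match_topic(tagged):
    """tagged: list of (word, tag) pairs for one topic text.
    Returns the first maximal adjective/noun span that ends in a noun,
    or None if no such span exists."""
    i = 0
    while i < len(tagged):
        if tagged[i][1] in TOPIC_TAGS:
            # extend the run of candidate tags
            j = i
            while j < len(tagged) and tagged[j][1] in TOPIC_TAGS:
                j += 1
            # trim trailing adjectives so the span ends in NN/NNS
            k = j
            while k > i and tagged[k - 1][1] == "JJ":
                k -= 1
            if k > i:
                return " ".join(word for word, _ in tagged[i:k])
            i = j
        else:
            i += 1
    return None
```

For the first example above, `match_topic([("semantic", "JJ"), ("tagging", "NN"), (",", ","), ("using", "VBG")])` returns `"semantic tagging"`.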
As can be observed from the examples above, the topic to be extracted will mostly be found directly at the beginning of the topic text. However, in some cases the topic will be found only later on in the topic text, e.g. in the following examples 2:

approach to
be used in a lexical choice system , the model of
VB VBN IN DT JJ NN NN , DT NN IN
lexical choice system

approach for
introducing business process-oriented knowledge management , starting on the …
VBG NN JJ NN NN , VBG IN DT …
business process-oriented knowledge management

The topics that can be extracted in this way now need to be assigned a measure of relevance, for which we use the well-known TF/IDF score that is used in information retrieval to assign a weight to each index term relative to each document in the retrieval data set [7]. For our purposes we apply the same mechanism, but instead of assigning index terms to documents we assign extracted topics (i.e. ‘terms’) to individual researchers (i.e. ‘documents’) for which we downloaded and processed scientific publications. The TF/IDF measure we use for this is defined as follows:

D = {d_1, d_2, …, d_n}
D_topic = {d ∈ D | freq_d(topic) ≥ 1}

tf_topic,d = freq_d(topic)
idf_topic = |D| / |D_topic|
tfidf_topic,d = tf_topic,d · idf_topic

where D is the set of researchers and freq_d(topic) is the frequency of the topic for researcher d.

The outcome of the whole process, after extraction and relevance scoring, is a ranked list of zero or more topics for each researcher for which we have access to publicly available scientific publications through Google Scholar.

2 Observe that ‘lexical choice system’ is a topic of relevance to NLP in natural language generation.
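The relevance scoring can be sketched as follows (an illustrative implementation, assuming raw topic counts for TF and the simple ratio |D| / |D_topic| for IDF, without a logarithm; function and variable names are our own):

```python
from collections import Counter

def topic_tfidf(topics_per_researcher):
    """topics_per_researcher: dict mapping researcher -> list of extracted
    topic strings (one entry per occurrence).
    Returns dict mapping researcher -> {topic: tfidf score}."""
    n = len(topics_per_researcher)  # |D|, the number of researchers
    # |D_topic|: number of researchers for whom a topic occurs at least once
    df = Counter()
    for topics in topics_per_researcher.values():
        df.update(set(topics))
    scores = {}
    for researcher, topics in topics_per_researcher.items():
        tf = Counter(topics)  # freq_d(topic)
        scores[researcher] = {t: tf[t] * (n / df[t]) for t in tf}
    return scores

data = {
    "researcher_a": ["ontology learning", "ontology learning", "semantic tagging"],
    "researcher_b": ["semantic tagging"],
}
print(topic_tfidf(data)["researcher_a"])
# → {'ontology learning': 4.0, 'semantic tagging': 1.0}
```

Sorting each researcher's score dictionary by value then yields the ranked topic list described above.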