Component Search and Reuse: An Ontology-based Approach Awny Alnusair and Tian Zhao Department of Computer Science University of Wisconsin-Milwaukee, USA falnusair, tzhao@uwm. edu Abstract cation domain, and to represent the relationships that hold among these concepts. Due to their solid formal and rea- In order to realize the full potential of software reuse, soning foundation, ontologies can play an important role effective search techniques are indeed essential. In this pa- in domain engineering reuse; they can be used to structure per, we propose a semantic-based approach for identifying and build a source-code knowledge base that can be used by Ind retrieving relevant components from a reuse repository. software agents(e. g, search engine) and certainly serve as This approach relies on building a knowledge base accord- a basis for semantic queries [ 9]. Therefore, ontologies have ing to an ontology model that includes a source-code on- been successfully used by various researchers to improve many aspects of the software engineering processes [12] ogy. Due to the indexing and knowledge population mecha Towards this end, we have developed an ontology model nisms we used, our proof-of-concept supports various kinds that includes an enhanced software representation ontol of search techniques. However, our experiments show evi- This ontology is further extended with additional dence that only pure semantic search that exploits domain component-specific knowledge and automatically popu- nowledge tends to improve precision Based on a usability lated with ontological instances representing various pro- case study, we argue that semantic search is indeed usable gram elements of a given library. These instances are fur ther annotated with respect to concepts from a domain specific ontology Ontology-based component search is thus performed by 1. Introduction the semantic matching of user requests expressed using terms from the domain ontology with component descrip Systematic software reuse enables developers to use ex tions found in the populated knowledge-base. Furthermore isting software components for constructing new quality the knowledge po ation oftware systems. Thus, speeding up the development pro- in our approach still allows searching the knowledge base cess and reducing costs and risks. Failure modes analysis using the most familiar keyword search. This is particularly useful when domain ontologies or semantic component an- of the reuse process shows that in order to be reused, a soft- notations are lacking or incomplete. Users are thus able to ware component must be findable and certainly understand able 51. On one hand, understanding the component's func- keyword queries, type-based queries or a mixture of all tionality as well as its relationships with other components in a reuse library is usually hampered due to the lack of quality descriptions of library services. On the other hand, 2. Ontology model for component reuse finding suitable components is still a significant barrier for exploiting systematic software reuse At the core of our ontology model for object-oriented In this paper, we present an approach for describing, component retrieval is a Source Code Representation On retrieving, and exploring the various relationships among tology(referred to afterwards as SCRO). This ontology pro- software components in object-oriented reuse libraries. In vides a base model for capturing the relationships and de order to provide a formal and precise representation of li- pendencies among source-code artifacts. It models major brary code, our approach relies on ontologizing software concepts and features of object-oriented programs, includ- ing encapsulation, class and interface inheritance, met Ontologies provide means to explicitly describe con- overloading, method overriding, and method signature in- pts, objects, properties and other entities in a given appli- formation. SCRO,'s knowledge is represented using the
Component Search and Reuse: An Ontology-based Approach Awny Alnusair and Tian Zhao Department of Computer Science University of Wisconsin-Milwaukee, USA {alnusair, tzhao}@uwm.edu Abstract In order to realize the full potential of software reuse, effective search techniques are indeed essential. In this paper, we propose a semantic-based approach for identifying and retrieving relevant components from a reuse repository. This approach relies on building a knowledge base according to an ontology model that includes a source-code ontology, a component ontology, and a domain-specific ontology. Due to the indexing and knowledge population mechanisms we used, our proof-of-concept supports various kinds of search techniques. However, our experiments show evidence that only pure semantic search that exploits domain knowledge tends to improve precision. Based on a usability case study, we argue that semantic search is indeed usable and practical. 1. Introduction Systematic software reuse enables developers to use existing software components for constructing new quality software systems. Thus, speeding up the development process and reducing costs and risks. Failure modes analysis of the reuse process shows that in order to be reused, a software component must be findable and certainly understandable [5]. On one hand, understanding the component’s functionality as well as its relationships with other components in a reuse library is usually hampered due to the lack of quality descriptions of library services. On the other hand, finding suitable components is still a significant barrier for exploiting systematic software reuse. In this paper, we present an approach for describing, retrieving, and exploring the various relationships among software components in object-oriented reuse libraries. In order to provide a formal and precise representation of library code, our approach relies on ontologizing software knowledge. Ontologies provide means to explicitly describe concepts, objects, properties and other entities in a given application domain, and to represent the relationships that hold among these concepts. Due to their solid formal and reasoning foundation, ontologies can play an important role in domain engineering reuse; they can be used to structure and build a source-code knowledge base that can be used by software agents (e.g., search engine) and certainly serve as a basis for semantic queries [9]. Therefore, ontologies have been successfully used by various researchers to improve many aspects of the software engineering processes [12]. Towards this end, we have developed an ontology model that includes an enhanced software representation ontology. This ontology is further extended with additional component-specific knowledge and automatically populated with ontological instances representing various program elements of a given library. These instances are further annotated with respect to concepts from a domainspecific ontology. Ontology-based component search is thus performed by the semantic matching of user requests expressed using terms from the domain ontology with component descriptions found in the populated knowledge-base. Furthermore, the knowledge population and indexing mechanisms used in our approach still allows searching the knowledge base using the most familiar keyword search. This is particularly useful when domain ontologies or semantic component annotations are lacking or incomplete. Users are thus able to retrieve components using purely semantic-based queries, keyword queries, type-based queries or a mixture of all. 2. Ontology model for component reuse At the core of our ontology model for object-oriented component retrieval is a Source Code Representation Ontology (referred to afterwards as SCRO). This ontology provides a base model for capturing the relationships and dependencies among source-code artifacts. It models major concepts and features of object-oriented programs, including encapsulation, class and interface inheritance, method overloading, method overriding, and method signature information. SCRO’s knowledge is represented using the
OWL-DL ontology language. OWL is a web-based con- components in a given domain. Therefore, we extends ceptual modelling language used for capturing relationship SCROs semantic representations of API structures and en- semantics among domain concepts, OWL-DL is a subset rich it with additional component-specific descriptions re of oWL based on Description Logic(DL) and has desir- quired to uniquely identify and retrieve an API compe able computational properties for automated reasoning sys- nent. The result is a COMPonent REpresentation ontol- tems. OWL-DLs reasoning support enables inferring addi- ogy(referred to afterwards as COMPRE). In addition to tional knowledge and computing the classification hierar- the concepts, axioms, and properties inherited from SCRO, chy(subsumption reasoning). SCRO defines various OWL COMPRe defines its own class hierarchy and relations concepts that map directly to source-code elements and col- for semantic component descriptions. For instance, the lectively represent the most important concepts found in Component concept represents software components in object-oriented programs. Furthermore, SCRO defines var- general and subsumes other OWL classes that represents ious OWL object properties, datatype properties, and onto- Java-specific components such as Method, classType logical axioms to represent the various relationships among and Interface Type. A fragment of the ontology tax ontological concepts. SCRO is precise, well documented, onomy is shown on the left pane of Figure l and the com and designed with ontology reuse in mind. The availability plete ontology can be examined online [1 of SCRO online [1], allows researchers to reuse or extend its Moreover, comPre defines various ontological ax- representational components to support any semantic-based ioms and object properties that represent relationships application that requires source-code knowledge among software components. These properties link com- ponents with their corresponding semantic descriptions 2.l. Domain-specific ontology specified in the domain-specific ontology. For example, Domain ontologies describe concepts and structures re- corresponding inverse properties are used to annotate in- lated to a particular domain(e.g. finance, shopping, dividual software components with domain terms repre medicine, graphics, or the object-oriented programs domain senting the expected inputs and outputs of the component as specified in SCRO). In our approach to component re- Moreover, COMPRE defines the dependson symmetric trieval, we use a domain ontology that conceptualizes each object property that defines dependency relationship be- software library we need to reuse. This ontology provides tween components. describedBy is also defined to li a common vocabulary with unambiguous and conceptually a component to a domain concept that best describes the sound terms that can be used to annotate software compo- purpose or the nature of the component nents. Annotations in this context serves two key purposes In addition to pure semantic search that is based on Firstly, both software providers and users can communicate using a shared and common vocabulary provided by this component annotations with respect to a domain onto ogy, COMPRE also defines various datatype properties to ontology. Thus enabling a precise retrieval of API compo- model other metadata about components. These proper nents. Secondly, it brings difterent perspectives to typical ties are provided to enable metadata keyword queries that program comprehension tasks. Users can familiarize them- can be used when semantic annotations are lacking or in- selves with terminology and conceptual knowledge that is complete. For instance, the hasInput Terms and the typically implicit in the problem domain. To this end, we have developed a mini-ontology for data words describing the component's input and its expected retrieval in the Semantic Web applications domain. This on tology (referred to afterwards as Swonto) has been used outpu during the evaluation of our approach(cf. Section 4)and serves as a proof of concept that component search can be 23. Knowledge population significantly enhanced through the use of domain ontolo- gies. A small fragment of the ontology's taxonomy is shown on the upper right pane of Figure I and the complete onto- to populate the knowledge base with ontological instances ogy is found online [1] that represent various ontological concepts and their corre- 2.2. Component-specific ontology sponding relationships. Therefore, we have built a knowl- edge extractor subsystem for the Java programming lan In the context of component retrieval, we certainly need guage. Our subsystem performs a comprehensive parsing of profound semantic description of the components in- the Java bytecode and captures every ontology concept that represents a source code element and generates instances ner working structure and its interrelationships with other of all ontological properties defined in our ontologies for http://www.w3.org/2004/owl/ those program elements. The generated semantic instances
OWL-DL1 ontology language. OWL is a web-based conceptual modelling language used for capturing relationship semantics among domain concepts, OWL-DL is a subset of OWL based on Description Logic (DL) and has desirable computational properties for automated reasoning systems. OWL-DLs reasoning support enables inferring additional knowledge and computing the classification hierarchy (subsumption reasoning). SCRO defines various OWL concepts that map directly to source-code elements and collectively represent the most important concepts found in object-oriented programs. Furthermore, SCRO defines various OWL object properties, datatype properties, and ontological axioms to represent the various relationships among ontological concepts. SCRO is precise, well documented, and designed with ontology reuse in mind. The availability of SCRO online [1], allows researchers to reuse or extend its representational components to support any semantic-based application that requires source-code knowledge. 2.1. Domain-specific ontology Domain ontologies describe concepts and structures related to a particular domain (e.g. finance, shopping, medicine, graphics, or the object-oriented programs domain as specified in SCRO). In our approach to component retrieval, we use a domain ontology that conceptualizes each software library we need to reuse. This ontology provides a common vocabulary with unambiguous and conceptually sound terms that can be used to annotate software components. Annotations in this context serves two key purposes. Firstly, both software providers and users can communicate using a shared and common vocabulary provided by this ontology. Thus enabling a precise retrieval of API components. Secondly, it brings different perspectives to typical program comprehension tasks. Users can familiarize themselves with terminology and conceptual knowledge that is typically implicit in the problem domain. To this end, we have developed a mini-ontology for data retrieval in the Semantic Web applications domain. This ontology (referred to afterwards as SWONTO) has been used during the evaluation of our approach (cf. Section 4) and serves as a proof of concept that component search can be significantly enhanced through the use of domain ontologies. A small fragment of the ontology’s taxonomy is shown on the upper right pane of Figure 1 and the complete ontology is found online [1]. 2.2. Component-specific ontology In the context of component retrieval, we certainly need a profound semantic description of the component’s inner working structure and its interrelationships with other 1http://www.w3.org/2004/OWL/ components in a given domain. Therefore, we extends SCROs semantic representations of API structures and enrich it with additional component-specific descriptions required to uniquely identify and retrieve an API component. The result is a COMPonent REpresentation ontology (referred to afterwards as COMPRE). In addition to the concepts, axioms, and properties inherited from SCRO, COMPRE defines its own class hierarchy and relations for semantic component descriptions. For instance, the Component concept represents software components in general and subsumes other OWL classes that represents Java-specific components such as Method, ClassType, and InterfaceType. A fragment of the ontologys taxonomy is shown on the left pane of Figure 1 and the complete ontology can be examined online [1]. Moreover, COMPRE defines various ontological axioms and object properties that represent relationships among software components. These properties link components with their corresponding semantic descriptions specified in the domain-specific ontology. For example, hasDomainInput and hasDomainOutput and their corresponding inverse properties are used to annotate individual software components with domain terms representing the expected inputs and outputs of the component. Moreover, COMPRE defines the dependsOn symmetric object property that defines dependency relationship between components. describedBy is also defined to link a component to a domain concept that best describes the purpose or the nature of the component. In addition to pure semantic search that is based on component annotations with respect to a domain ontology, COMPRE also defines various datatype properties to model other metadata about components. These properties are provided to enable metadata keyword queries that can be used when semantic annotations are lacking or incomplete. For instance, the hasInputTerms and the hasOutputTerms are used to assign meaningful keywords describing the component’s input and its expected output. 2.3. Knowledge population Once the ontology structure is specified, one next needs to populate the knowledge base with ontological instances that represent various ontological concepts and their corresponding relationships. Therefore, we have built a knowledge extractor subsystem for the Java programming language. Our subsystem performs a comprehensive parsing of the Java bytecode and captures every ontology concept that represents a source code element and generates instances of all ontological properties defined in our ontologies for those program elements. The generated semantic instances 2
are serialized using RDF. RDF is web-based language few terms describing its purpose. The third triple, however, suitable for describing resources and provides an extensible tags the same input parameter with a meaningful concept data model for representing machine-processable semantics (QueryText)from the domain ontology. Thus, giving the of data. For each application framework parsed, we thus parameter an agreed-upon and meaningful description other generate an RDF ontology that represents the instantiated than terms or the semantically vague String type. knowledge base for the framework at hand. This knowledge base is managed by Jena [8], an open source Java frame 1. create[.] scro: hasInput Type String work for building Semantic Web applications 2. create[.] compre: hasInputTerms The process of generating semantic instances for the query string concepts and relations specified in SCRO is completely au- 3. create[.] compre: hasDomainInput However, the process of annotating componen according to COMPRE's object properties is currently man- ual as it is the case for semantic annotations in general. ebase http Ourtoolthoughprovidesmeansforinsertingtheseannota-prefixscro:<http:.../ontologies/scro.owl#> tionsdirectlyintotheknowledge-base,thusgraduallybuild-prefixcomprechttp:../ontologies/compre.owl4> ing semantic descriptions for a particular API that can be PREFIx swonto: <htt /ontologies/swonto owl# PREFIx dc <http://purl.org/dc/elements/1.1/ shared, evolved, and reused by a community of users. On the other hand, metatdata modelled by COMPRE's datatype<+QueryFactory create[string, String, Syntax]> properties is generated automatically via direct parsing of a scro: staticMethod the source-code. We thus capture and normalize method scro: hasInputType <tstring> i scro: hasInput Type <tstring> signatures, identifier names, source-code comments, and scro: hasInput Type <#Syntax> available Java annotations in order to obtain a meaningful utputType <#Query> keyword descriptions of components. These descriptions scro: invokesMethod <#parse [Query, string, Syntax]> are lexically analyzed, stored, and indexed using the to- scro: hassignature create[String, string, Syntax] kenization and indexing mechanisms provided by Apache ompre: describedBy I a swonto: QueryCreation]i Lucene, an open-source full-featured text search engine compre: hasDomainInput a swonto: QueryText In the next section, we show how the knowledge gener- compre: hasDomainInput a swonto: URIl ed using this knowledge extractor sub-system can be used compre: hasDomainInput swonto: QueryLanguage Sy for component search. For an extended discussion of our hasDomainoutput ontologies, complete knowledge population samples, we re fer the reader to our ontologies website [1] compre: hasInputrermsquery string .."i ompre: hasInputTerms query syntax URI 3. ontological search compre: hasoutput Terms " query dc: description "create query Listing I shows a partial RDF description obtained luring the knowledge population phase for a Jena API Listing 1. RDF descriptor for an APl method method, the create method. This method belongs to the jueryFactory class and usually used to create a Query This multi-faceted description of components enables object given the specified input. This rdF descriptioN clearly captures the component's metadata at the semantic a) type or signature-based queries; b)metadata keyword queries; c) pure semantic-based queries; or d) blended The underlying data structure of RDF is a labeled di rected graph. Each node-arc-node in this graph represents a queries of the previous three types However, we focus the discussion on pure semantic- triple that consists of three parts, subject, predicate and ob- based queries that rely on domain-specific knowledge ject. Consider Listing I for example, the described method Primarily, search techniques that rely on variations of is always the subject, onto- keyword-based search suffer from synonymity and poly- ogy properties are predicates, and objects are either a re- semic ambiguity that often lead to low recall and precision source, unlabeled node(blank node)or a literal value. For On the other hand, signature matching techniques cannot example, the first triple below uses a property from scro distinguish between components that have the same signa to assert that the method has an input parameter of type ture but serve different purposes, e.g ng Jena api to String. The second triple associates this parameter with create a new query vs read a query from a file. In http://www.w3.org/tr/rdf-primer semantic search, however, these limitations are completely 3http://lucene.apache.org/ dealt with since the semantics of each of the types in si
are serialized using RDF 2 . RDF is web-based language suitable for describing resources and provides an extensible data model for representing machine-processable semantics of data. For each application framework parsed, we thus generate an RDF ontology that represents the instantiated knowledge base for the framework at hand. This knowledge base is managed by Jena [8], an open source Java framework for building Semantic Web applications. The process of generating semantic instances for the concepts and relations specified in SCRO is completely automatic. However, the process of annotating components according to COMPRE’s object properties is currently manual as it is the case for semantic annotations in general. Our tool though provides means for inserting these annotations directly into the knowledge-base, thus gradually building semantic descriptions for a particular API that can be shared, evolved, and reused by a community of users. On the other hand, metatdata modelled by COMPRE’s datatype properties is generated automatically via direct parsing of the source-code. We thus capture and normalize method signatures, identifier names, source-code comments, and available Java annotations in order to obtain a meaningful keyword descriptions of components. These descriptions are lexically analyzed, stored, and indexed using the tokenization and indexing mechanisms provided by Apache Lucene 3 , an open-source full-featured text search engine. In the next section, we show how the knowledge generated using this knowledge extractor sub-system can be used for component search. For an extended discussion of our ontologies, complete knowledge population samples, we refer the reader to our ontologies website [1]. 3. Ontological search Listing 1 shows a partial RDF description obtained during the knowledge population phase for a Jena API method, the create method. This method belongs to the QueryFactory class and usually used to create a Query object given the specified input. This RDF description clearly captures the component’s metadata at the semantic and syntactic level. The underlying data structure of RDF is a labeled directed graph. Each node-arc-node in this graph represents a triple that consists of three parts, subject, predicate and object. Consider Listing 1 for example, the described method in this snippet, create[..], is always the subject, ontology properties are predicates, and objects are either a resource, unlabeled node (blank node) or a literal value. For example, the first triple below uses a property from SCRO to assert that the method has an input parameter of type String. The second triple associates this parameter with 2http://www.w3.org/TR/rdf-primer 3http://lucene.apache.org/ few terms describing its purpose. The third triple, however, tags the same input parameter with a meaningful concept (QueryText) from the domain ontology. Thus, giving the parameter an agreed-upon and meaningful description other than terms or the semantically vague String type. 1. create[..] scro:hasInputType String 2. create[..] compre:hasInputTerms "query string" 3. create[..] compre:hasDomainInput [ a swonto:QueryText] @base <http:.../ontologies/kb.n3> PREFIX scro: <http:.../ontologies/scro.owl#> PREFIX compre: <http:.../ontologies/compre.owl#> PREFIX swonto: <http:.../ontologies/swonto.owl#> PREFIX dc: <http://purl.org/dc/elements/1.1/> <#QueryFactory.create[String,String,Syntax]> a scro:StaticMethod ; scro:hasInputType <#String> ; scro:hasInputType <#String> ; scro:hasInputType <#Syntax> ; scro:hasOutputType <#Query> ; scro:invokesMethod <#parse[Query,String,Syntax]>; scro:hasSignature "create[String,String,Syntax]"; compre:describedBy [ a swonto:QueryCreation]; compre:hasDomainInput [ a swonto:QueryText]; compre:hasDomainInput [ a swonto:URI]; compre:hasDomainInput [ a swonto:QueryLanguageSyntax]; compre:hasDomainOutput[ a swonto:ExtendedQuery]; compre:hasInputTerms "query string ..."; compre:hasInputTerms "base URI ..."; compre:hasInputTerms "query syntax URI ..."; compre:hasOutputTerms "query ..."; dc:description "create query ..."; .... Listing 1. RDF descriptor for an API method This multi-faceted description of components enables four different types of queries against the knowledge base: a) type or signature-based queries; b) metadata keyword queries; c) pure semantic-based queries; or d) blended queries of the previous three types. However, we focus the discussion on pure semanticbased queries that rely on domain-specific knowledge. Primarily, search techniques that rely on variations of keyword-based search suffer from synonymity and polysemic ambiguity that often lead to low recall and precision. On the other hand, signature matching techniques cannot distinguish between components that have the same signature but serve different purposes, e.g. using Jena API to create a new query vs read a query from a file. In semantic search, however, these limitations are completely dealt with since the semantics of each of the types in sig- 3
natures are encoded and processed during search. Besides are used to create, read or even parse a semantic addressing knowledge representation effectively, semantic query. Nevertheless, this search mechanism is flexible since search offers extensible solutions to component retrieval. it allows a wide range of queries to run against the knowl- Since we are focusing on API usage and reuse, the de- edge base. In fact, the expressive power provided by our scriptions shown in Listing I capture the component's inter- ontologies allows users to express their queries in more de- face and its relationships with other components. However, tails than would otherwise be expressed with any altena these description can be easily extended to capture other tive method. For instance, assume that the user was able to facets(e.g, component's environment) via introducing ad- obtain a Query object as described in the previous exam- ditional ontological properties. ple. The next natural step is to find a component that can Reasoning is one of the primary added benefits in seman- take this query as input, execute it, and return the required tic search. In addition of classifying and checking the con- results. Browsing the Jena API looking for such a compo- sistency of our ontologies, a DL reasoner can also be used nent or even querying using typical keyword-based queries to inferring and thus enriching the knowledge base with ad- would not return an answer since there is an intermediate ditional knowledge that is not explicitly stated. Thus, play- query execution object that must be obtained to complete ing a vital role in improving search precision and recall in the task. This appears to be a dead end. However, using comparison with other search techniques. DL Subsumption semantic search, this request can be expressed fairly easily reasoning, for example, is typically used to establish sub- as follows set inclusion relationships between different concepts an the ontology. Consider the descriptions Query E compre: Component n Listing I for example, when pure semantic-based queries (compre: has DomainInput are used, users need only to provide domain concepts de- swonto: SemanticQuery) n scribing the components interface. Therefore, if the user Compre: has DomainOutput 3 compre provides SemanticQuery as a domain output of the re- has DomainOutput swonto: Result Set quested component, the method shown in the listing would This query expresses the fact that we are looking for a com- still match this request since SemanticQuery suDs ponent that takes a semantic query as inpu returns an- ExtendedQuery as specified in the subsumption hierar- oth chy of our domain ontology. Thus, automatically enablin other component that returns a query solution. Thus, query ing for multiple components at the sam an implicit form of query expansion It is notable that semantic-based retrieval alleviates many 3. 1. Implementation and ranking mechanisms problems typically faced by tools that rely on exact key word or type matching. One of the strengths of our ap- proach, however, is the ability to utilize our various ontolo- We have implemented this approach in a tool called gies in order to perform blended search against the knowl- CompRE, conveniently named after the main ontology in edge base. In particular, this is helpful when components in our modeL. CompRE is deployed as a plug-in for the the knowledge base are not completely annotated or when clipse Integrated Development Environment (IDE). Fi users are still in the process of becoming familiar with ure I shows a snapshot of CompREs main views in the the ontology. Consider for example a user who wishes Eclipse workbench. When loaded for the first time, Com- pRE processes the library code and the component ontology type(SemanticQuery)and one of the actual input types in order to generate the initial knowledge base as described ( Syntax) are known. Furthermore, since the user is not about the other input types, she wants to provide a fer ficient(it only took 4.5 seconds for par terms to filter out the results. This request can be expressed the Jena framework). CompRE also includes a module that using the following query in DL-like syntax: allow users to tag components with semantic references that corresponds to concepts from the domain ontology. These Query E compre: Component n dra (compre: hasDomain Output tured by the storage module, and stored automatically in swonto: SemanticQuery)n knowledge base Upon the conclusion of the knowledge Escro: hasInputType kb: Syntax)n population process, a knowledge repository is created and (compre: hasInputTerms value"base uri) becomes ready for answering user requests CompRe provides two separate views for formulating As expected, executing this query returns not only the queries. The first view is provided as a simple data entry method shown in Listing I but also other unrelated meth- form as shown in the figure. In each entry box, users need ods. It turns out that the input terms specified in the query to provide search restrictions that are either prefixed with are very popular and are used to describe API methods that an ontology name or provided as plain keywords enclosed
natures are encoded and processed during search. Besides addressing knowledge representation effectively, semantic search offers extensible solutions to component retrieval. Since we are focusing on API usage and reuse, the descriptions shown in Listing 1 capture the component’s interface and its relationships with other components. However, these description can be easily extended to capture other facets (e.g., component’s environment) via introducing additional ontological properties. Reasoning is one of the primary added benefits in semantic search. In addition of classifying and checking the consistency of our ontologies, a DL reasoner can also be used to inferring and thus enriching the knowledge base with additional knowledge that is not explicitly stated. Thus, playing a vital role in improving search precision and recall in comparison with other search techniques. DL Subsumption reasoning, for example, is typically used to establish subset inclusion relationships between different concepts and properties in the ontology. Consider the descriptions in Listing 1 for example, when pure semantic-based queries are used, users need only to provide domain concepts describing the component’s interface. Therefore, if the user provides SemanticQuery as a domain output of the requested component, the method shown in the listing would still match this request since SemanticQuery subsumes ExtendedQuery as specified in the subsumption hierarchy of our domain ontology. Thus, automatically enabling an implicit form of query expansion. It is notable that semantic-based retrieval alleviates many problems typically faced by tools that rely on exact keyword or type matching. One of the strengths of our approach, however, is the ability to utilize our various ontologies in order to perform blended search against the knowledge base. In particular, this is helpful when components in the knowledge base are not completely annotated or when users are still in the process of becoming familiar with the ontology. Consider for example a user who wishes to find a component in which the component’s domain output type (SemanticQuery) and one of the actual input types (Syntax) are known. Furthermore, since the user is not sure about the other input types, she wants to provide a few terms to filter out the results. This request can be expressed using the following query in DL-like syntax: Query ≡ compre : Component u (∃compre : hasDomainOutput . swonto : SemanticQuery) u (∃scro : hasInputT ype . kb : Syntax) u (∃compre : hasInputT erms value 00base uri00) As expected, executing this query returns not only the method shown in Listing 1 but also other unrelated methods. It turns out that the input terms specified in the query are very popular and are used to describe API methods that are used to create, read or even parse a semantic query. Nevertheless, this search mechanism is flexible since it allows a wide range of queries to run against the knowledge base. In fact, the expressive power provided by our ontologies allows users to express their queries in more details than would otherwise be expressed with any alternative method. For instance, assume that the user was able to obtain a Query object as described in the previous example. The next natural step is to find a component that can take this query as input, execute it, and return the required results. Browsing the Jena API looking for such a component or even querying using typical keyword-based queries would not return an answer since there is an intermediate query execution object that must be obtained to complete the task. This appears to be a dead end. However, using semantic search, this request can be expressed fairly easily as follows: Query ≡ compre : Component u (∃compre : hasDomainInput . swonto : SemanticQuery) u (∃compre : hasDomainOutput . ∃ compre : hasDomainOutput . swonto : ResultSet) This query expresses the fact that we are looking for a component that takes a semantic query as input and returns another component that returns a query solution. Thus, querying for multiple components at the same time. 3.1. Implementation and ranking mechanisms We have implemented this approach in a tool called CompRE, conveniently named after the main ontology in our model. CompRE is deployed as a plug-in for the Eclipse Integrated Development Environment (IDE). Figure 1 shows a snapshot of CompRE’s main views in the Eclipse workbench. When loaded for the first time, CompRE processes the library code and the component ontology in order to generate the initial knowledge base as described in Section 2.3. This process is completely automatic and ef- ficient (it only took 4.5 seconds for parsing and processing the Jena framework). CompRE also includes a module that allow users to tag components with semantic references that corresponds to concepts from the domain ontology. These annotations entered via drag and drop mechanisms, captured by the storage module, and stored automatically in the knowledge base. Upon the conclusion of the knowledge population process, a knowledge repository is created and becomes ready for answering user requests. CompRE provides two separate views for formulating queries. The first view is provided as a simple data entry form as shown in the figure. In each entry box, users need to provide search restrictions that are either prefixed with an ontology name or provided as plain keywords enclosed 4
Edit Source Refactor Navigate Search Project Run s CompRE Window Help COMPRE C Domain ontology 23 日 L thing o-C queryLanguage Syntax A 曰 Component L ClassType pu工⊥。工往 ss Retr1 eveselectec [ puB工1Resu⊥ LaSer execute 区 arrw-OurrwFer e EnerFaceType|‖(Tat( E Component Search 3S a界 ARQL Query 由@ AnnotationTyp Enter your search criteria.. +base +uri swonto SemanticQuery +C InstanceMetho Dependencies: Start Over ontrolstructure L Repetitionstrucy Figure 1. CompRE: showing the component ontology, domain ontology, and the main search view within quotes. As described in the previous section, using prove this initial ranking, we refine this initial order based the compre-kb prefix tells the system that this is in fact an on suitability measures that consider the current user con- actual API type specified in the knowledge base. However, text. We thus parse the code that is currently being de- the swonto prefix refers to a concept from the currently veloped and create a profile that includes all visible types active domain ontology. Since the domain ontology ca that are either declared by the programmer or inherited be different for each APl, its name is provided as an exter- the user's context. We further analyze each retrieved can- nal configuration parameter. Free-text requirements in each didates signature in terms of the new input types that this query may optionally utilize all fuzzy extensions supported candidate will introduce into the current context if selected by the lucene's query parser, thus allowing a full-featured by the user. Naturally, candidates that introduce more types keyword-based search. Once the form is filled, CompRE should be assigned a lower rank value. However, finding the collects the search requirements and automatically gener- newly introduced type in the context profile, will not count ate a query using the SPARQL 4 query language, it then against this candidates score. executes this query against the knowledge base. CompRE With the absence of keywords in user queries, we ap- also provides a query answering view for advanced users ply only context-based heuristics such that candidates with who wish to edit their own SPARQL queries directly, there- exact matches are put at the top of the list while other can- ore, gaining full control over various aspects of the compo- didates are ranked based on the number and type of their nent ontology. For instance, users may wish to specify that input and output types. For example, consider a user who is the desired component extends a particular compone trying to search for a component that requires two particular or perhaps usedBy a certain number of components as a input types, namely ll and 12. Assume that the repository measure of its popularity in the target library. Regardless contains three components, namely Cl, C2, and C3. Lets of the data entry mechanism used, CompRE executes the also assume that Cl is an exact match, C2 has only one in- query, ranks the retrieved instances, and presents the result put type(l1), and C3 requires three input types (ll, 12, and n a viewer that enables further exploration of each recom- 13). The system is then ranks Cl first, C2 is ranked second mended component and c3 is in fact the least desired since it will introduce a Ranking the retrieved candidates according to their rele- new type to the user context, it is thus included in the result ncy to the user needs saves time and efforts. While al lowing blended queries in our approach ensures flexibil- are simple, easy to implement, and work surprisingly well ity and robustness, it however complicates the ranking pro- cess. When blended or pure syntactic queries are submitted, 4. Experiments and results we initially rely on the traditional, however solidly proven, scoring mechanisms supported by Lucene. In order to im- Due to the lack of independent and standard benchmark test data, search tools evaluation is, to some degree, sub- 4http://www.w3.org/tr/rdf-sparQ-qUery jective. However, we designed our experi
Figure 1. CompRE: showing the component ontology, domain ontology, and the main search view within quotes. As described in the previous section, using the compre-kb prefix tells the system that this is in fact an actual API type specified in the knowledge base. However, the swonto prefix refers to a concept from the currently active domain ontology. Since the domain ontology can be different for each API, its name is provided as an external configuration parameter. Free-text requirements in each query may optionally utilize all fuzzy extensions supported by the Lucene’s query parser, thus allowing a full-featured keyword-based search. Once the form is filled, CompRE collects the search requirements and automatically generate a query using the SPARQL 4 query language, it then executes this query against the knowledge base. CompRE also provides a query answering view for advanced users who wish to edit their own SPARQL queries directly, therefore, gaining full control over various aspects of the component ontology. For instance, users may wish to specify that the desired component extends a particular component or perhaps usedBy a certain number of components as a measure of its popularity in the target library. Regardless of the data entry mechanism used, CompRE executes the query, ranks the retrieved instances, and presents the result in a viewer that enables further exploration of each recommended component. Ranking the retrieved candidates according to their relevancy to the user needs saves time and efforts. While allowing blended queries in our approach ensures flexibility and robustness, it however complicates the ranking process. When blended or pure syntactic queries are submitted, we initially rely on the traditional, however solidly proven, scoring mechanisms supported by Lucene. In order to im- 4http://www.w3.org/TR/rdf-sparql-query/ prove this initial ranking, we refine this initial order based on suitability measures that consider the current user context. We thus parse the code that is currently being developed and create a profile that includes all visible types that are either declared by the programmer or inherited in the user’s context. We further analyze each retrieved candidate’s signature in terms of the new input types that this candidate will introduce into the current context if selected by the user. Naturally, candidates that introduce more types should be assigned a lower rank value. However, finding the newly introduced type in the context profile, will not count against this candidate’s score. With the absence of keywords in user queries, we apply only context-based heuristics such that candidates with exact matches are put at the top of the list while other candidates are ranked based on the number and type of their input and output types. For example, consider a user who is trying to search for a component that requires two particular input types, namely I1 and I2. Assume that the repository contains three components, namely C1, C2, and C3. Lets also assume that C1 is an exact match, C2 has only one input type (I1), and C3 requires three input types (I1, I2, and I3). The system is then ranks C1 first, C2 is ranked second, and C3 is in fact the least desired since it will introduce a new type to the user context, it is thus included in the result set to improve recall, however, ranked last. These heuristics are simple, easy to implement, and work surprisingly well. 4. Experiments and results Due to the lack of independent and standard benchmark test data, search tools evaluation is, to some degree, subjective. However, we designed our experiments such that it 5