ConTag: A Semantic Tag Recommendation System

Benjamin Adrian1,2, Leo Sauermann2, Thomas Roth-Berghofer1,2

1 (Knowledge-Based Systems Group, Department of Computer Science, University of Kaiserslautern, P.O. Box 3049, 67653 Kaiserslautern)
2 (German Research Center for Artificial Intelligence DFKI GmbH, Trippstadter Straße 122, 67663 Kaiserslautern, Germany, firstname.lastname@dfki.de)

Abstract: ConTag is an approach to generate semantic tag recommendations for documents based on Semantic Web ontologies and Web 2.0 services. We designed and implemented a process to normalize documents to RDF format, extract document topics using Web 2.0 services, and finally match the extracted topics to a Semantic Web ontology. With ConTag we show that the information provided by Web 2.0 services, in combination with a Semantic Web ontology, enables the generation of relevant semantic tag recommendations for documents. The main contribution of this work is a semantic tag recommendation process based on a choreography of Web 2.0 services.

Key Words: Ontology, Web 2.0, Semantic Web, Social Software, Tagging
Category: H.1.1, H.3.3

1 Introduction

In this paper we describe ConTag, a recommendation system to tag or annotate documents with concepts of a Semantic Web ontology. In ConTag, Web 2.0 services providing text and term analysis functions such as phrase extraction, dictionaries, thesauri, classifications and term associations are used to extract the information content of a document. This approach shows that the convergence of Web 2.0 and Semantic Web is worthwhile with regard to Web 2.0 tagging and Semantic Web ontologies. The information provided by Web 2.0 services combined with a Semantic Web ontology enables us to recommend semantic tags for documents.

In Section 2, we explain the state of the art of tagging in a Semantic Web environment. Section 3 describes the architecture of ConTag, including different possibilities of retrieving relevant similarities between document topics and ontology instances. Section 4 provides concrete implementation details. It illustrates the extraction of document topics based on Web 2.0 services and the recommendation of similar ontology instances as semantic tags. The evaluation in Section 5 confirms the statement that the information provided by Web 2.0 services in combination with a Semantic Web ontology enables the generation of relevant semantic tag recommendations for documents. Finally, Section 6 summarizes the approach and outlines future goals.
2 Related Work

ConTag generates tag recommendations based on an underlying Semantic Web ontology. The recommendations may be used, e.g., in a Semantic Desktop application for classifying documents with a personal information model. Tag recommendations are generated by using existing Web 2.0 services. At the moment, we are not aware of any other system performing this task. Therefore we describe the state of the art of tagging in semantic environments.

The Haystack project [Quan et al., 2003] was an early approach to Personal Information Management developed with Semantic Web techniques, comparable to the Personal Information Model Ontology (PIMO) [Sauermann, 2006]. NEPOMUK - The Social Semantic Desktop1 is a project using and building on experiences with gnowsis and the PIMO language/ontology.

Tagging systems such as the bookmarking manager del.icio.us2, the reference manager Connotea [Lund et al., 2005] or the photo sharing service flickr3 enable users to annotate documents with self-defined keywords called tags. The studies [Golder and Huberman, 2005] and [Kipp and Campbell, 2006] point out patterns in tagging systems: tags are not just keywords but symbols for personal concepts. These studies also point out existing semantic difficulties such as managing polysemies and synonyms. In an analysis of tag usage, [Sen et al., 2006] demanded private tags in tagging systems to be used as personal concepts. Bridging the gap between tags and ontologies, the approach of [Schmitz, 2006] described the development of ontologies based on tag usage. The general problem of relating tags and ontologies based on social services is called Folksonomy [Wal, 2004]. In order to define tags in Semantic Web ontologies, Richard Newman introduced a first idea of a tagging ontology in [Newman, 2005]. Existing folksonomies are mined for association rules to retrieve semantic relations between tags using co-occurrences [Schmitz et al., 2006]. PiggyBank [Huynh et al., 2005], CREAM [Handschuh and Staab, 2003] and Annotea [Kahan and Koivunen, 2001] provide RDF-compliant tag or annotation repositories. [Bloehdorn and Hotho, 2004] describes techniques to optimize text classification using semantic information.

As a result of this state-of-the-art analysis, it can be said that it is now possible to annotate documents with tags that are symbols for personal concepts. These expressions may be stored as semantic relations in a Semantic Web ontology.

1 http://nepomuk.semanticdesktop.org
2 http://del.icio.us
3 http://www.flickr.com
3 The semantic tag recommendation system ConTag

In order to generate tag recommendations we used concepts formalized in the PIMO vocabulary. In PIMO, concepts are separated into the two classes Thing (e.g. persons, events, locations, etc.) and ResourceManifestation (music files, documents, etc.). A relation occurrence connects Things to ResourceManifestations, with the following semantics: a thing occurs in a document.

Instances in a PIMO ontology are called things. Entities occurring in documents are called topics. Expressing relevant similarities between things and topics may take four different shapes in ConTag:

Equivalence: A topic corresponds directly to a thing.
Classification: If a topic's class corresponds directly to an ontology class, the topic is recommended as a new thing of that ontology class.
Superordination: If a topic's class does not correspond to any ontology class, the topic is recommended as a new thing of a new ontology class.
Relation: If a topic is semantically related to a thing without being equivalent, a suitable relationship between topic and thing should be proposed.

In the current version of ConTag we focus on realising the similarity case Equivalence. Other semantic relations can be found in [Horak, 2006] and are discussed in future work.

Generally, the idea of using things as tags (instead of labels) entails some basic advantages. Things are identified by URIs and labeled by rdfs:label or by alternative labels pimo:altLabel. By separating the tag's label from its identification, this design overcomes semantic problems such as synonyms, homonyms, acronyms and different spellings, from which current tagging systems suffer. Additionally, things may possess a set of further describing RDF properties, which makes it easier to retrieve similarities.

ConTag is based on a Semantic Tag Recommendation Process (see Fig. 1):

1. During the first step, Normalisation, the document's content is transformed to RDF format to gain a fulltext description. We use the Aperture4 framework to extract data and metadata such as author, creator and creation date.
2. During the second step, Topic Extraction, topics are extracted by requesting Web 2.0 services. This results in a topic map using the SKOS vocabulary (Simple Knowledge Organisation System) [Miles and Brickley, 2005]. In succeeding lookup iterations, each topic entity is enriched by a set of semantic properties, such as definitions and synonyms.

4 http://www.aperture.sourceforge.net
Figure 1: ConTag's Semantic Tag Recommendation Process

3. The third step, Alignment Generation, is based on document classification methods. For each topic in the topic map, several weighted alignment possibilities are computed to retrieve similar things.
4. The fourth step is called Alignment Execution. The alignment scheme is visualized as tag recommendations. The user decides whether to accept or reject recommendations. Accepted recommendations are processed to: (1) create new occurrence relations in case of Equivalence, (2) create new instances in case of Classification, (3) create new classes in case of Superordination, and (4) create new relation types in cases of other semantic relations.

4 Implementation details

The following sections describe parts of the Semantic Tag Recommendation Process, namely Topic Extraction and Alignment Generation. We used the RDF store Sesame 2 to manage ontologies in RDFS and topic maps in SKOS.

4.1 Topic Extraction

The topic extraction step is the most valuable step in the Tag Recommendation Process. It builds a document-specific topic map by executing a Web 2.0 service choreography to extract document entities. The SKOS vocabulary distinguishes between topics that are instances and topics that are classes, similar to the PIMO language, using the relations broaderInstantive and narrowerInstantive. Each topic possesses a name (prefLabel) and alternative labels (altLabel). Each topic may be further explained by fulltext definitions written in natural language using definition.

The topic extraction step is based on querying Web 2.0 services. The choreography starts with extracting relevant keyphrases of the document. At the
moment, Web 2.0 services such as Tagthe.net5, Yahoo's Term Extraction service and Topicalizer6 are used to extract relevant keyphrases. The results are stored in a document-specific topic map.

In a succeeding iteration, for each topic in the topic map, three successive lookups request Web 2.0 services to gather more information:

1. A definition lookup queries web dictionaries such as WordNet for existing definitions. These definitions are copied and attached to their grounding topics to be used in the succeeding hypernym extraction and to provide further explanations.
2. A succeeding hypernym lookup requests a self-written hypernym extraction service called DefTag7 to extract topic classes. These classes are stored as topics and linked to instances using broaderInstantive and narrowerInstantive relations.
3. A third association lookup requests services for word associations concerning each topic. This lookup considers four different services at the moment: (1+2) Two web services hosted by Ontok Wikipedia provide access to the Wikipedia online encyclopedia, a collaborative web dictionary system. (3+4) Two web dictionary services (Moby Thesaurus II, WordNet Dictionary) are requested using the DICT protocol to extract a set of synonyms for a given term.

The topic extraction step results in a document-specific topic map written in SKOS. It describes each topic with definitions and word associations. See [Horak, 2006] for more information about the services used.

4.2 Aligning topics to things

The alignment generation searches for similarities between topics and things. It results in an alignment scheme which is visualized as a list of tag recommendations. In order to express and weight similarities with confidence ratios, we used an ontology alignment vocabulary8.

Based on a topological analysis of PIMO ontologies and document topic maps, we assume that an ontology contains more entities than a topic map. Additionally, ontologies contain class hierarchies, whereas topic maps have a rather flat structure. Therefore we focussed on aligning topics to things by applying hierarchical document classification techniques instead of using topological ontology matching methods. In this paper, we describe a rather simple alignment approach. Other approaches can be found in [Horak, 2006].

5 http://tagthe.net
6 http://www.topicalizer.com
7 http://www.dfki.uni-kl.de/~horak/2006/contag
8 http://phaselibs.opendfki.de/wiki/AlignmentOntology
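
As a rough illustration of what aligning topics to things with confidence ratios can look like for the Equivalence case, the following minimal Python sketch compares normalized labels. It is not ConTag's alignment implementation: the names Topic, Thing, normalize and align, the example URIs, and the weights 1.0 and 0.5 are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """A topic from the document topic map (SKOS-style labels)."""
    pref_label: str
    alt_labels: tuple = ()


@dataclass(frozen=True)
class Thing:
    """An ontology instance, identified by a URI and described by labels."""
    uri: str
    label: str
    alt_labels: tuple = ()


def normalize(label):
    """Lower-case a label and collapse whitespace so that spelling variants match."""
    return " ".join(label.lower().split())


def align(topics, things, pref_weight=1.0, alt_weight=0.5):
    """Return (topic label, thing URI, confidence) triples, best match first.

    The weights are illustrative assumptions: a match between the two main
    labels counts more than a match that involves an alternative label.
    """
    alignments = []
    for topic in topics:
        topic_pref = normalize(topic.pref_label)
        topic_all = {topic_pref} | {normalize(a) for a in topic.alt_labels}
        for thing in things:
            thing_pref = normalize(thing.label)
            thing_all = {thing_pref} | {normalize(a) for a in thing.alt_labels}
            if topic_pref == thing_pref:
                confidence = pref_weight
            elif topic_all & thing_all:
                confidence = alt_weight
            else:
                continue  # no shared label, so no Equivalence candidate
            alignments.append((topic.pref_label, thing.uri, confidence))
    # sorted() is stable, so equally confident alignments keep their input order
    return sorted(alignments, key=lambda a: a[2], reverse=True)


if __name__ == "__main__":
    topics = [Topic("Semantic Web", ("SW",)), Topic("tagging", ("annotation",))]
    things = [
        Thing("urn:example:SemanticWeb", "semantic web"),
        Thing("urn:example:Annotation", "Annotation"),
    ]
    for topic_label, thing_uri, confidence in align(topics, things):
        print(topic_label, "->", thing_uri, confidence)
```

Because a thing is identified by its URI rather than by its label, a match through an alternative label still points to the same thing, which is the advantage of using things as tags described in Section 3.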