On Deriving Tagsonomies: Keyword Relations Coming from crowd Michal barla and maria bielikova Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology Ilkovicova 3. 842 16 Bratislava, Slovakia I Name Surname fiit. stuba sk Abstract. Many keyword-based approaches to text classification, in formation retrieval or even user modeling for adaptive web-based system could benefit from knowledge on relations between various key words which gives further possibilities to compare them, evaluate their distance etc. This paper proposes an approach how to determine keyword rela- tions(mainly a parent-child relationship) by leveraging collective wisdom of the masses, present in data of collaborative(social)tagging systems on the Web. The feasibility of our approach is demonstrated on the data coming from the social bookmarking systems delicious and CiteULik 1 Introduction Phenomenon of the Social Web, with its roots in Web 2.0, is gaining a lot of attention all over the world in both research and practice. We are studying the power and wisdom of masses, when millions of people switched from the pas- sive reading of the content to the active participation in its creation. People are blogging, sharing wikis, connecting themselves in various social applications nd above all-tagging almost everything they get into touch: bookmarks, pho- tographs, videos, publications, blogposts, articles etc. People got used to classify items by ng few simple tags to it and to use those tags for a future retrieval of their favorite items. More, they are often expecting to find a new, yet unseen content by using their own tags. a part of the Web 2.0 success lies in an implicit agreement of masses on a shared(but never explicitly defined) vocabulary used to tag items -folksonomies. At the same time, vast amount of content requires efficient navigation sup- port, content reorganization or filtering -personalization and adaptation of the web. Its efficiency is dependent on adaptive systems ability to capture and maintain user model. A lot of research was devoted to finding the most suitable fexible or most generic and all-encompassing user model representation [1, 2 however, so far we are not aware of any explicit agreement on an ideal model The obvious question, when analyzing the success of Web 2.0, is whether an assignments of keywords(tags)to user instead of to pages (i.e, creation of
On Deriving Tagsonomies: Keyword Relations Coming from Crowd Michal Barla and M´aria Bielikov´a Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology Ilkoviˇcova 3, 842 16 Bratislava, Slovakia {Name.Surname}@fiit.stuba.sk Abstract. Many keyword-based approaches to text classification, information retrieval or even user modeling for adaptive web-based system could benefit from knowledge on relations between various keywords, which gives further possibilities to compare them, evaluate their distance etc. This paper proposes an approach how to determine keyword relations (mainly a parent-child relationship) by leveraging collective wisdom of the masses, present in data of collaborative (social) tagging systems on the Web. The feasibility of our approach is demonstrated on the data coming from the social bookmarking systems delicious and CiteULike. 1 Introduction Phenomenon of the Social Web, with its roots in Web 2.0, is gaining a lot of attention all over the world in both research and practice. We are studying the power and wisdom of masses, when millions of people switched from the passive reading of the content to the active participation in its creation. People are blogging, sharing wikis, connecting themselves in various social applications and above all – tagging almost everything they get into touch: bookmarks, photographs, videos, publications, blogposts, articles etc. People got used to classify items by assigning few simple tags to it and to use those tags for a future retrieval of their favorite items. More, they are often expecting to find a new, yet unseen content by using their own tags. A part of the Web 2.0 success lies in an implicit agreement of masses on a shared (but never explicitly defined) vocabulary used to tag items – folksonomies. At the same time, vast amount of content requires efficient navigation support, content reorganization or filtering – personalization and adaptation of the web. Its efficiency is dependent on adaptive system’s ability to capture and maintain user model. A lot of research was devoted to finding the most suitable, flexible or most generic and all-encompassing user model representation [1, 2], however, so far we are not aware of any explicit agreement on an ideal model representation. The obvious question, when analyzing the success of Web 2.0, is whether an assignments of keywords (tags) to user instead of to pages (i.e., creation of
tag-based user model) could lead to simple, viable and efficient approach to user modeling for adaptive web-based system. The challenge is then to combine the user model with the models of communities coming from the emerging social web and create a solid platform for personalization based on both traditional(e. g as presented in 3), and social approaches, such as the one presented in [4 When considering tag-based user models in a(tag-based)Web 2.0 environ ment, we are facing the need to be able to compare various tags. We might need to compare user characteristics or even whole user models, represented by tags o find similarities between users, which could serve for model maintenance as well as for more complex tasks such as online community creation or recom- mending in a recommender system. We might also need to compare the domai items represented by tags (e. g, a web page)in order to evaluate the explicit implicit feedback 5 and update the user model appropriately In this paper, we propose a method for inferring various relationships be- tween tags, which allows for full-blown usage of the tag-based user model for personalization and adaptation on the Web The paper is structured as follows: In section 2 we explain our approach to finding relationships between tags. Section 3 presents data we acquired and results of experiments we performed. In section 4 we summarize the related works, which served as an inspiration for our algorithm for building hierarchies from folksonomies. Finally, we give conclusions 2 Finding Relationships between Tags Our approach to finding relationships between tags combines three distinct ap- 1. Deriving of parent-child relationships between tags from a given folksonomy 2. Determining similarity between tags by applying spreading activation on the top of the folksonomy graph 3. Interconnecting tags by additional semantic relationships as well as enriching the whole tag corpus by adding external keywords: both coming from the Wordnet lexical database 2.1 Building Hierarchies from Folksonomies lksonomy is defined as a hypergraph 6h: where the set of vertices V=AUTUI and AnT=0, AnI=0, TOI=O and the set of ternary edges E={(a,t,)|a∈A,t∈T,t∈}. A social tagging system can be represented by such a hypergraph with following definitions of the sets A, T and i (we will use them for the rest of the paper as well) Actors(users)A=a1,.,ak] Tags(keywords, concepts)T=ti, t) Items(objects, instances)I=i1,.,imI
tag-based user model) could lead to simple, viable and efficient approach to user modeling for adaptive web-based system. The challenge is then to combine the user model with the models of communities coming from the emerging social web and create a solid platform for personalization based on both traditional (e.g., as presented in [3]), and social approaches, such as the one presented in [4]. When considering tag-based user models in a (tag-based) Web 2.0 environment, we are facing the need to be able to compare various tags. We might need to compare user characteristics or even whole user models, represented by tags, to find similarities between users, which could serve for model maintenance as well as for more complex tasks such as online community creation or recommending in a recommender system. We might also need to compare the domain items represented by tags (e.g., a web page) in order to evaluate the explicit or implicit feedback [5] and update the user model appropriately. In this paper, we propose a method for inferring various relationships between tags, which allows for full-blown usage of the tag-based user model for personalization and adaptation on the Web. The paper is structured as follows: In section 2 we explain our approach to finding relationships between tags. Section 3 presents data we acquired and results of experiments we performed. In section 4 we summarize the related works, which served as an inspiration for our algorithm for building hierarchies from folksonomies. Finally, we give conclusions. 2 Finding Relationships between Tags Our approach to finding relationships between tags combines three distinct approaches: 1. Deriving of parent-child relationships between tags from a given folksonomy; 2. Determining similarity between tags by applying spreading activation on the top of the folksonomy graph; 3. Interconnecting tags by additional semantic relationships as well as enriching the whole tag corpus by adding external keywords; both coming from the Wordnet lexical database. 2.1 Building Hierarchies from Folksonomies Folksonomy is defined as a hypergraph [6] H := hV, Ei, where the set of vertices V = A ∪T ∪I and A ∩T = ∅, A ∩I = ∅, T ∩I = ∅ and the set of ternary edges E = {(a, t, i) | a ∈ A, t ∈ T, i ∈ I}. A social tagging system can be represented by such a hypergraph with following definitions of the sets A, T and I (we will use them for the rest of the paper as well): – Actors (users) A = {a1, ..., ak} – Tags (keywords, concepts) T = {t1, ..., tl} – Items (objects, instances) I = {i1, ..., im}
foremen ntioned hypergraph can be reduced into three bipartite graphs a graph holding the associations between actors and tags(AT), actors and items (AD) and tags and items(TD). For example, the AT valued bipartite graph defined as follows:AT:=(4A×T,Eat),Eat={(a,t)|i∈I:(a,t,i)∈E} Each such bipartite graph XY: =(XY, Exy) can be furthermore folded into two one-mode graphs Gx Ex), Gy =(Vy, Ew), where Vr =X V=Y, and Er={(a,b)|a,b∈X,y∈Y:(a,y)∈Exy,(b,y)∈Ery} Ey={(a,b)|a,b∈Y,丑x∈X:(x,a)∈Exy,(,b)∈Exy} We can furthermore define a weight of an edge eab E Ei in such a one-mode graph as u(eab):=|k|k∈K:(a,k)∈E1k,(b,k)∈Eik}. In other words, the weight w(ea. b) shows the number of times the a and b were linked together in an original bipartite graph By folding aforementioned bipartite graphs, we get six different one-mode graphs, each representing different semantic network encoded in the folksonomy. For example, by folding AT graph through tags, we get a social network of actors(users) based on overlapping sets of tags, where the links are bet people who have used the same tags with weights showing the number of tags they have used in common. Similarly, we can get a social network based on overlapping sets of items(two people are linked if they have tagged the same item, with weight showing the number of items they have tagged in common) In our work, we were primarily interested in folding TI graph through items giving us a semantic network of tags The proposed approach to creation of hierarchy from the folded folksonomy hypergraph as defined above is based on a rather simple assumptions of set theory In an ideal situation, the tag ta is a parent of tag tb if the set of entities (persons or items)classified under tb is a subset of the entities under ta In other words, all items classified under narrow tag also appear under the broader tag Moreover, since our goal was to produce a reusable hierarchy of hich could be mapped to users' interests and use this hierarchy as a basis for reasoning on those interests, we were not interested in tags, which were used only by a small amount of users, even if they were using it quite extensively. We wanted only what "crowd agrees upon"and were filtering-out tags not achieving a certain degree of popularity (i. e, it is not used by at least k% of all users), even if this decision reduced drastically the amount of tags in the resulting hierarchy The algorithm 1 shows the basic idea of our approach using a simple code. First, we create an artificial root of the hierarchy and put it in the set of already processed tags(ordered tags in the algorithm). Then, we process all tags from the folksonomy in the following manner 1. If the tag t does not reach the popularity threshold, we omit it immediately and n to the next one 2. Otherwise, we compute its overlap(intersect)with every tag from the ordered tags set, resulting in identifying the tag to with a maximum overlap
The aforementioned hypergraph can be reduced into three bipartite graphs: a graph holding the associations between actors and tags (AT), actors and items (AI) and tags and items (T I). For example, the AT valued bipartite graph is defined as follows: AT := hA × T, Eati, Eat = {(a, t) | ∃i ∈ I : (a, t, i) ∈ E}. Each such bipartite graph XY := hX × Y, Exyi can be furthermore folded into two one-mode graphs GX := hVx, Exi, GY := hVy, Eyi, where Vx = X, Vy = Y , and Ex = {(a, b) | a, b ∈ X, ∃y ∈ Y : (a, y) ∈ Exy, (b, y) ∈ Exy}, Ey = {(a, b) | a, b ∈ Y, ∃x ∈ X : (x, a) ∈ Exy, (x, b) ∈ Exy}. We can furthermore define a weight of an edge eab ∈ Ei in such a one-mode graph as w(eab) := |{k | k ∈ K : (a, k) ∈ Eik, (b, k) ∈ Eik}|. In other words, the weight w(ea,b) shows the number of times the a and b were linked together in an original bipartite graph. By folding aforementioned bipartite graphs, we get six different one-mode graphs, each representing different semantic network encoded in the folksonomy. For example, by folding AT graph through tags, we get a social network of actors (users) based on overlapping sets of tags, where the links are between people who have used the same tags with weights showing the number of tags they have used in common. Similarly, we can get a social network based on overlapping sets of items (two people are linked if they have tagged the same item, with weight showing the number of items they have tagged in common). In our work, we were primarily interested in folding T I graph through items, giving us a semantic network of tags. The proposed approach to creation of hierarchy from the folded folksonomy hypergraph as defined above is based on a rather simple assumptions of set theory: In an ideal situation, the tag ta is a parent of tag tb if the set of entities (persons or items) classified under tb is a subset of the entities under ta. In other words, all items classified under narrow tag also appear under the broader tag. Moreover, since our goal was to produce a reusable hierarchy of tags, which could be mapped to users’ interests and use this hierarchy as a basis for reasoning on those interests, we were not interested in tags, which were used only by a small amount of users, even if they were using it quite extensively. We wanted only what “crowd agrees upon” and were filtering-out tags not achieving a certain degree of popularity (i.e., it is not used by at least k% of all users), even if this decision reduced drastically the amount of tags in the resulting hierarchy. The algorithm 1 shows the basic idea of our approach using a simple pseudocode. First, we create an artificial root of the hierarchy and put it in the set of already processed tags (ordered tags in the algorithm). Then, we process all tags from the folksonomy in the following manner: 1. If the tag t does not reach the popularity threshold, we omit it immediately and pass on to the next one. 2. Otherwise, we compute its overlap (intersect) with every tag from the ordered tags set, resulting in identifying the tag to with a maximum overlap
3. The parent-child or sibling relationship is established if the ratio of maximum overlap to the overall use of the tag t reaches the pre-defined threshold. The roles of the tags in relationship is determined as follows: if the tag to is used significantly more often than t, we declare t to be a child of to and vice versa If the usage of both tags is more or less equal, we declare them as sibling Before the assignment of relationships, we check whether it will not "break the context in the hierarchy, i. e, we compute an average overlap of all ances- tors or children in the branch respectively. If the average overlap falls below the threshold, we create a duplicate of tag to, assign tag t to it appropriately and make a new branch of it. The creation of the duplicate aims at solving the homonymy problem, where the to tag has multiple meanings, depending on a context Spreading Activation Spreading activation is a method for associative retrieval [7 in associative and semantic networks. hence a network data structure consisting of nodes and links modeling relationships between nodes. Searching is done by activating the se- lected node with an activation energy and spreading this energy through the edges to its neighbors so that Energylneighbor l- Energylorigin] The process runs recursively until convergence. At the end, the nodes activation levels represent the re of similarity to the initially chosen node or nodes. Spreading activation search in the folksonomy finds new relationships, which are not "visible when considering only set theory assumptions and the algo- rithm 1. If these relations are added to the already existing parent-child rela- tionships, it allows us to make contextual "jumps"between the tags, which we believe could have an interesting impact for the tag- based user modeling and Since the folksonomy does not provide direct links between tags, the spread- ng activation is performed on either the bipartite graph TI (tags-items)or TA (tags-actors)or on a combination of the two, all depending on the definition what neighbors means for the algorithm (i. e, the spreading activation on the Ta graph is performed if the neighbors of a tag are all actors which used that tag). However, due to the vast amount of connections present in every social tagging system, which leads to enormous time and space complexity of the re- cursive process, we needed to select a sub-graph of the whole folksonomy. For instance, for the TI bipartite graph, we select the sub-graph by modifying the neighbors function of a tag/item so that it will return only popular items/tags where popularity of a tag or item is defined(following the same principles as in algorithm 1)as a ratio of actors which used that particular tag or tagge that particular item. When spreading through TA graph, we define an actor as popular if he or she used at least k% of popular tags and tagged at least l% of popular items
3. The parent-child or sibling relationship is established if the ratio of maximum overlap to the overall use of the tag t reaches the pre-defined threshold. The roles of the tags in relationship is determined as follows: if the tag to is used significantly more often than t, we declare t to be a child of to and vice versa. If the usage of both tags is more or less equal, we declare them as siblings. 4. Before the assignment of relationships, we check whether it will not “break” the context in the hierarchy, i.e., we compute an average overlap of all ancestors or children in the branch respectively. If the average overlap falls below the threshold, we create a duplicate of tag to, assign tag t to it appropriately and make a new branch of it. The creation of the duplicate aims at solving the homonymy problem, where the to tag has multiple meanings, depending on a context. 2.2 Finding Related Tags by Spreading Activation Spreading activation is a method for associative retrieval [7] in associative and semantic networks, hence a network data structure consisting of nodes and links modeling relationships between nodes. Searching is done by activating the selected node with an activation energy and spreading this energy through the edges to its neighbors so that Energy[neighbori ] = Energy[origin] |neighbors| The process runs recursively until convergence. At the end, the nodes’ activation levels represent the measure of similarity to the initially chosen node or nodes. Spreading activation search in the folksonomy finds new relationships, which are not “visible” when considering only set theory assumptions and the algorithm 1. If these relations are added to the already existing parent-child relationships, it allows us to make contextual “jumps” between the tags, which we believe could have an interesting impact for the tag-based user modeling and personalization. Since the folksonomy does not provide direct links between tags, the spreading activation is performed on either the bipartite graph T I (tags-items) or T A (tags-actors) or on a combination of the two, all depending on the definition what neighbors means for the algorithm (i.e., the spreading activation on the T A graph is performed if the neighbors of a tag are all actors which used that tag). However, due to the vast amount of connections present in every social tagging system, which leads to enormous time and space complexity of the recursive process, we needed to select a sub-graph of the whole folksonomy. For instance, for the T I bipartite graph, we select the sub-graph by modifying the neighbors function of a tag/item so that it will return only popular items/tags, where popularity of a tag or item is defined (following the same principles as in algorithm 1) as a ratio of actors which used that particular tag or tagged that particular item. When spreading through T A graph, we define an actor as popular if he or she used at least k% of popular tags and tagged at least l% of popular items
Algorithm 1: Creation of tag hierarchy Result: Popular tags structured in a hierarchy create root ordered tags←root for each tag t do t if t popularity popularity threshold then t。←mnar( tatar,vtag∈ ordered tags) if ol overlap threshold th if rightcontert? (to ancestors, t) then I to children←t root, children←toc end else if Itol<t then f rightcontert? (to children, t) then t parent to parent children←to create a copy toc of to root, children end else end end else root. children←t end end 2.3 Applying Knowledge on Keywords from Wordnet ordnet 8(wordnet. princeton. edu) is a large lexical database for the Er glish language developed and maintained by Cognitive Science Laboratory at Princeton University. It groups nouns, verbs, adjectives and adverbs into sets of cognitive synonyms(synsets). Moreover, it provides conceptual-semantic and
Algorithm 1: Creation of tag hierarchy Data: a folksonomy Result: Popular tags structured in a hierarchy create root ordered tags ← root for each tag t do t.popularity = |t.users| |overallusers| if t.popularity < popularity threshold then continue end to ← max( t ∩ tagx, ∀tagx ∈ ordered tags ) if |to| |t| > overlap threshold then if |to| |t| then if rightcontext?(to.ancestors, t) then to.children ← t end else create a copy toc of to root.children ← toc to.children ← t end end else if |to| |t| then if rightcontext?(to.children, t) then t.parent = to.parent t.children ← to end else create a copy toc of to t.children ← toc root.children ← t end end else to.sibling ← t end end else root.children ← t end end 2.3 Applying Knowledge on Keywords from Wordnet Wordnet [8] (wordnet.princeton.edu) is a large lexical database for the English language developed and maintained by Cognitive Science Laboratory at Princeton University. It groups nouns, verbs, adjectives and adverbs into sets of cognitive synonyms (synsets). Moreover, it provides conceptual-semantic and