Vocabulary Patterns in Free-for-all Collaborative Indexing Systems

Wolfgang Maass, Tobias Kowatsch, and Timo Münster
Hochschule Furtwangen University (HFU)
Robert-Gerwig-Platz 1, D-78120 Furtwangen, Germany
{wolfgang.maass,tobias.kowatsch,timo.muenster}@hs-furtwangen.de

ESOE, Busan - Korea, November 2007

Abstract. In collaborative indexing systems users generate a large amount of metadata by labelling web-based content. These labels are known as tags and form a shared vocabulary. In order to understand the characteristics of that vocabulary, we study structural patterns of these tags by applying the theory of self-organizing systems. To this end, we use graph-theoretic notions to model the network of tags and their valued connections, which represent frequency rates of co-occurring tags. Empirical data is provided by the free-for-all collaborative indexing systems Delicious, Connotea and CiteULike. First, we measure the frequency distribution of co-occurring tags. Secondly, we correlate these tags with respect to their rank over time. Results indicate a strong relationship among a few tags as well as a notable persistence of these tags over time. Therefore, we make the educated guess that the observed collaborative indexing systems are self-organizing systems that build toward a shared vocabulary. The results imply the presence of semantic domains based on high frequency rates of co-occurring tags, which reflect topics of interest among the user community. When those semantic domains are observed over time, that information can be used to trace the historical development of the community's interests and to identify trends, thus enhancing collaborative indexing systems in general and at the same time providing a new tool for developing community-based products and services.

Key words: Metadata, tagging, shared vocabulary, online community, collaborative software, self-organizing system

1 Introduction

Cooperative, distributed labelling of content on the World Wide Web is called collaborative indexing or social tagging. Within a collaborative indexing system, users annotate different kinds of content, e.g. events (http://upcoming.org), video clips (http://youtube.com), music (http://last.fm), pictures (http://flickr.com, http://espgame.org),
articles and references (http://citeulike.org, http://connotea.org, http://bibsonomy.org), weblogs (http://technorati.com) or websites (http://del.icio.us, http://myweb.yahoo.com). These collaborative indexing systems facilitate mass categorization and establish so-called folksonomies, i.e. bottom-up categorizations made by a large user base.

A collaborative indexing system basically has two features. First, it is used for future retrieval of self-indexed content. Secondly, it provides recommendations based upon the co-occurrence of highly used tags within all annotations, where we call one single process of annotation an indexing task. (Other recommender implementations may also exist, but we focus on the co-occurrence of highly used tags because this information is freely accessible on the web.) The recommendations are shown to the user when a tag query is submitted. For instance, content tagged with html will frequently be tagged with css as well.

The data collected within an indexing task contains the name of the user, a URL linking to the content, one or more tags and time-stamp information. Therefore, the data within a collaborative indexing system is basically a network of users, tags and content in a given period of time. All tags together represent the shared vocabulary of the user community. In this paper we study the structural patterns of that vocabulary, thus focusing only on the partial network of tags. Analyzing this partial network requires some constructs from graph theory.

We assume the shared vocabulary to be a self-organizing system in the sense of systems theory [1]. Hence, we determine stable patterns as well as specific correlations throughout the vocabulary. In addition, implications of these patterns are presented. To meet the requirements of self-organizing systems by reducing external restrictions and forces, we choose the free-for-all collaborative indexing systems Delicious, Connotea and CiteULike for empirical data extraction, where any user can index any content element. Thus, indexing rights, a design dimension identified by Marlow et al. [2], are not restricted.

This paper starts with related work covering collaborative indexing systems and systems theory. Then, we state two hypotheses regarding stable patterns within the vocabulary. Afterwards, we build a model based on graph-theoretic notions, clarify the methodological approach and present the empirical data used to test the hypotheses. Subsequently, we present and discuss the results of our analysis and draw implications from them. Finally, we give an outlook on further research.

International Workshop on Emergent Semantics and Ontology Evolution

2 Related Work

A general review of collaborative indexing systems is given by Voss [3]. Mathes [4] discusses the organization of information via tags and points out that user-generated metadata is of an uncontrolled nature and fundamentally chaotic compared to a controlled vocabulary. But he also mentions that collaborative indexing systems are highly responsive to the users' needs and their vocabulary by involving them in the process of organization. Vander Wal [5] distinguishes between broad and narrow folksonomies depending on the number of users who tag one specific content element. He also defines the difference between pure tagging and folksonomy tagging.

Voss [6] discovers power law distributions of tag frequency rates in Delicious and Wikipedia, supporting the presence of self-organizing systems. Hotho et al. [7] and Quintarelli [8] also find power law distributions in collaborative indexing systems. Lund et al. [9] measure a power law distribution of user-shared tags within Connotea. Results of Golder and Huberman [10] show regularities of dynamic structures within Delicious. Moreover, they introduce a classification of the semantics of tags, as do Zhichen et al. [11].

Wu et al. [12] highlight the potential of collaborative indexing systems as a technological infrastructure for acquiring social knowledge. Millen et al. [13] study the deployment of a collaborative indexing system within a company and highlight the remarkable acceptance rate among users as well as its personal and organizational usefulness. In addition, Damianos et al. [14], Farrell and Lau [15] as well as John and Seligmann [16] examine the potential of collaborative indexing systems for the enterprise, covering people's expertise, social networks and the integration of those systems into existing collaborative applications.

An early classification of collaborative indexing systems is given by Hammond et al. [17], contrasting scholarly and general resources with links and web pages. In a more detailed classification, Marlow et al. [2] differentiate system designs and present several user incentives.

Heymann and Garcia-Molina [18] develop an algorithm that generates a hierarchical taxonomy of a tag network.
For the same purpose, Mika [19] uses social network analysis on the network of users, tags and content. Hotho et al. [7] develop a search algorithm for folksonomies to find communities of interest within collaborative indexing systems. Cattuto et al. [20] design a stochastic model for the analysis of indexing tasks over time consisting of tags and users. Dubinko et al. [21] visualize tags over time with data from Flickr, whereas Zhichen et al. [11] propose an algorithm for tag suggestions to support the user within an indexing task. An overview of self-organizing systems is given by Heylighen [1].

3 Motivation

As mentioned above, this paper deals with the partial network of tags. The concept of tags is central to collaborative indexing systems. The same tags used by different users to annotate similar content show a common understanding among the users. The set of all tags used by the user community represents the shared vocabulary. Users and content elements are linked to each other through tags, which are also directly connected when they are used together within one indexing task. Figures 1 and 2 show such an indexing task as well as the resulting network of the tags sports, worldcup and soccer. In the current work, the value of those tag connections is an essential dimension, which
is based on the frequency rate of tags co-occurring within all indexing tasks. A prerequisite for measuring this frequency is the bag-model for aggregation of tags, in which multiple tags can be assigned to one resource by multiple users, as discussed by Marlow et al. [2].

Fig. 1. Graphical input mask for an indexing task (fields: tags "sports worldcup soccer"; description "FIFAworldcup.com The Official Site of FIFA World Cup"; url "http://www.fifaworldcup.com")

Fig. 2. Resulting network of the indexing task in Fig. 1 (nodes: worldcup, sports, soccer)

Prior work on stable patterns suggests that collaborative indexing systems are self-organizing systems [10, 2, 6, 8, 9]. The vocabulary, which consists of tags and is generated within all indexing tasks by all users, is a part of this system and organizes its structure by itself, without a centralized control mechanism. The users of a collaborative indexing system generate this vocabulary in a decentralized manner, without even being aware of it. On its own, this system evolves over time into a more stable state.

In contrast to the aforementioned work, we explore patterns emerging from co-occurring tags. We therefore want to know whether the power law distribution, which is common in broad folksonomies [7, 9, 8], also applies to the structure of co-occurring tags. This would represent a community's vocabulary that consists of a few tags co-occurring with high frequency rates and many tags co-occurring with low frequency rates. Such a pattern, which we call tag economics, would indicate a strong consensus on a particular subpart of the community's vocabulary, from which particular interests of the users can be identified. Based on these considerations, we hypothesize the relation of co-occurring tags as follows:

H1 Let $T_i$ be a tag and $T_j^i$ all tags co-occurring with $T_i$. Then the ranked frequency distribution of all valued connections from $T_i$ to $T_j^i$ follows a power law curve.
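As a rough illustration of what H1 asks, the following sketch counts co-occurring tag pairs over a handful of indexing tasks, ranks the connections of one tag, and estimates the slope of the ranked distribution on a log-log scale. The function names, the toy tasks and the use of an ordinary least-squares fit are assumptions made here for illustration only; they are not taken from the paper, and a straight line in log-log space is merely a first diagnostic, not a rigorous test for a power law.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tasks):
    """Count how often each unordered tag pair appears in the same indexing task."""
    counts = Counter()
    for tags in tasks:
        # Deduplicate and sort so that (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def ranked_frequencies(counts, tag):
    """Return the co-occurrence frequencies of `tag`, largest first."""
    freqs = [n for pair, n in counts.items() if tag in pair]
    return sorted(freqs, reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy indexing tasks (hypothetical data; only the tag lists are kept).
tasks = [
    ["sports", "worldcup", "soccer"],
    ["sports", "soccer"],
    ["sports", "soccer", "fifa"],
    ["sports", "worldcup"],
]
counts = cooccurrence_counts(tasks)
sports_freqs = ranked_frequencies(counts, "sports")
print(sports_freqs)  # [3, 2, 1]

# A synthetic, exactly power-law-shaped sequence f(r) = 100 / r^2.
synthetic = [100 / r**2 for r in range(1, 5)]
print(loglog_slope(synthetic))
```

For the synthetic sequence the estimated slope is close to -2, the exponent used to generate it. On real data, the ranked frequencies of a tag would be inspected in the same way, with the caveat that a proper power-law assessment needs far more observations than this toy example.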

Additionally, we focus on the frequency dynamics of tags over time depending on their position in the aforementioned frequency distribution. We assume that tags co-occurring with high frequency rates (higher position on the power law curve) are more stable over time than tags co-occurring with low frequency rates. This would represent persistence of the community's interests; conversely, when tags with high frequency rates drop to a low position, this suggests a shift in the community's common understanding. Therefore, the second objective of the current work is to examine the relationship of the frequency rates of co-occurring tags over time. We hypothesize this relationship as follows:

H2 The higher the frequency rates of the tags $T_j^i$, the more stable they are over time.
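To make the notion of stability in H2 more concrete, the sketch below compares the rank of each co-occurring tag in two periods and averages the absolute rank change separately for high-frequency and low-frequency tags. The measure (mean absolute rank shift), the function names and the frequency values are hypothetical choices for illustration; the sketch does not reproduce the procedure used in the paper.

```python
def ranks(freqs):
    """Map each tag to its rank (1 = most frequent); ties are broken by tag name."""
    ordered = sorted(freqs, key=lambda tag: (-freqs[tag], tag))
    return {tag: position for position, tag in enumerate(ordered, start=1)}


def mean_rank_shift(period_a, period_b, tags):
    """Average absolute rank change of `tags` between two periods."""
    ranks_a, ranks_b = ranks(period_a), ranks(period_b)
    shifts = [
        abs(ranks_a[t] - ranks_b[t])
        for t in tags
        if t in ranks_a and t in ranks_b  # only tags observed in both periods
    ]
    return sum(shifts) / len(shifts)


# Hypothetical co-occurrence frequencies with the tag "sports" in two periods.
period_1 = {"soccer": 40, "worldcup": 25, "fifa": 3, "tickets": 2, "stadium": 1}
period_2 = {"soccer": 42, "worldcup": 30, "fifa": 1, "tickets": 2, "stadium": 4}

head = ["soccer", "worldcup"]            # high frequency rates
tail = ["fifa", "tickets", "stadium"]    # low frequency rates

head_shift = mean_rank_shift(period_1, period_2, head)
tail_shift = mean_rank_shift(period_1, period_2, tail)
print(head_shift, tail_shift)
```

In this made-up example the two high-frequency tags keep their ranks while the low-frequency tags move by more than one position on average, which is the kind of pattern H2 predicts. Whether real data behaves this way is exactly what the hypothesis leaves to be tested.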

4 Model

Let an indexing task be a quadruple <user, url, timestamp, tag*>: one user enters a URL with zero, one or more tags into a collaborative indexing system at a certain time. Only two of these entities are relevant to our hypotheses, namely timestamp and tag. Therefore, the community's vocabulary is modelled as an undirected, valued and finite graph V within a given period of time δ. This period of time is essential, because the frequency of time-based indexing tasks is subject to fluctuations, which occur in the course of a day, a week or a month. Furthermore, δ can be used to directly control the size of the vocabulary V to ease the analysis.

The vocabulary V consists of a set of nodes (here tags) and a set of valued links, which represent the frequency values of co-occurring tags. Hence, we also refer to this vocabulary as the network of tags. The links are undirected, since a tag i co-occurring with a tag j implies that the tag j co-occurs with the tag i. To better handle these frequency values, the vocabulary can be described by a symmetric frequency matrix F, such that the value in the ith row and jth column represents the frequency rate of the co-occurring tags i and j over all indexing tasks within δ, denoted as f(i, j). Self-references are excluded since we focus only on co-occurring tags. Thus, the diagonal values f(i, j) with i = j are always zero. Figure 3 exemplifies an undirected, valued graph of the vocabulary V, whereas Fig. 4 shows the corresponding frequency matrix F. Based upon this graph-theoretic notion and the corresponding frequency matrix, we are able to illustrate and compute the frequency distribution of co-occurring tags.

Fig. 3. Undirected, valued graph of the vocabulary V including 5 tags

Fig. 4. Corresponding frequency matrix F of the vocabulary V

4.1 Method

A frequency matrix $F(\delta_1)$ is built within a given period of time. Afterwards, the frequency values f(i, j) for each tag $T_i$ are summed up. Subsequently, those