Mining Association Rules in Folksonomies Christoph Schmitz, Andreas Hotho, Robert Jaschke,2, Gerd Stumme, 2 1 Knowledge Data Engineering Group, Department of Mathematics and Computer Science, University of Kassel, wilhelmshher Allee 73. D-34121 Kassel Germanyhttp://www.kde.cs.uni-kassel.de 2 Research Center L3S, Expo Plaza 1, D-30539 Hannover, Germany, Abstract. Social bookmark tools are rapidly emerging on the Web. In such syster users are setting up lightweight conceptual structures called folksonomies. These sys- tems provide currently relatively few structure. We discuss in this paper, how assc ciation rule mining can be adopted to analyze and structure folksonomies, and how the results can be used for ontology learning and supporting emergent semantics. We demonstrate our approach on a large scale dataset stemming from an online system 1 Introduction A new family of so-called Web 2.0applications is currently emerging platforms like Wikis, Blogs, and social resource sharing systems. In this paper e the Web. These include user-centric publishing and knowledge manageme we focus on resource sharing systems, which all use the same kind of lightweight knowledge representation, called folksonomy. The word'folksonomy'is a blend of the words 'taxonomy'andfolk', and stands for conceptual structures created by the people Resource sharing systems, such as Flickr or del icio us, have acquired large numbers of users(from discussions on the del icio us mailing list, one can ap- proximate the number of users on del icio us to be more than one hundred thousand) within less than two years. The reason for their immediate success is the fact that no specific skills are needed for participating, and that these tools yield immediate benefit for each individual user (e. g. organizing ones bookmarks in a browser-independent, persistent fashion) without too much overhead. Large numbers of users have created huge amounts of information within a very short period of time. As these systems grow larger, however, the users feel the need for more structure for better organizing their resources. For instance, approaches for tagging tags, or for bundling them, are currently dis- cussed on the corresponding news groups. Currently, however, there is a lack of theoretical foundations adapted to the new opportunities which has to be A first step towards more structure within such systems is to discover knowl- edge that is already implicitly present by the way different users assign tags to resources. This knowledge may be used for recommending both a hierarchy Ihttp://www.flickr.com http://del.icio.us
Mining Association Rules in Folksonomies Christoph Schmitz1 , Andreas Hotho1 , Robert J¨aschke1,2 , Gerd Stumme1,2 1 Knowledge & Data Engineering Group, Department of Mathematics and Computer Science, University of Kassel, Wilhelmshher Allee 73, D–34121 Kassel, Germany, http://www.kde.cs.uni-kassel.de 2 Research Center L3S, Expo Plaza 1, D–30539 Hannover, Germany, http://www.l3s.de Abstract. Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. These systems provide currently relatively few structure. We discuss in this paper, how association rule mining can be adopted to analyze and structure folksonomies, and how the results can be used for ontology learning and supporting emergent semantics. We demonstrate our approach on a large scale dataset stemming from an online system. 1 Introduction A new family of so-called “Web 2.0” applications is currently emerging on the Web. These include user-centric publishing and knowledge management platforms like Wikis, Blogs, and social resource sharing systems. In this paper, we focus on resource sharing systems, which all use the same kind of lightweight knowledge representation, called folksonomy. The word ‘folksonomy’ is a blend of the words ‘taxonomy’ and ‘folk’, and stands for conceptual structures created by the people. Resource sharing systems, such as Flickr1 or del.icio.us,2 have acquired large numbers of users (from discussions on the del.icio.us mailing list, one can approximate the number of users on del.icio.us to be more than one hundred thousand) within less than two years. The reason for their immediate success is the fact that no specific skills are needed for participating, and that these tools yield immediate benefit for each individual user (e.g. organizing ones bookmarks in a browser-independent, persistent fashion) without too much overhead. Large numbers of users have created huge amounts of information within a very short period of time. As these systems grow larger, however, the users feel the need for more structure for better organizing their resources. For instance, approaches for tagging tags, or for bundling them, are currently discussed on the corresponding news groups. Currently, however, there is a lack of theoretical foundations adapted to the new opportunities which has to be overcome. A first step towards more structure within such systems is to discover knowledge that is already implicitly present by the way different users assign tags to resources. This knowledge may be used for recommending both a hierarchy 1 http://www.flickr.com/ 2 http://del.icio.us
Date Bsarbeten Ansicht Geh 中·命·自p r's Bibsonomy::r bookmarks BibTex Semigroup or Fig 1. Bibsonomy displays bookmarks and(BIBTEXbased) bibliographic references on the already existing tags, and additional tags, ultimately leading towards emergent semantics(Staab et al.(2002), Steels(1998)) by converging use of the same vocabulary. In this sense, knowledge discovery(KDD)techniques are a promising tool for bottom-up building of conceptual structures In this paper, we will focus on a selected KDD technique, namely association rules. Since folksonomies provide a three-dimensional dataset(users, tags, and resources)instead of a usual two-dimensional one(items and transactions we present first a systematic overview of projecting a folksonomy onto a two- dimensional structure. Then we will show the results of mining rules from two selected projections on the del icio us system This paper is organized as follows. Section 2 reviews recent developments in the area of social bookmark systems, and presents a formal model In Section 3 we briefly recall the notions of association rules, before providing a systematic overview over the projections of a folksonomy onto a two-dimensional dataset in Section 4 In Section 5, we present the results of mining association rules on data of the del icio us system. Section 6 concludes the paper with a discussion of further research topics on knowledge discovery within folksonomies 2 Social Resource Sharing and Folksonomies ocial resource sharing systems are web-based systems that allow users to up- load their resources, and to label them with names. The systems can be distin- guished according to what kind of resources are supported. Flickr, for instance allows the sharing of photos, del.icio us" the sharing of bookmarks, CiteULike and Connotea the sharing of bibliographic references, and 43Things?'even the sharing of goals in private life. Our own upcoming system, called Bib Sonomy, will allow to share simultaneously bookmarks and BIBTEX entries(see Fig. 1) In their core, these systems are all very similar. Once a user is logged in, he can add a resource to the system, and assign arbitrary labels, so-called 3http://www.flickr.com/ OUs tp://www.citeulike.or http://www.connotea.org/ 7http://www.43things.com/ http://www.bibsonomy.org 2
Fig. 1. Bibsonomy displays bookmarks and (BibTEXbased) bibliographic references simultaneously. on the already existing tags, and additional tags, ultimately leading towards emergent semantics (Staab et al. (2002), Steels (1998)) by converging use of the same vocabulary. In this sense, knowledge discovery (KDD) techniques are a promising tool for bottom-up building of conceptual structures. In this paper, we will focus on a selected KDD technique, namely association rules. Since folksonomies provide a three-dimensional dataset (users, tags, and resources) instead of a usual two-dimensional one (items and transactions), we present first a systematic overview of projecting a folksonomy onto a twodimensional structure. Then we will show the results of mining rules from two selected projections on the del.icio.us system. This paper is organized as follows. Section 2 reviews recent developments in the area of social bookmark systems, and presents a formal model. In Section 3, we briefly recall the notions of association rules, before providing a systematic overview over the projections of a folksonomy onto a two-dimensional dataset in Section 4. In Section 5, we present the results of mining association rules on data of the del.icio.us system. Section 6 concludes the paper with a discussion of further research topics on knowledge discovery within folksonomies. 2 Social Resource Sharing and Folksonomies Social resource sharing systems are web-based systems that allow users to upload their resources, and to label them with names. The systems can be distinguished according to what kind of resources are supported. Flickr,3 for instance, allows the sharing of photos, del.icio.us4 the sharing of bookmarks, CiteULike5 and Connotea6 the sharing of bibliographic references, and 43Things7 even the sharing of goals in private life. Our own upcoming system, called BibSonomy, 8 will allow to share simultaneously bookmarks and BibTEX entries (see Fig. 1). In their core, these systems are all very similar. Once a user is logged in, he can add a resource to the system, and assign arbitrary labels, so-called 3 http://www.flickr.com/ 4 http://del.icio.us/ 5 http://www.citeulike.org/ 6 http://www.connotea.org/ 7 http://www.43things.com/ 8 http://www.bibsonomy.org 2
tags, to it. We call the collection of all his assignments his personomy, and the collection of all personomies is called folksonomy. The user can also explore the personomies of other users in all dimensions: for a given user he can see the resources that user has uploaded, together with the tags he has assigned to them(see Fig. 1);when clicking on a resource he sees which other users have uploaded this resource and how they tagged it; and when clicking on a tag he sees who assigned it to which resources The systems allow for additional functionality. For instance, one can copy a resource from another user, and label it with one owns tags. Overall, these systems provide a very intuitive navigation through the data 2. 1 State of the art web collaboration systems. Among the rare exceptions are Hammond et al (2005)and Lund et al.(2005) who provide good overviews of social book marking tools with special emphasis on folksonomies, and Mathes(2004)who discusses strengths and limitations of folksonomies. The main discussion on folksonomies and related topics is currently only going on mailing lists, e. g. Con- note(2005). To the best of our knowledge, the ideas presented in this paper have not been explored before, but there is a lot of recent work dealing with folksonomies Mika(2005) defines a model of semantic-social networks for extracting light eight ontologies from del icio us besides calculating measures like the cluster ing coefficient, (local)betweenness centrality or the network constraint on the extracted one-mode network, Mika uses co-occurence techniques for clustering e conce eton There are several systems working on top of del icio us to explore the un- derlying folksonomy. Collaborative Rank provides ranked search results on top of del icio us bookmarks. The ranking takes into account, how early someone bookmarked an URL and how many people followed him or her. Other sys- tems show popular sites(pop )of statistics about delicio. ulicious0)or focus on graphical representations 2.2 a Formal model for folksonomies A folksonomy basically describes users, resources, tags, and allows users to assign(arbitrary) tags to resources. We present here a formal definition of folksonomies, which is also underlying our BibSonomy system Definition 1. A folksonomy is a tuple F:=(U,T, R,Y, < )where .U, T, and R are finite sets, whose elements are called users, tags and resources, resp. °htt/ ollabrank org/ http://populicio.us http://cloudalicio.us/ http://www.neuroticweb.com/recursos/del.icio.us-graphs/ 3
tags, to it. We call the collection of all his assignments his personomy, and the collection of all personomies is called folksonomy. The user can also explore the personomies of other users in all dimensions: for a given user he can see the resources that user has uploaded, together with the tags he has assigned to them (see Fig. 1); when clicking on a resource he sees which other users have uploaded this resource and how they tagged it; and when clicking on a tag he sees who assigned it to which resources. The systems allow for additional functionality. For instance, one can copy a resource from another user, and label it with one owns tags. Overall, these systems provide a very intuitive navigation through the data. 2.1 State of the Art There are currently virtually no scientific publications about folksonomy-based web collaboration systems. Among the rare exceptions are Hammond et al. (2005) and Lund et al. (2005) who provide good overviews of social bookmarking tools with special emphasis on folksonomies, and Mathes (2004) who discusses strengths and limitations of folksonomies. The main discussion on folksonomies and related topics is currently only going on mailing lists, e.g. Connotea (2005). To the best of our knowledge, the ideas presented in this paper have not been explored before, but there is a lot of recent work dealing with folksonomies. Mika (2005) defines a model of semantic-social networks for extracting lightweight ontologies from del.icio.us. Besides calculating measures like the clustering coefficient, (local) betweenness centrality or the network constraint on the extracted one-mode network, Mika uses co-occurence techniques for clustering the concept network. There are several systems working on top of del.icio.us to explore the underlying folksonomy. CollaborativeRank9 provides ranked search results on top of del.icio.us bookmarks. The ranking takes into account, how early someone bookmarked an URL and how many people followed him or her. Other systems show popular sites (Populicious10) or focus on graphical representations (Cloudalicious11, Grafolicious12) of statistics about del.icio.us. 2.2 A Formal Model for Folksonomies A folksonomy basically describes users, resources, tags, and allows users to assign (arbitrary) tags to resources. We present here a formal definition of folksonomies, which is also underlying our BibSonomy system. Definition 1. A folksonomy is a tuple F := (U, T, R, Y, ≺) where • U, T , and R are finite sets, whose elements are called users, tags and resources, resp., 9 http://collabrank.org/ 10 http://populicio.us/ 11 http://cloudalicio.us/ 12 http://www.neuroticweb.com/recursos/del.icio.us-graphs/ 3
Y is a ternary relation between them, i.e.,YCUxTXR, called assign ments and < is a user-specific subtag/supertag-relation, i. e, <CUX(TxT)\(t, t) t∈T}) The personomy Pu of a given user u E U is the restriction of F to u, P:=(T,Rn,l,xu) with I:={(t,r)∈T×R|(u,t,r)∈Y},Tn:=丌1(Ix) Ru:=丌2(lu), and <u:={(t1,t2)∈T×T(u,t,t2)∈} Users are typically described by their user ID, and tags may be arbitrary strings. What is considered as a resource depends on the type of system. In delicio us. for instance. the resources are URLs. and in Flickr. the resources are pictures. In our BibSonomy system, we have two types of resources, book marks and BIBTEXentries. From an implementation point of view, resources are internally represented by some ID In this paper, we do not make use of the subtag/ supertag relation for sake of simplicity. I.e., <=0, and we will simply note a folksonomy as a quadruple F: =(U,T, R, Y). This structure is known in Formal Concept Analysis(wille (1982), Ganter and Wille(1999)) as a triadic contert(Lehmann and Wille (1995), Stumme(2005)). An equivalent view on folksonomy data is that of a ripartite (undirected) hypergraph G=(V, E), where V= UUTUR is the set of nodes, and E= u, t, rl(u,t, r)EY is the set of hyperedges 2.3 Del.ico. us- A Folksonomy-Based Social Bookmark System In order to evaluate our folksonomy mining approach, we have analyzed the popular social bookmarking sytem del icio us. Delicio us is a server-based sys- tem with a simple-to-use interface that allows users to organize and share book- marks on the internet. It is able to store in addition to the url a description, a note, and tags(i. e, arbitrary labels). We chose delicio. us rather than our own system, Bibsonomy, as the latter is going online only after the time of writing of this article. For our experiments, we collected from the del. ico. us system U= 75, 242 users, T= 533, 191 tags and R=3, 158, 297 resources, related by in total Y= 17, 362, 212 triples 3 Association Rule Mining We assume here that the reader is familiar with the basics of association rule mining introduced by Agrawal et al. (1993). As the work presented in this pa- per is on the conceptual rather than on the computational level, we refrain in particular from describing the vast area of developing efficient algorithms Many of the existing algorithms can be found at the Frequent Itemset Min- ing Implementations Repository. Instead, we just recall the definition of the association rule mining problem, which was initially stated by Agrawal et al. (1993), in order to clarify the notations used in the following. We will not use http://fimi.cs.helsinkifi/ 4
• Y is a ternary relation between them, i. e., Y ⊆ U × T × R, called assignments, and • ≺ is a user-specific subtag/supertag-relation, i. e., ≺⊆ U ×((T ×T )\ {(t, t) | t ∈ T }). The personomy Pu of a given user u ∈ U is the restriction of F to u, i. e., Pu := (Tu, Ru, Iu, ≺u) with Iu := {(t, r) ∈ T × R | (u, t, r) ∈ Y }, Tu := π1(Iu), Ru := π2(Iu), and ≺u:= {(t1, t2) ∈ T × T | (u, t1, t2) ∈≺}. Users are typically described by their user ID, and tags may be arbitrary strings. What is considered as a resource depends on the type of system. In del.icio.us, for instance, the resources are URLs, and in Flickr, the resources are pictures. In our BibSonomy system, we have two types of resources, bookmarks and BibTEXentries. From an implementation point of view, resources are internally represented by some ID. In this paper, we do not make use of the subtag/supertag relation for sake of simplicity. I. e., ≺ = ∅, and we will simply note a folksonomy as a quadruple F := (U, T, R, Y ). This structure is known in Formal Concept Analysis (Wille (1982), Ganter and Wille (1999)) as a triadic context (Lehmann and Wille (1995), Stumme (2005)). An equivalent view on folksonomy data is that of a tripartite (undirected) hypergraph G = (V, E), where V = U∪˙ T∪˙ R is the set of nodes, and E = {{u, t, r} | (u, t, r) ∈ Y } is the set of hyperedges. 2.3 Del.ico.us — A Folksonomy-Based Social Bookmark System In order to evaluate our folksonomy mining approach, we have analyzed the popular social bookmarking sytem del.icio.us. Del.icio.us is a server-based system with a simple-to-use interface that allows users to organize and share bookmarks on the internet. It is able to store in addition to the URL a description, a note, and tags (i. e., arbitrary labels). We chose del.icio.us rather than our own system, Bibsonomy, as the latter is going online only after the time of writing of this article. For our experiments, we collected from the del.ico.us system |U| = 75, 242 users, |T | = 533, 191 tags and |R| = 3, 158, 297 resources, related by in total |Y | = 17, 362, 212 triples. 3 Association Rule Mining We assume here, that the reader is familiar with the basics of association rule mining introduced by Agrawal et al. (1993). As the work presented in this paper is on the conceptual rather than on the computational level, we refrain in particular from describing the vast area of developing efficient algorithms. Many of the existing algorithms can be found at the Frequent Itemset Mining Implementations Repository.13 Instead, we just recall the definition of the association rule mining problem, which was initially stated by Agrawal et al. (1993), in order to clarify the notations used in the following. We will not use 13 http://fimi.cs.helsinki.fi/ 4
the original terminology of Srikant et al, but rather exploit the vocabulary of Formal Concept Analysis(FCA)(Wille(1982)), as it better fits with the formal folksonomy model introduced in Definition 1.14 Definition 2. A formal contert is a dataset K: =(G, M, I) consisting of a set G of objects, a set M of attributes, and a binary relation ICGxM, where (9,m)∈ i is read as“ object g has attribute m In the usual basket analysis scenario, M is the set of items sold by a supermar- ket, G is the set of all transactions, and, for a given transaction g E G, the set g:=mE MI(g, m)E] contains all items bought in that transaction Definition 3. For a set X of attributes, we define A': =I9 EG Vm E X: (9, m)EI. The support of A is calculated by supp(A) Definition 4(Association Rule Mining Problem(Agrawal et al (1993). Let K be a formal context, and minsupp, minconf E [0, 1, called minimum support and minimum confidence thresholds, resp. The association rule mining problem consists now of determining all pairs A - B of subsets of M whose support supp(A -B): supp(A U B) is above the thresh- old minsupp, and whose confidence conf(A-B): =sspp is above the threshold minc As the rules A-B and A-B A carry the same information, and in particular have same support and same confidence, we will consider in this paper the additional constraint prevalent in the data mining community, that premise A and conclusion B are to be disjoint When comparing Definitions 1 and 2, we observe that association rules annot be mined directly on folksonomies, because of their triadic nature. One either has to define some kind of triadic association rules, or to transform the triadic folksonomy into a dyadic formal context. In this paper, we follow the latter approach 4 Projecting the Folksonomy onto two Dimensions As discussed in the previous section, we have to reduce the three-dimensional folksonomy to a two-dimensional formal context before we can apply any as sociation rule mining technique. Several such projections have already been introduced in Lehmann and Wille(1995 ). In Stumme(2005), we provide a more complete approach, which we will slightly adapt to the association rule For a detailed discussion about the role of FCa for association rule mining see In contrast, in FCA, one often requires A to be a subset of B, as this fits better with the notion of closed itemsets which arose of applying FCa to the association mining problem(Pasquier et al.(1999), Zaki and Hsiao(1999), Stumme(1999)) 5
the original terminology of Srikant et al, but rather exploit the vocabulary of Formal Concept Analysis (FCA) (Wille (1982)), as it better fits with the formal folksonomy model introduced in Definition 1.14 Definition 2. A formal context is a dataset K := (G, M, I) consisting of a set G of objects, a set M of attributes, and a binary relation I ⊆ G × M, where (g, m) ∈ I is read as “object g has attribute m”. In the usual basket analysis scenario, M is the set of items sold by a supermarket, G is the set of all transactions, and, for a given transaction g ∈ G, the set g I := {m ∈ M|(g, m) ∈ I} contains all items bought in that transaction. Definition 3. For a set X of attributes, we define A′ := {g ∈ G | ∀m ∈ X: (g, m) ∈ I}. The support of A is calculated by supp(A) := |A ′ | |G| . Definition 4 (Association Rule Mining Problem (Agrawal et al. (1993))). Let K be a formal context, and minsupp, minconf ∈ [0, 1], called minimum support and minimum confidence thresholds, resp. The association rule mining problem consists now of determining all pairs A → B of subsets of M whose support supp(A → B) := supp(A ∪ B) is above the threshold minsupp, and whose confidence conf(A → B) := supp(A∪B) supp(A) is above the threshold minconf. As the rules A → B and A → B \ A carry the same information, and in particular have same support and same confidence, we will consider in this paper the additional constraint prevalent in the data mining community, that premise A and conclusion B are to be disjoint.15 When comparing Definitions 1 and 2, we observe that association rules cannot be mined directly on folksonomies, because of their triadic nature. One either has to define some kind of triadic association rules, or to transform the triadic folksonomy into a dyadic formal context. In this paper, we follow the latter approach. 4 Projecting the Folksonomy onto two Dimensions As discussed in the previous section, we have to reduce the three-dimensional folksonomy to a two-dimensional formal context before we can apply any association rule mining technique. Several such projections have already been introduced in Lehmann and Wille (1995). In Stumme (2005), we provide a more complete approach, which we will slightly adapt to the association rule mining scenario. 14 For a detailed discussion about the role of FCA for association rule mining see (Stumme (2002)). 15 In contrast, in FCA, one often requires A to be a subset of B, as this fits better with the notion of closed itemsets which arose of applying FCA to the association mining problem (Pasquier et al. (1999), Zaki and Hsiao (1999), Stumme (1999)). 5