Tracking Usage in Collaborative Tagging Communities

Elizeu Santos-Neto
Electrical & Computer Engineering
University of British Columbia
2332 Mail Mall – KAIS 4075
+1.604.827.4270
elizeus@ece.ubc.ca

Matei Ripeanu
Electrical & Computer Engineering
University of British Columbia
2356 Mail Mall – KAIS 4033
+1.604.822.7281
matei@ece.ubc.ca

Adriana Iamnitchi
Computer Science and Engineering
University of South Florida
4202 E. Fowler Ave, Tampa, FL
+1.813.974.5357
anda@cse.usf.edu

ABSTRACT
Collaborative tagging has recently attracted the attention of both industry and academia due to the popularity of content-sharing systems such as CiteULike, del.icio.us, and Flickr. These systems give users the opportunity to add data items and to attach their own metadata (i.e., tags) to stored data. The result is an effective content management tool for individual users. Recent studies, however, suggest that, as tagging communities grow, the added content and the metadata become harder to manage due to increased content diversity. Thus, mechanisms that cope with increasing diversity are fundamental to improving the scalability and usability of collaborative tagging systems. This paper analyzes whether usage patterns can be harnessed to improve navigability in a growing knowledge space. To this end, it presents a characterization of two collaborative tagging communities that target the management of scientific literature: CiteULike and Bibsonomy. We explore three main directions. First, we analyze the tagging activity distribution across the user population. Second, we define new metrics for similarity in user interest and use these metrics to uncover the structure of the tagging communities we study. The properties of the structure we uncover suggest a clear segmentation of interests into a large number of individuals with unique preferences and a core set of users with interspersed interests.
Finally, we offer preliminary results suggesting that the interest-based structure of the tagging community can be used to facilitate content retrieval and navigation as communities scale.

Categories and Subject Descriptors
H.1.1 [General]: Systems and Information Theory - Information Theory. H.3.5 [Information Storage and Retrieval]: On-line Information Services - Web-based services.

General Terms
Measurement, Experimentation, Human Factors, Theory.

Keywords
Collaborative Tagging, Usage Patterns, Modeling User Attention, CiteULike, Bibsonomy.

1. INTRODUCTION
Collaborative tagging systems are online communities that allow users to assign terms from an uncontrolled vocabulary (i.e., tags) to items of interest. This simple tagging feature proves to be a powerful mechanism for personal knowledge management (e.g., in systems like CiteULike [3]) and content sharing (e.g., in communities such as Flickr [25]). Recently, collaborative tagging systems have attracted massive user communities: Novak et al. [16] report that in January 2006 Flickr had gathered about one million users. Similarly, del.icio.us reached one million users in September 2006 [26].

Although collaborative tagging is attracting increasing attention from both industry and academia, there are few studies that assess the characteristics of communities of users who share and tag content. In particular, little research has been done on the potential benefits of tracking usage patterns in collaborative tagging communities. Moreover, recent investigations have shown that, as the user population grows, the efficiency of information retrieval based on user-generated tags tends to decrease [2].

Mining usage patterns is an efficient method to improve the quality of service provided by information retrieval mechanisms in the web context.
For example, usage patterns can be harnessed to improve the browsing experience via recommendation systems or to predict buying patterns and consequently increase the revenue of e-commerce operations [8][9][10][17][18][19][20].

This work is motivated by the following conjecture: usage patterns can be harnessed to present relevant, contextualized information and to deal with the reduced navigability caused by information overload in large tagging communities.

We present encouraging preliminary steps to substantiate this conjecture. We characterize two collaborative tagging systems, CiteULike [3] and Bibsonomy [4], as a first step towards a model that represents user interests based on tagging activity. After introducing related work (Section 2), we present a formal definition for tagging communities (Section 3) and describe the communities and the data sets this study explores (Section 4). We then characterize the distribution of tagging activity among users (Section 5) and investigate the structure of users' shared interests (Section 6). Finally, we present preliminary results on the efficacy of using contextualized attention based on the structure of shared interests to improve navigability in the system (Section 7). Section 8 summarizes our findings and outlines future research directions.

2. RELATED WORK
Two types of techniques, implicit and explicit, are traditionally used to elicit user preferences in the Web context [1][6][15]. Explicit techniques are based on direct input from a user with respect to her preferences and interests (e.g., page rating scales, item reviews, categories of interest). Implicit techniques infer a definition of user interests from her activity, e.g., using client-side or server-side mechanisms such as browser plug-ins, client extensions, and server-side logs to track usage patterns.
Clearly, each technique has its own advantages and limitations in terms of accuracy, cost to the user, privacy control, or ability to adapt to changes in user interest trends.

In a tagging community context, the tags themselves can be interpreted as explicit metadata added by each user. Additionally, observed tagging activity, including the volume and frequency with which items are added, the number of tagged items, or the tag vocabulary size, can be harnessed to extract implicit information.

Because collaborative tagging systems are young, relatively little work has been done on tracking usage and exploring contextualized user attention in these communities. However, several studies present techniques and models for collecting and managing user attention metadata in the wider web context without exploring tagging features [1][6][15]. These techniques include post-processing of usage logs, tracking user input (e.g., search terms), and eliciting explicit user preferences. Other investigations are concerned with methods that use contextualized attention to improve web search [1][15].

As a first step to modeling user attention in tagging communities, it is necessary to characterize collaborative tagging behavior. In this respect, Golder and Huberman [5] study user activity patterns regarding system utilization and tag usage in del.icio.us, a social bookmarking tool that allows users to share and tag URLs. First, they observe a low correlation between the number of items in each user's bookmark list and the number of tags used by each user. Next, they discuss the models that could explain this lack of correlation and suggest it is an effect of shared knowledge and imitation in associating tags. Finally, the authors suggest that the urn model proposed by Eggenberger and Pólya [14] is an appropriate model to derive the evolution of tag usage frequency on a particular item.

The urn model can be formulated as follows: consider an urn that contains two balls of different colors. Iteratively, a ball is drawn at random from the urn and returned to the urn together with a new ball of the same color.
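
The urn process just described is easy to simulate. The following is a minimal sketch, not code from the study: the function name `polya_urn`, the `seed` parameter, and the starting state of exactly one ball per color are assumptions made for illustration.

```python
import random


def polya_urn(steps, seed=None):
    """Simulate the urn process and return the final ball counts per color.

    Assumption: the urn starts with one ball of each of two colors.
    """
    rng = random.Random(seed)  # private generator so runs are reproducible
    counts = [1, 1]            # number of balls of color 0 and color 1
    for _ in range(steps):
        total = counts[0] + counts[1]
        # Draw one ball uniformly at random: color 0 with probability counts[0] / total.
        drawn = 0 if rng.random() * total < counts[0] else 1
        # Return the drawn ball together with one new ball of the same color.
        counts[drawn] += 1
    return counts


if __name__ == "__main__":
    # Each seed stands for one restart of the process; print the final
    # fraction of color-0 balls for a few independent runs.
    for seed in range(3):
        c = polya_urn(10_000, seed=seed)
        print(seed, c[0] / sum(c))
```

Running the script prints the final color fraction for a few independent runs, which makes it possible to compare the outcome of separate restarts of the process.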
If this process is repeated a number of times, the fraction of balls of a particular color stabilizes. The interesting aspect of this model is that, if the process is restarted, this fraction converges to a different number. Golder and Huberman argue that this model captures the evolution of tag proportions observed in the del.icio.us data set. In studies related to contextualized user attention, this model may be valuable for predicting future tag assignments, which can be a useful input to recommendation mechanisms. Golder and Huberman's study, however, is limited in scale: their results on tagging behavior dynamics rely on only four days of tracked activity.

Other authors follow different approaches to investigate the characteristics of tagging systems. Schmitz [10][11] studies structural properties of del.icio.us and Bibsonomy, uses a tripartite hypergraph representation, and adapts the small-world pattern definitions to this representation. Cattuto et al. [12] model usage behavior via unipartite projections from a tripartite graph. Our approach differs from these studies in terms of scale and in the use of dynamic metrics to define shared user interest: we define metrics that scale as the community grows and/or user activity increases (Section 6).

By analyzing del.icio.us, Chi and Mytkowicz [2] find that the efficiency of social tagging decreases as communities grow: tags become less and less descriptive, and consequently it becomes harder to find a particular item using them. Simultaneously, it becomes harder to find tags that efficiently mark an item for future retrieval. These results indicate that, to facilitate browsing through tagging systems, it is increasingly important to take into account user attention in terms of observed tagging activity.

Niwa et al. [17] propose a recommendation system based on the affinity between users and tags, and on the explicit site preferences expressed by the user.
Our study differs from this work in that we use implicit user profiles and propose the use of entropy as a metric to characterize their effectiveness.

Outside the academic area, a number of projects explore the use of implicitly gathered user information. We mention Google's initiative to use users' past search history to refine the results provided by PageRank [8][9]. Commercial interest in contextualized user attention highlights that tracking user attention and characterizing collective online behavior is not only an intriguing research topic, but also a potentially attractive business opportunity.

3. BACKGROUND
A collaborative tagging community allows users to tag items via a web site. Users interact with the web site by searching for items, adding new items to the community, or tagging existing items. The tagging action performed by a user is generally referred to as a tag assignment.

For example, in CiteULike and Bibsonomy, each user has a library, i.e., a set of links to scientific publications and books. Each item in the library is associated with a set of terms (tags) assigned by users. It is important to highlight that, in both CiteULike and Bibsonomy, the process of assigning tags to items is collaborative, in the sense that all users can inspect other users' libraries and assigned tags. Users can thus repeat tags used by others to mark a particular item. This is unlike other communities (e.g., Flickr) where each user has fine-grained access control to define who has permission to see the content and apply tags to it.

In CiteULike and Bibsonomy users have two options to add items to their libraries:

1. Browse the content of popular scientific literature portals (e.g., ACM Portal, IEEE Xplore, arXiv.org) and add publications to their own library, and
2. Search for items present in other users' libraries and add them to their own library.

While posting an item, a user can mark it with terms (i.e., tags) that can be used for future retrieval.
The collaborative nature of tagging relies on the fact that users potentially share interests and use similar items and tags. Thus, while the tagging activity of one user may be self-centered, the set of tags used may help other users find content of interest.

We represent a collaborative tagging community by the tuple C = (U, I, T, A), where U is the set of users, I is the set of items, T is the tagging vocabulary, and A is the set of tag assignments. The set of tag assignments is denoted by A = {(u, t, p) | u ∈ U, t ∈ T, p ∈ I}. From this definition of tag assignments, we can derive the definition of an individual user, item and tag, as follows:

A user u_k ∈ U is denoted by a pair u_k = (I_k, T_k), where I_k is the set of items user u_k has ever tagged. Thus, an item p ∈ I_k if and only if there exists a tag t ∈ T_k such that (u_k, t, p) ∈ A. Similarly, T_k is the set of tags user u_k has applied, where t ∈ T_k if and only if there exists an item p ∈ I_k such that (u_k, t, p) ∈ A.

An item p_i ∈ I is denoted by p_i = (U_i, T_i), where U_i is the set of users who tagged this item, and T_i is the set of tags this item has received.

A tag t_j ∈ T is denoted by t_j = (U_j, I_j), where U_j is the set of users who have used the tag t_j, and I_j is the set of items annotated with the tag t_j.

4. DATA SETS AND DATA CLEANING
Both tagging communities we analyze, CiteULike [3] and Bibsonomy [4], aim to improve users' organization and management of research publications. Both provide functionality to import and export citation records in formats such as BibTeX.

The data sets analyzed in this article were provided by the administrators of the respective web sites. Thus, the data represent a global snapshot of each system within the period determined by the timestamps in the traces we have obtained (Table 1). It is important to point out that the Bibsonomy data set has timestamps starting in 1995, which we considered a bug. Moreover, Bibsonomy has two separate data sets, scientific literature and URL bookmarks. We concentrated our analysis on the scientific literature part of the data.

In the original CiteULike data set, the most popular tag is “bibtex-import”, while the second most popular tag is “no-tag”, which is automatically assigned when a user does not assign any tag to a new item. The popularity of these two tags indicates that a large share of users use CiteULike as a tool to convert their list of citations to BibTeX format, and that users tend not to tag items at the time they post a new item to their individual libraries. Clearly, this is relevant information for system designers who might want to invest effort in improving the features of most interest.
Also, in CiteULike one user posted and tagged more than 3,000 items within approximately 5 minutes (according to the timestamps in the data set). Obviously, this behavior is due to an automatic mechanism.

Our objective is to concentrate only on those users who are using the system interactively to bookmark and share articles. Consequently, for the analysis that follows, we have removed the “robot” user (i.e., a user with 3,000 items tagged within 5 minutes) and users who used only the tags “bibtex-import” and/or “no-tag”. The total number of users removed from CiteULike represents approximately 14% of the original data set, while the users removed from Bibsonomy are around 0.6% of the original data set. Table 1 summarizes the characteristics of each data set after the data cleaning operation.

Table 1: Summary of cleaned data sets used in this study

                       CiteULike          Bibsonomy
Period                 11/2004–04/2006    ??–12/2006
# Users (|U|)          5,954              656
# Items (|I|)          199,512            67,034
# Tags (|T|)           51,079             21,221
# Assignments (|A|)    451,980            257,261

5. TAGGING ACTIVITY
To gain an understanding of the usage patterns in these two communities, we start by evaluating the activity levels along several metrics: the number of items per user, the number of tag assignments performed, and the number of tags used. The question answered in this section is the following:

Q1: How is the tagging activity distributed among users?

We aim to quantify the volume of user interaction with the system, either by adding new content to the community, or by tagging an existing item. Intuitively, one would expect that a few users are very active while the majority rarely interacts with the community. Determining how often users perform tag assignments is important for designing systems that track user attention.
For example, in a context where activity information is used to recommend new items based on tag similarity, it may be necessary to recompute similarity levels at the same rate at which new information is added to the system.

Figure 1 presents the user rank according to the number of tag assignments performed during the time frame of our data set. In the results that follow, we present the observed data points together with a curve that provides a good model for the observed data (i.e., the Hoerl function [21]). At the end of this section we comment further on the characteristics of this curve.

Figure 1: User rank based on the number of tag assignments. Note the logarithmic scales on both axes.

A second metric for tagging activity is the size of user libraries. Figure 2 plots the library size for users ranked in decreasing order of library size, for CiteULike and Bibsonomy, respectively. This shows the size of the set of items a particular user pays attention to. The results confirm that the users in these two systems are heterogeneous in terms of activity intensity, as already indicated by the tag assignment activity.

Figure 2: User rank based on the library size
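The three activity metrics used in this section (tag assignments, library size |I_k|, and vocabulary size |T_k|) can be derived directly from the set of assignments A. The following is a minimal sketch, assuming each assignment is a (user, tag, item) triple; the function names and the toy data are illustrative and do not come from the CiteULike or Bibsonomy traces.

```python
from collections import defaultdict


def activity_per_user(assignments):
    """Return {user: (n_assignments, library_size, vocabulary_size)}."""
    counts = defaultdict(int)
    items = defaultdict(set)  # I_k: items tagged by each user
    tags = defaultdict(set)   # T_k: tags applied by each user
    for user, tag, item in assignments:
        counts[user] += 1
        items[user].add(item)
        tags[user].add(tag)
    return {u: (counts[u], len(items[u]), len(tags[u])) for u in counts}


def rank_users(activity, metric=0):
    """Sort users by decreasing metric (0: assignments, 1: library, 2: vocabulary).

    Ties are broken by user name so that the ranking is deterministic.
    """
    return sorted(activity, key=lambda u: (-activity[u][metric], u))


# Toy assignment set A (hypothetical data, for illustration only).
A = [
    ("u1", "p2p", "paper1"),
    ("u1", "grid", "paper1"),
    ("u1", "p2p", "paper2"),
    ("u2", "tagging", "paper3"),
    ("u3", "p2p", "paper1"),
    ("u3", "web", "paper4"),
    ("u3", "web", "paper5"),
    ("u3", "search", "paper5"),
]

activity = activity_per_user(A)
ranking = rank_users(activity, metric=0)
```

Plotting the chosen metric against the position in `ranking` on logarithmic axes would give a rank plot of the kind shown in Figures 1 and 2.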
The correlation between a user’s library size and her vocabulary size is important for understanding whether the diversity of the vocabulary used by each user grows with the number of items in her personal library. We observe that users’ library and vocabulary sizes are strongly correlated for CiteULike (R^2 = 0.98, n = 5954) and less strongly, but still positively, correlated for Bibsonomy (R^2 = 0.80, n = 654). Although such a correlation may seem intuitive, since users with a more diverse set of items would need more tags to describe them, this behavior is different from that observed by Golder and Huberman in del.icio.us [5]. A possible explanation is that a del.icio.us user is presented with tag suggestions based on past tagging activity when adding a new bookmark. These suggestions may bias and limit the size of a user vocabulary. However, further investigation is necessary to assess how a user vocabulary is affected by tag recommendation.

Figure 3: User rank by vocabulary size

A second finding is that the tagging activity (i.e., number of tagging assignments) and library size per user are strongly correlated for both communities (with R^2 above 0.97), while the correlation between the tagging activity and the vocabulary size is strong for CiteULike (R^2 = 0.99) but weaker for Bibsonomy (R^2 = 0.67).

A third finding is that tagging activity distributions are not well modeled by a Zipf-like distribution. Instead, a Hoerl model [21], which extends the power-law family and is defined by Equation (1), fits better:

$$f(x) = a \, b^{x} \, x^{c} \qquad (1)$$

Table 2 contains the Hoerl parameters a, b, and c determined via a curve fitting process for each of the ranking distributions observed.
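The text does not state which curve fitting procedure was used. One simple possibility, shown here only as a sketch under that assumption, is ordinary least squares in log space: taking logarithms of Equation (1) gives ln f(x) = ln a + x ln b + c ln x, which is linear in the unknowns ln a, ln b, and c. The helper names below are hypothetical, and the sample curve is synthetic: it is generated from the CiteULike tag-assignment coefficients reported in Table 2, not from the original traces.

```python
import math


def hoerl(x, a, b, c):
    """Equation (1): f(x) = a * b**x * x**c."""
    return a * (b ** x) * (x ** c)


def solve_linear(m, v):
    """Solve m @ z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def fit_hoerl(xs, ys):
    """Least-squares fit of Equation (1) in log space; needs x > 0 and y > 0."""
    rows = [[1.0, x, math.log(x)] for x in xs]
    logs = [math.log(y) for y in ys]
    # Normal equations (X^T X) beta = X^T y for beta = (ln a, ln b, c).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * ly for r, ly in zip(rows, logs)) for i in range(3)]
    ln_a, ln_b, c = solve_linear(xtx, xty)
    return math.exp(ln_a), math.exp(ln_b), c


# Synthetic, noise-free rank curve used only to check the fit.
ranks = list(range(1, 201))
values = [hoerl(x, 9767.13, 0.9979, -0.4754) for x in ranks]
a, b, c = fit_hoerl(ranks, values)
```

A fit in log space weights relative rather than absolute errors, so on noisy data it may yield coefficients that differ from those of a nonlinear fit performed on the untransformed values.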
Table 2: Coefficients determined for the Hoerl function

Community   Distribution      a           b        c
CiteULike   Tag Assignments    9,767.13   0.9979   -0.4754
CiteULike   Library Size       2,609.77   0.9988   -0.4772
CiteULike   Vocabulary Size    3,338.55   0.9992   -0.5964
Bibsonomy   Tag Assignments   28,969.29   0.9864   -0.6888
Bibsonomy   Library Size       6,137.49   0.9850   -0.5461
Bibsonomy   Vocabulary Size    2,608.45   0.9907   -0.5126

Similar to the Zipf distribution, the Hoerl function has been used to model a large number of natural phenomena. The most relevant to collaborative tagging is the use of the Hoerl function to describe the distribution of bio-diversity across a geographic region [22][24]. Considering each user's library a region in a collaborative tagging community, one may draw a comparison between the potential diversity found in a user's library, as a function of the number of items in it, and the bio-diversity distribution across geographic regions.

Although a Hoerl function is a good fit for the activity distributions, this does not directly imply that the diversity of user libraries or vocabularies is a phenomenon similar to those described by studies on bio-diversity. Nevertheless, the Hoerl function does provide a good model for collaborative tagging activity, and it may be useful for studying user diversity in collaborative tagging systems in the future.

To summarize: in the communities we study, the intensity of user activity is distributed over multiple orders of magnitude, it is well modeled by the Hoerl function and, unlike in other communities, there is a strong correlation between item set and vocabulary sizes.

6. EVALUATING USER SIMILARITY

While the analysis above is important for an overall usage profile evaluation of each community, it provides little information about user interests. Assessing the commonality in user interests is important for identifying user groups that may form around content of common interest.
Thus, a natural set of questions that we aim to answer in this section is:

Q2: Is the tagging community segmented into several subcommunities with different interests? Do users cluster around particular items and tags?

To address these questions, we define the interest-sharing graph after the intuition of data-sharing graphs introduced by Iamnitchi et al. [27]. An interest-sharing graph captures the commonality in user interest for an entire user population. Intuitively, users are connected in the interest-sharing graph if they focus on the same subset of items and/or speak a similar language (i.e., share a subset of tags).

More formally, consider a graph G = (U, E) where nodes are users and edges represent the existence of shared interests or activity similarity between users. The rest of this study explores three possible definitions for user interest or activity similarity. All these definitions employ a threshold t for the percentage of items or tags shared between two users:

1) The User-Item similarity definition considers two users’ interests similar if the ratio between the size of the intersection and the size of the union of their item libraries is larger than a threshold t. This is expressed by Equation 2.

$$e_{kj} \in E_{\text{User-Item}} \iff \frac{|I_k \cap I_j|}{|I_k \cup I_j|} > t \qquad (2)$$

2) The User-Tag definition is similar to the definition above but considers the vocabularies of the two users rather than their libraries.

$$e_{kj} \in E_{\text{User-Tag}} \iff \frac{|T_k \cap T_j|}{|T_k \cup T_j|} > t \qquad (3)$$
3) Unlike the User-Item definition in Equation 2 above, the Directed-User-Item definition considers two users’ interests similar if the ratio between the size of the intersection of their item libraries and the size of one user's library is larger than a threshold t. The idea is to explore the role played by users with large libraries by introducing direction to the edges in the graph.

$$e_{kj} \in E_{\text{Directed-User-Item}} \iff \frac{|I_k \cap I_j|}{|I_k|} > t \qquad (4)$$

In our analysis of real tag assignment traces from the two tagging communities, even with low values for the sharing ratio threshold t, the final graph contains a large number of isolated nodes. Indeed, by setting the threshold as low as one single item (i.e., two users are connected if they share at least one item), we find that, in CiteULike, 2,672 users (44.87%) are not connected to any other user. This suggests that a large population of users has individual preferences.

Figure 4 presents, for the three similarity metrics defined above, the number of connected components for both CiteULike and Bibsonomy, for thresholds t varying from 1% to 99%. These results show that, regardless of the graph definition, the number of connected components follows a similar trend as the threshold increases. (Note that we exclude isolated nodes from this count of connected graph components.)

The plots in Figure 4 show that the number of connected components increases up to a certain value of our similarity threshold. After a certain value of t, the number of connected components in the graph starts decreasing, since more and more connected components will contain only one node and will thus be excluded. The critical threshold value is different for each user similarity definition. The initial increase in the number of connected components can be explained by the fact that, as the threshold increases, large components split to form new islands.
Because these islands form naturally based on user similarity, this result is encouraging: it offers the potential to cluster users according to their interests. As t continues to increase, the definition of similarity becomes too strict and leads to more and more isolated nodes.

Two observations about the results in Figure 4 can be noted: first, as the threshold increases, the number of components decreases faster for the User-Item graph (Equation 2) than for the Directed-User-Item graph (Equation 4). This illustrates the effect of using an asymmetrical definition for shared interests. The idea explored

Figure 4: Number of connected components for CiteULike (top) and Bibsonomy (bottom)

Figure 5: Total number of nodes in the interest sharing graph and in the largest component for CiteULike (top) and Bibsonomy (bottom)
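The graph definitions in Equations 2 to 4 and the component counts reported in Figure 4 can be illustrated with a small sketch. It assumes that libraries are given as a mapping from user to item set, that isolated nodes are excluded from the count as stated in the text, and that directed edges are treated as undirected when counting components (the text does not say how connectivity is defined for the directed graph, so weak connectivity is an assumption). All names and the toy data are hypothetical.

```python
from itertools import combinations


def jaccard(s1, s2):
    """|s1 ∩ s2| / |s1 ∪ s2|, or 0.0 when both sets are empty."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def undirected_edges(sets, t):
    """Edges per Equation 2 (item libraries) or Equation 3 (vocabularies)."""
    return [(k, j) for k, j in combinations(sorted(sets), 2)
            if jaccard(sets[k], sets[j]) > t]


def directed_edges(libraries, t):
    """Edges k -> j per Equation 4: |I_k ∩ I_j| / |I_k| > t."""
    users = sorted(libraries)
    return [(k, j) for k in users for j in users
            if k != j and libraries[k]
            and len(libraries[k] & libraries[j]) / len(libraries[k]) > t]


def count_components(nodes, edges):
    """Number of connected components with at least two nodes (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for k, j in edges:
        parent[find(k)] = find(j)
    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for s in sizes.values() if s > 1)


# Toy item libraries I_k (hypothetical data, for illustration only).
libraries = {
    "u1": {"p1", "p2", "p3", "p4"},
    "u2": {"p1", "p2", "p3"},
    "u3": {"p3", "p5"},
    "u4": {"p5"},
    "u5": {"p9"},
}

thresholds = [0.1, 0.3, 0.6, 0.8]
counts = [count_components(libraries, undirected_edges(libraries, t))
          for t in thresholds]
directed = directed_edges(libraries, 0.6)
directed_count = count_components(libraries, directed)
```

On this toy input the number of non-trivial components first rises and then falls as t grows, which mirrors the trend described for Figure 4. At t = 0.6 the directed definition still keeps an edge from the small library of u4 to the larger library of u3, so it retains more components than the symmetric definition at the same threshold.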