Hierarchical Bayesian Models for Collaborative Tagging Systems Markus Bundschus, Shipeng Yut, Volker Trespt, Achim Rettingers, Mathaeus Dejorit and Hans-Peter Kriegel* Institute for Computer Science, Ludwig-Maximilians-Universitat Miinchen, Oettingenstr: 67, 80538 Miinchen, germany CAD Knowledge Solutions, Siemens Medical Solutions, 5/ Valley Stream Parkway, Malvern, PA 19355, USA hipeng.yu@siemens.com Corporate Technology, Siemens AG, Otto-Hahn-Ring 6, 81739 Miinchen, Germany volkertresp@siemens.com INstitute for Computer Science(i7), Technische Universitat Minchen, Boltzmannstr. 3, 85748 Garching, germany Email: achim ettinger@cs. tum.edu iIntegrated Data Systems Dep, Siemens Corporate Research, 755 College Road East, Princeton, NJ08540, USA Emailmathaeus.dejori@siemens.com Abstract-Collaborative tagging systems with user generated information_retrieval, information-retrieval and IR). Also ontent have become a fundamental element of websites such as the meaning of a particular tag, such as to-read, might be Delicious, Flickr or CiteULike. By sharing comme n knowledge massively linked semantic data sets are generated that provide subjective to individuals and does not necessarily express ew challenges for data mining. In this paper, we reduce the the same shared semantic for the whole community. These ata complexity in these systems by finding meaningful topics aspects make the extraction of meaningful information from hat serve to group similar users and serve to recommend tags collaborative systems both challenging and rewarding or resources to users. We propose a well-founded probabilistic In this paper, we present a unified probabilistic frame- approach that can model every aspect of a collaborative work for collaborative tagging systems, which has a sound tagging system By integrating both user information and tag formation into the well-known Latent Dirichlet Allocation heoretical foundation in Hierarchical Bayesian Statistics ramework, the developed models can be used to solve a By extending one well established model for document number of important information extraction and retrieval collections, the Latent Dirichlet Allocation(LDA)model, tasks we are able to exploit the complete spectrum of infor- eywords-collaborative tagging; LDA; user modeling mation available in collaborative tagging systems. Hereby, all involved entities. i.e. the users. their resources and the L. INTRODUCTION assigned resource tags are modeled by a latent multinomial Collaborative knowledge platforms have recently emerged topic distribution. With this strategy, we map each entity into lar frameworks for sharing information between a common lower dimensional latent topic space and thus users with common interests. Some popular examples of are able to extract structure and drastically reduce the great such systems are Delicious, Cite ULike! or Flickr. a key variety of ambiguous information inherent in collaborative eature of these systems is that large numbers of users tagging systems. The here proposed models can be applied upload certain resources of interest and label them with naturally to various tasks. We present results for the ex personalized tags. The resources are in most cases some traction of statistical relationships between users, resources of high-dimensional data such as text documents or and tags. As a quantitative evaluation, we present results mages. Without further processing, those resources do not on assessing user similarities, a perplexity analysis on tag ontain any semantic information that is usable for auto- annotation quality, and results on personalized tag recor mated analysis. However, meaningful annotations adding mendation. In the latter case, we outperform several standard semantic to the raw resources are also given in the form tag recommendation algorithms. We train our models on a of user specified tags. In contrast to taxonomies, where raction of the CiteULike system. CiteULike is a system labels represent ordered predefined categories, no restrictions that allows researchers to manage their scientitic reference apply to tags, which are flat and chosen arbitrarily. These articles. It tries to help scientists to cope with the increasing free-form strings actually serve the purpose to organize the interdependent topical complexity of today's researchWhile resources of one single specific user Tags might be polys- in this work, we focus on collaborative tagging systems nous and different users use slightly different variations of based on text, the described models are general and could tags to express the same semantics(e. g. consider the tags handle various types of resources such as pictures as well The outline of the paper is as follows: In Section I Ihttp://www.citeulike.org we briefly summarize existing related work. Section Ill
Hierarchical Bayesian Models for Collaborative Tagging Systems Markus Bundschus∗ , Shipeng Yu† , Volker Tresp‡ , Achim Rettinger§ , Mathaeus Dejori¶ and Hans-Peter Kriegel∗ ∗ Institute for Computer Science, Ludwig-Maximilians-Universitat M ¨ unchen, Oettingenstr. 67, 80538 M ¨ unchen, Germany ¨ Email: {bundschu, kriegel}@dbs.ifi.lmu.de †CAD & Knowledge Solutions, Siemens Medical Solutions, 51 Valley Stream Parkway, Malvern, PA 19355, USA Email: shipeng.yu@siemens.com ‡Corporate Technology, Siemens AG, Otto-Hahn-Ring 6, 81739 Munchen, Germany ¨ Email: volker.tresp@siemens.com § Institute for Computer Science (i7), Technische Universitat M ¨ unchen, Boltzmannstr. 3, 85748 Garching, Germany ¨ Email: achim.rettinger@cs.tum.edu ¶Integrated Data Systems Dep., Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA Email: mathaeus.dejori@siemens.com Abstract—Collaborative tagging systems with user generated content have become a fundamental element of websites such as Delicious, Flickr or CiteULike. By sharing common knowledge, massively linked semantic data sets are generated that provide new challenges for data mining. In this paper, we reduce the data complexity in these systems by finding meaningful topics that serve to group similar users and serve to recommend tags or resources to users. We propose a well-founded probabilistic approach that can model every aspect of a collaborative tagging system. By integrating both user information and tag information into the well-known Latent Dirichlet Allocation framework, the developed models can be used to solve a number of important information extraction and retrieval tasks. Keywords-collaborative tagging; LDA; user modeling; I. INTRODUCTION Collaborative knowledge platforms have recently emerged as popular frameworks for sharing information between users with common interests. Some popular examples of such systems are Delicious, CiteULike1 or Flickr. A key feature of these systems is that large numbers of users upload certain resources of interest and label them with personalized tags. The resources are in most cases some type of high-dimensional data such as text documents or images. Without further processing, those resources do not contain any semantic information that is usable for automated analysis. However, meaningful annotations adding semantic to the raw resources are also given in the form of user specified tags. In contrast to taxonomies, where labels represent ordered predefined categories, no restrictions apply to tags, which are flat and chosen arbitrarily. These free-form strings actually serve the purpose to organize the resources of one single specific user. Tags might be polysemous and different users use slightly different variations of tags to express the same semantics (e. g. consider the tags 1http://www.citeulike.org/ information retrieval, information-retrieval and IR). Also the meaning of a particular tag, such as to read, might be subjective to individuals and does not necessarily express the same shared semantic for the whole community. These aspects make the extraction of meaningful information from collaborative systems both challenging and rewarding. In this paper, we present a unified probabilistic framework for collaborative tagging systems, which has a sound theoretical foundation in Hierarchical Bayesian Statistics. By extending one well established model for document collections, the Latent Dirichlet Allocation (LDA) model, we are able to exploit the complete spectrum of information available in collaborative tagging systems. Hereby, all involved entities, i. e. the users, their resources and the assigned resource tags are modeled by a latent multinomial topic distribution. With this strategy, we map each entity into a common lower dimensional latent topic space and thus are able to extract structure and drastically reduce the great variety of ambiguous information inherent in collaborative tagging systems. The here proposed models can be applied naturally to various tasks. We present results for the extraction of statistical relationships between users, resources and tags. As a quantitative evaluation, we present results on assessing user similarities, a perplexity analysis on tag annotation quality, and results on personalized tag recommendation. In the latter case, we outperform several standard tag recommendation algorithms. We train our models on a fraction of the CiteULike system. CiteULike is a system that allows researchers to manage their scientific reference articles. It tries to help scientists to cope with the increasing interdependent topical complexity of today’s research. While in this work, we focus on collaborative tagging systems based on text, the described models are general and could handle various types of resources such as pictures as well. The outline of the paper is as follows: In Section II we briefly summarize existing related work. Section III
introduces the hierarchical Bayesian models for collaborative Note that Tur C Tu S Tags and Tu represent the set of tagging systems and Section IV describes our experimental tags for a specific user u. A tag label l denotes a specific setup in detail and presents results. Concluding statements tag from Tu as well as an outlook for future work is given in Section V II. RELATED WOR B. Classical lda Collaborative Tagging Systems Research on collabora In LDA, the generation of a resource collection is modeled tive tagging systems, also referred to as folksonomies or as a three step process. First, for each resource r, a distri- social bookmarking systems, is a relatively new research bution over topics is sampled from a Dirichlet distribution area. Here, the most popular application is tag recommen- Second, for each word w in the resource r, a single topic dation. Some type of collaborative filtering techniques ar is chosen accordingly. Finally, a word is sampled from often applied to this problem [1] or some type of machine multinomial distribution over words specific to the samplec learning algorithm such as Support Vector Machines are used topic. The hierarchical Bayesian model shown in Figure for prediction of the most popular tags [2]. Typically, these I(left) depicts this generative process. a and B denote algorithms are applied on a"dense"fra action of resources and symmetric Dirichlet priors. e represents the resource specific tags, i.e. the resources and tags have to co-occur a sufficient multinomial distribution over T topics, each being drawn number of times. [1] presents an algorithm FolkRank, which independently from a. denotes the multinomial distribu- is based on the original Page Rank algorithm for ranking web tion over N vocabulary items for each of T topics being sites. So far, the recommendation algorithms exploit either drawn independently from B. For each of the Nr words w the provided information from the entire community or the in resource r, z denotes the topic assignment for w, drawn graph structure of the folksonomy to make predictions. In from 0. w is drawn from the topic distribution conditioned content-based recommendation algorithms, tags are derived from an analysis of the content of the resource. But tag recommendation is only one of many interesting tasks in C. Topic-Tag(TT) Model these complex systems. Information retrieval issues 3], the The first proposed model aims at exploiting the additional extraction of statistical relations between involved entities in information inherent in the user specified tags of collab he folksonomy and its mapping to taxonomies [4] as well orative tagging systems. It extends the LDA framework as knowledge acquisition [5] are also of particular interest. by simultaneously modeling the process of generating a Our contribution provides an integrated view on the just resource r and the process of indexing a resource with tags outlined work and applications. Since we define a unified In addition to the steps mentioned in the section above, two babilistic model for collaborative tagging systems, we further steps are introduced(see Figure 1(middle)). For each can apply our models in very different scenarios and tasks of the M, tags in the resource, a topic i is uniformly drawn ( see SectionⅣV). based on the topic assignments for each word in the resource Probabilistic Topic Models In this area, powerful tech- Finally, each tag label l is sampled from a multinomial niques such as Latent Dirichlet Allocation(LDA)[6] have distribution over tag labels specific to the sampled topic been proposed for the automated extraction of useful in- The topic assignment 2 is selected uniformly out off the formation from large document collections. In its classical assignments of topics from the N words in resource r application, LDA tries to find the underlying latent semantic Thus, we first sample an index i from Uniform(1, 2, .. N) properties in an unsupervised fashion. Depending on the and then use the topic assignment from the word with index i addressed generative process, various extensions of the LDa to sample a tag label L, from tag-topic distribution Tz. This framework have been proposed(see e. g.[7],[8],[9D). leads to a coupling between both generative components IL. MODELs Thus, the tags of the resource are conditioned on the factors which are present in r. This model captures the notion of first generating the content of the resource and than the tities in a social tagging system consist of finite sets of tags which annotate the resource. The principle idea of users U, resources R and Tags. Following the notion of [3], coupling e and T has previously been applied successfully a social tagging system or folksonomy F can be represented to modeling images and their captions [6]. There, this model outperformed several other generative annotation models. In this model .i denotes the vector of multinomial distribution () over M, tags for each of T topics being drawn independently where PCUXR X Tags denotes a ternary relation. Each from a symmetric Dirichlet prior ?. After the generation post p can be represented as a triple of words, a topic 2 is drawn from the resource specific distribution, and a label l is drawn from the 2 specific pciu, r, Tur :uEU, rE R, Tur E Tu.(2) distribution T. Instead of estimating the parameters directly
introduces the hierarchical Bayesian models for collaborative tagging systems and Section IV describes our experimental setup in detail and presents results. Concluding statements as well as an outlook for future work is given in Section V. II. RELATED WORK Collaborative Tagging Systems Research on collaborative tagging systems, also referred to as folksonomies or social bookmarking systems, is a relatively new research area. Here, the most popular application is tag recommendation. Some type of collaborative filtering techniques are often applied to this problem [1] or some type of machine learning algorithm such as Support Vector Machines are used for prediction of the most popular tags [2]. Typically, these algorithms are applied on a ”dense” fraction of resources and tags, i. e. the resources and tags have to co-occur a sufficient number of times. [1] presents an algorithm FolkRank, which is based on the original PageRank algorithm for ranking web sites. So far, the recommendation algorithms exploit either the provided information from the entire community or the graph structure of the folksonomy to make predictions. In content-based recommendation algorithms, tags are derived from an analysis of the content of the resource. But tag recommendation is only one of many interesting tasks in these complex systems. Information retrieval issues [3] , the extraction of statistical relations between involved entities in the folksonomy and its mapping to taxonomies [4] as well as knowledge acquisition [5] are also of particular interest. Our contribution provides an integrated view on the just outlined work and applications. Since we define a unified probabilistic model for collaborative tagging systems, we can apply our models in very different scenarios and tasks (see Section IV). Probabilistic Topic Models In this area, powerful techniques such as Latent Dirichlet Allocation (LDA) [6] have been proposed for the automated extraction of useful information from large document collections. In its classical application, LDA tries to find the underlying latent semantic properties in an unsupervised fashion. Depending on the addressed generative process, various extensions of the LDA framework have been proposed (see e. g. [7], [8], [9]). III. MODELS A. Terminology Entities in a social tagging system consist of finite sets of users U, resources R and T ags. Following the notion of [3], a social tagging system or folksonomy F can be represented as a four-tuple: F = hU, R, T ags, Pi, (1) where P ⊆ U × R × T ags denotes a ternary relation. Each post p can be represented as a triple: p ⊆ {hu, r, Turi : u ∈ U, r ∈ R, Tur ∈ Tu}. (2) Note that Tur ⊆ Tu ⊆ T ags and Tu represent the set of tags for a specific user u. A tag label l denotes a specific tag from Tu. B. Classical LDA In LDA, the generation of a resource collection is modeled as a three step process. First, for each resource r, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word w in the resource r, a single topic is chosen accordingly. Finally, a word is sampled from a multinomial distribution over words specific to the sampled topic. The hierarchical Bayesian model shown in Figure 1 (left) depicts this generative process. α and β denote symmetric Dirichlet priors. θ represents the resource specific multinomial distribution over T topics, each being drawn independently from α. Φ denotes the multinomial distribution over N vocabulary items for each of T topics being drawn independently from β. For each of the Nr words w in resource r, z denotes the topic assignment for w, drawn from θ. w is drawn from the topic distribution Φ conditioned on z. C. Topic-Tag (TT) Model The first proposed model aims at exploiting the additional information inherent in the user specified tags of collaborative tagging systems. It extends the LDA framework by simultaneously modeling the process of generating a resource r and the process of indexing a resource with tags. In addition to the steps mentioned in the section above, two further steps are introduced (see Figure 1 (middle)). For each of the Mr tags in the resource, a topic z˜ is uniformly drawn based on the topic assignments for each word in the resource. Finally, each tag label l is sampled from a multinomial distribution over tag labels specific to the sampled topic. The topic assignment z˜ is selected uniformly out off the assignments of topics from the Nr words in resource r. Thus, we first sample an index i from Uniform(1, 2, . . . , Nr) and then use the topic assignment from the word with index i to sample a tag label lj from tag-topic distribution Γz˜i . This leads to a coupling between both generative components. Thus, the tags of the resource are conditioned on the factors, which are present in r. This model captures the notion of first generating the content of the resource and than the tags which annotate the resource. The principle idea of coupling Θ and Γ has previously been applied successfully to modeling images and their captions [6]. There, this model outperformed several other generative annotation models. In this model, Γ denotes the vector of multinomial distribution over Mr tags for each of T topics being drawn independently from a symmetric Dirichlet prior γ. After the generation of words, a topic z˜ is drawn from the resource specific distribution, and a label l is drawn from the z˜ specific distribution Γ. Instead of estimating the parameters directly
1. Graphical models in plate notation with observed(gray circles) and latent variables(white circles). Left: standard LDA. Middle: Topic-Tag odel. Right: User-Topic- Tag (UTT) model Table I [8], we estimate 6 and I from the posterior distributions CORPUS STATISTICS CITEULIKE DATA SET D. User-Topic-Tag (UTT) Model 18.62818.628 In this section we introduce the most elaborate model Word tokens 14 489 1.161.794 for collaborative tagging systems. The UTT model adds an additional layer accounting for the most essential entity: the Users 1.393 18.628 user, who assigns one or more tags to web resources. Thi can be formalized by a two-step process where a user first ites an article based on his interests and afterwards assigns resource. Obviously, this is a simplifying modeling assump- tags based on the content of the resource. This process can be tion. However, this assumption yielded promising results modeled by a hierarchical generative model, in which each in the past when modeling authors and their interests (7 word w of a resource r is associated with two variables: a Furthermore, once we have trained a UTT model, we can user u and a latent topic variable t. We assume that each estimate the resource specific topic distribution based on a user is interested in several topics, thus each user has single user. This provides a personalized view on a resource multinomial distribution over topics. First, a user u is chose and results in a potential bette uniformly at random for each word of a certain resource. Section IV-B4) Hereby u is chosen from the users Ur, the users which cite IV EXPERIMENTS the resource r. Second, a topic is sampled for each word from the user specific topic distribution eu from user ur A. Experimental Setup chosen for that word. Third, for each of the tags associated Data Set: CiteULike provides data snapshots on their th the resource, a topic is uniformly drawn based on the webpage The data used in our experiments was from pic assignments for each word in the web page(Figure 1, November 13th 2008 right). This can be summarized as: Training Data: We selected a reasonable high number 1)For each user u=l... U choose Ou Dirichlet(a) of users (1393)and included articles that were cited by at For each topic t 1.. IT choose pe N Dirichlet(B) least three users. Word tokens from title and abstract were and Tt N Dirichlet(n) stemmed with a standard Porter stemmer and stop words 2) For each resource r= 1 and its given users Ur were removed. Word tokens and tags occurring less than five times were filtered out. Table I summarizes the corpus a)Sample a user zi Uniform(1,., Ur) statistics. The user ids, resource ids and tags are provided b)Given ai, sample a topic zi Mult(eu,) as supplementary data". All in all, this results in a total mple a word Mul(④) number of 64159 posts. Test Set for Tag Recommendation: The only restriction 3)For each tag label Lj,j=1. M, in resource r for the test set was that a resource had to be posted from a a)Sample an index i N Uniform(1,., N sly seen in the training set. the same b)Given topic ii, sample a tag label L, N Mult(r:) tags. We evaluate the models on a total of 15000 post bs Sampling average each user uses 32 tags. The maximum number of equations and provide them in an extended version of this tag labels for a specific user is 279. The average number of paper online- In the UTT model, the interest of the user tag assignments for a user is three is modeled by the assignments of users to words in the www.dbs.ifi.imu.De/-bundschu www.dbs.ifiImu.de/--bundschu/uttmodel_supplementary/info.html
Φ β T R α θ z w N r Φ l β T R M r α θ z w N r z Γ γ T Φ l β T R M r α θ U z w N r x z Γ γ T u r Figure 1. Graphical models in plate notation with observed (gray circles) and latent variables (white circles). Left: standard LDA. Middle: Topic-Tag (TT) model. Right: User-Topic-Tag (UTT) model [8], we estimate Φ, θ and Γ from the posterior distributions via Gibbs Sampling [7]. D. User-Topic-Tag (UTT) Model In this section we introduce the most elaborate model for collaborative tagging systems. The UTT model adds an additional layer accounting for the most essential entity: the user, who assigns one or more tags to web resources. This can be formalized by a two-step process where a user first cites an article based on his interests and afterwards assigns tags based on the content of the resource. This process can be modeled by a hierarchical generative model, in which each word w of a resource r is associated with two variables: a user u and a latent topic variable t. We assume that each user is interested in several topics, thus each user has a multinomial distribution over topics. First, a user u is chosen uniformly at random for each word of a certain resource. Hereby u is chosen from the users Ur, the users which cite the resource r. Second, a topic is sampled for each word from the user specific topic distribution Θu from user ux chosen for that word. Third, for each of the tags associated with the resource, a topic is uniformly drawn based on the topic assignments for each word in the web page (Figure 1, right). This can be summarized as: 1) For each user u = 1 . . . |U| choose Θu ∼ Dirichlet(α) For each topic t = 1 . . . |T| choose Φt ∼ Dirichlet(β) and Γt ∼ Dirichlet(γ) 2) For each resource r = 1 . . . |R| and its given users Ur For each word wi, i = 1 . . . Nr in resource r a) Sample a user xi ∼ Uniform(1, . . . , Ur) b) Given xi, sample a topic zi ∼ Mult(Θui ) c) Given zi, sample a word wi ∼ Mult(Φzi ) 3) For each tag label lj , j = 1 . . . Mr in resource r a) Sample an index i ∼ Uniform(1, . . . , Nr) b) Given topic z˜i, sample a tag label lj ∼ Mult(Γz˜i ) For the sake of brevity, we omit the Gibbs Sampling equations and provide them in an extended version of this paper online2 . In the UTT model, the interest of the user is modeled by the assignments of users to words in the 2www.dbs.ifi.lmu.de/∼bundschu Table I CORPUS STATISTICS CITEULIKE DATA SET Unique Total Resources 18.628 18.628 Word tokens 14.489 1.161.794 Tags 4.311 125.808 Users 1.393 18.628 resource. Obviously, this is a simplifying modeling assumption. However, this assumption yielded promising results in the past when modeling authors and their interests [7]. Furthermore, once we have trained a UTT model, we can estimate the resource specific topic distribution based on a single user. This provides a personalized view on a resource and results in a potential better tag recommendation (see Section IV-B4). IV. EXPERIMENTS A. Experimental Setup Data Set: CiteULike provides data snapshots on their webpage3 . The data used in our experiments was from November 13th 2008. Training Data: We selected a reasonable high number of users (1393) and included articles that were cited by at least three users. Word tokens from title and abstract were stemmed with a standard Porter stemmer and stop words were removed. Word tokens and tags occurring less than five times were filtered out. Table I summarizes the corpus statistics. The user id’s, resource id’s and tags are provided as supplementary data4 . All in all, this results in a total number of 64159 posts. Test Set for Tag Recommendation: The only restriction for the test set was that a resource had to be posted from a user previously seen in the training set. The same applies to tags. We evaluate the models on a total of 15000 posts. In average each user uses 32 tags. The maximum number of tag labels for a specific user is 279. The average number of tag assignments for a user is three. 3http://www.citeulike.org/faq/data.adp 4www.dbs.ifi.lmu.de/∼bundschu/UTTmodel supplementary/info.html
Training Details: Parameters were estimated by Table Ill ing samples from ten randomly-seeded runs, each runnin THREE MOST PROBABLE TOPICS FOR THE TAG SEMANTIC UTT ver 100 iterations, with an initial burn-in phase of 500 for the tt model and 1500 iterations for the utt model. we Topic description (word stems) found the number of burn-in iterations to be a convenient Topic 96(p=0. 85) semant ontolog annot know choice by observing a flattening of the log likelihood. Instead dg web commun support in- of estimating the hyperparameters a, B and we fix them to tegr describ xml 50/T, 0.001 and 1/M respectively in each of the experiments Topic 184(p=0.05) servic web (M represents the total number of tags). The format app Integr chosen according to [7]. We trained the topic models with a g standard interfac predefined number of topics ranging from T=200, T=300 and T=400 to show that the performance is not very sensitie Topic 84(p=0.03) search retriev inform relev ar rank feedback effect document this parameter as long as the number of topics is reasonabl high. In addition, models with t=10, t=50 and T=100 were trained for the perplexity evaluation in Section IV-B2 B. Results 1) Uncovering the Hidden Semantic Properties: Table Il illustrates two different topics(out of 200) from the corpus Note that the coupling between p(wlt)and p(taglt)is a property of the here proposed models and originates from -LDA the sampling of a topic for a specific tag based on the topic 2 asol assignments of the resource(see Section I). To show the -UTT model descriptive power of our learned model, we chose two topics i 8 is abo the science of networks, while topic 84 reflects a topic about information retrieval. All extracted 200 topics from the TT and UTT model are available as supplementary data Number of latent variables Information about p(ut) in CiteULike provides interest ing insights about the main research interests of users. The Figure 2. Tag perplexity on the test se most likely users given the topics for a UTT model with T=200 are again provided as supplementary data Other interesting statistical relationships that can be ex- parameters which differ dependent of the model(see Section tracted with the here derived models are e. g. the statistical IID). We partition the data set into disjoint training(90%)and relationships between tag labels and topics p(t)4). p(t(n) test sets(10%)and select for each resource in the test set a gives us a notion about the involvement of a tag in dif- subset of 50% of the tags for the evaluation. The remaining ferent topics. Table Ill shows one such example for the 50% of the tags are used by standard LDA to estimate e tag semantic. semantic is mostly discussed in the context since LDA is not able to model the dependency between of the traditional semantic web, but also in the context of the tokens in the resource and the tags. In contrast, the bioinformatics and web services. The third topic discusses TT and UTT model first estimate the resource specific e, the tag in the classical information retrieval domain which is estimated online via Gibbs Sampling respectively. 2)Tag Perplexity: In addition to the qualitative evaluation Afterwards the most likely tags are computed by T. All of the TT and UTT model shown above, we measure the perplexity values were computed by averaging over ten quality in peaking perplexity is the ability to predict tags on new held-out tags under the maximum likelihood estimates of unseen documents. Perplexity, a quantitative measure for each model for different values of T. Note that a lower comparing language models is widely used to compare the perplexity indicates a better annotation quality. We see that predictive performance of topic models(see e.g. [7) and is the two models. which include the resource tokens into defined over a test set as the computation of the likelihood clearly outperform the standard Lda model. as t increases. the utt model has Perp (Itest Dtrain )=exp a better perplexity than the TT model(with a crosspoint at T=100). With T=400 the perplexity of the TT model starts where ltest are the tags in the test set and ld represent the slightly to increase, while for the UTt model the perple tags in a certain test resource Dtrain represents the trained remains constant
Training Details: Parameters were estimated by averaging samples from ten randomly-seeded runs, each running over 100 iterations, with an initial burn-in phase of 500 for the TT model and 1500 iterations for the UTT model. We found the number of burn-in iterations to be a convenient choice by observing a flattening of the log likelihood. Instead of estimating the hyperparameters α, β and γ, we fix them to 50/T, 0.001 and 1/M respectively in each of the experiments (M represents the total number of tags). The values were chosen according to [7]. We trained the topic models with a predefined number of topics ranging from T=200, T=300 and T=400 to show that the performance is not very sensitive to this parameter as long as the number of topics is reasonably high. In addition, models with T=10 , T=50 and T=100 were trained for the perplexity evaluation in Section IV-B2. B. Results 1) Uncovering the Hidden Semantic Properties: Table II illustrates two different topics (out of 200) from the corpus. Note that the coupling between p(w|t) and p(tag|t) is a property of the here proposed models and originates from the sampling of a topic for a specific tag based on the topic assignments of the resource (see Section III). To show the descriptive power of our learned model, we chose two topics describing different aspects of CiteULike. Topic 18 is about the science of networks, while topic 84 reflects a topic about information retrieval. All extracted 200 topics from the TT and UTT model are available as supplementary data 4 . Information about p(u|t) in CiteULike provides interesting insights about the main research interests of users. The most likely users given the topics for a UTT model with T=200 are again provided as supplementary data4 . Other interesting statistical relationships that can be extracted with the here derived models are e. g. the statistical relationships between tag labels and topics p(t|l). p(t|l) gives us a notion about the involvement of a tag in different topics. Table III shows one such example for the tag semantic. semantic is mostly discussed in the context of the traditional semantic web, but also in the context of bioinformatics and web services. The third topic discusses the tag in the classical information retrieval domain. 2) Tag Perplexity: In addition to the qualitative evaluation of the TT and UTT model shown above, we measure the tag annotation quality in terms of perplexity. Intuitively speaking, perplexity is the ability to predict tags on new unseen documents. Perplexity, a quantitative measure for comparing language models is widely used to compare the predictive performance of topic models (see e. g. [7]) and is defined over a test set as: P erp.(ltest|Dtrain) = exp − PDtest i=1 log(p(ld|Dtrain)) PDtest i=1 Ld ! , where ltest are the tags in the test set and ld represent the tags in a certain test resource. Dtrain represents the trained Table III THREE MOST PROBABLE TOPICS FOR THE TAG SEMANTIC. UTT MODEL, T=200. Topic Topic description (word stems) Topic 96 (p=0.85) semant ontolog annot knowledg web commun support integr describ xml Topic 184 (p=0.05) servic web workflow bioinformat applic resourc integr manag standard interfac Topic 84 (p=0.03) search retriev inform relev ar rank feedback effect document 0 50 100 150 200 250 300 350 400 200 250 300 350 400 450 500 550 Number of latent variables Perplexity LDA TT model UTT model Figure 2. Tag perplexity on the test set. parameters which differ dependent of the model (see Section III). We partition the data set into disjoint training (90%)and test sets (10%) and select for each resource in the test set a subset of 50% of the tags for the evaluation. The remaining 50% of the tags are used by standard LDA to estimate Θ, since LDA is not able to model the dependency between the tokens in the resource and the tags. In contrast, the TT and UTT model first estimate the resource specific Θ, which is estimated online via Gibbs Sampling respectively. Afterwards the most likely tags are computed by Γ. All perplexity values were computed by averaging over ten different samples. Figure 2 plots the perplexity over the held-out tags under the maximum likelihood estimates of each model for different values of T. Note that a lower perplexity indicates a better annotation quality. We see that the two models, which include the resource tokens into the computation of the likelihood clearly outperform the standard LDA model. As T increases, the UTT model has a better perplexity than the TT model (with a crosspoint at T=100). With T=400 the perplexity of the TT model starts slightly to increase, while for the UTT model the perplexity remains constant
Table l SELECTED TOPICS LEARNED FROM THE UTT MODEL WITH T=200. FOR EACH TOPIC THE FIVE MOST PROBABLY WORDS AND TAGS ARE LISTED Topic 18 search 0.171ir retrie inform 0.023 complexity 0.069 retrieva 0.020 0.007 feed back 0.029 evaluation 0.041 Table iv NDCG EVALUATION FOR DIFFERENT NUMBER OF RECOMMENDATIONS NDCG @5@10@15al 0.2 e30.250.310.340.40 TT,T=2000.290.370.390.47 TT.T=300 290.370390.47 TT,T=4000.300.370.400.4 UTT. T: 0.400.50 UTT,T=3000.320.380410.51 UTT,T=4000340400430.52 (with n, the size of a group) Figure 3. divergence of this artificial group. This step was repeated indicate the true group divergence. ach group. The stars 1000 times. Afterwards these results are compared to the true group divergence. Figure 3 shows the corresponding boxplot for the 1000 samplings for each group. On each 3) Assessing User Similarity with the UTT Model: In box, the central red line is the median, the edges of the box order to test if the UTT model is able to identify similar are the 25th and 75th percentiles. The whiskers were chosen members of groups in the Citel n our data set which are such that all data points within +2.7o are considered not as users, we identified all users given Like system. CiteULike- outliers. The stars in the plot indicate the true divergence for belong to one research lab, like for instance the Carnegie the just mentioned percentiles. Furthermore, 20 out of 27 Mellon University Human Interaction Institute with a group groups are not within +2.7o. When using the document of 26 users. In our data set there are 488 users out of 1393 or tag distribution of a user as baseline to compute user which belong to a total of 524 groups(as of November 18, similarities, none of the 27 true group divergences fall out 2008). We excluded all groups with less than five members. of +2. 7o(Results not shown for the sake of brevity, but This resulted in a total of 27 groups with 160 users. 31 user available online") belong to more than one group and the maximum number 4) Personalized Tag Recommendation: We perform eval of groups for one user is five. We derive the similarity uation on a post basis, i. e. given an user u e U and a between users based on the learned user-topic distributions resource r e R, we want to predict a recommendation or eu. Since each user is represented as a multinomial over ranking of tags te lu the topics T, Jeffreys' J-divergence a symmetric version of Baselines: We follow the baseline methods of previous e Kullback-Leibler(KL)divergence, is used. Jeffreys'J- work on tag prediction [1], but in addition provide personal- divergence originates from information theory and is a ized versions. The TT and the UTT model are benchmarked method to compute the similarity between two probability against three standard tag recommendation methods distribu Most popular tags: Tags for a resource r are predicted Our assumption is that users that share the same group based on the relative frequency in the collaborative tag system( Baseline 1) other than users that are randomly chosen and considered Most popular tags with user restriction: Tags for a as an artificial group. Therefore, we repeated the following resource r are first ranked according to the relative procedure for each group: We randoml led n users quency and than are reduced to the set of tags Tu
Table II SELECTED TOPICS, LEARNED FROM THE UTT MODEL WITH T=200. FOR EACH TOPIC THE FIVE MOST PROBABLY WORDS AND TAGS ARE LISTED Topic 18 Word Prob. Tag Prob. network 0.51 network 0.36 connect 0.040 networks 0.266 complex 0.035 graph 0.009 structur 0.023 complexity 0.009 topolog 0.020 complex 0.007 Topic 84 Word Prob. Tag Prob. search 0.171 ir 0.21 retriev 0.125 search 0.059 inform 0.077 information-retrieval 0.054 relev 0.069 retrieval 0.042 feedback 0.029 evaluation 0.041 1 1.5 2 2.5 3 3.5 4 4.5 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Groups JJ−divergence Figure 3. Boxplot over 1000 random samplings for each group. The stars indicate the true group divergence. 3) Assessing User Similarity with the UTT Model: In order to test if the UTT model is able to identify similar users, we identified all users given in our data set which are members of groups in the CiteULike system. CiteULikegroups typically share similar research interests and often belong to one research lab, like for instance the Carnegie Mellon University Human Interaction Institute with a group of 26 users. In our data set there are 488 users out of 1393 which belong to a total of 524 groups (as of November 18, 2008). We excluded all groups with less than five members. This resulted in a total of 27 groups with 160 users. 31 user belong to more than one group and the maximum number of groups for one user is five. We derive the similarity between users based on the learned user-topic distributions Θu. Since each user is represented as a multinomial over the topics T, Jeffreys‘ J-divergence a symmetric version of the Kullback-Leibler (KL) divergence, is used. Jeffreys‘Jdivergence originates from information theory and is a method to compute the similarity between two probability distributions. Our assumption is that users that share the same group membership should be significantly more similar to each other than users that are randomly chosen and considered as an artificial group. Therefore, we repeated the following procedure for each group: We randomly sampled n users Table IV NDCG EVALUATION FOR DIFFERENT NUMBER OF RECOMMENDATIONS. NDCG @5 @10 @15 all Baseline 1 0.04 0.05 0.06 0.19 Baseline 2 0.14 0.20 0.23 0.37 Baseline 3 0.25 0.31 0.34 0.40 TT, T=200 0.29 0.37 0.39 0.47 TT, T=300 0.29 0.37 0.39 0.47 TT, T=400 0.30 0.37 0.40 0.49 UTT, T=200 0.31 0.37 0.40 0.50 UTT, T=300 0.32 0.38 0.41 0.51 UTT, T=400 0.34 0.40 0.43 0.52 (with n, the size of a group) and computed the mean divergence of this artificial group. This step was repeated 1000 times. Afterwards these results are compared to the true group divergence. Figure 3 shows the corresponding boxplot for the 1000 samplings for each group. On each box, the central red line is the median, the edges of the box are the 25th and 75th percentiles. The whiskers were chosen such that all data points within ±2.7σ are considered not as outliers. The stars in the plot indicate the true divergence for each group. All true group divergences fall clearly below the just mentioned percentiles. Furthermore, 20 out of 27 groups are not within ±2.7σ. When using the document or tag distribution of a user as baseline to compute user similarities, none of the 27 true group divergences fall out of ±2.7σ (Results not shown for the sake of brevity, but available online4 ). 4) Personalized Tag Recommendation: We perform evaluation on a post basis, i. e. given an user u ∈ U and a resource r ∈ R, we want to predict a recommendation or ranking of tags t ∈ Tu. Baselines: We follow the baseline methods of previous work on tag prediction [1], but in addition provide personalized versions. The TT and the UTT model are benchmarked against three standard tag recommendation methods. • Most popular tags: Tags for a resource r are predicted based on the relative frequency in the collaborative tag system. (Baseline 1) • Most popular tags with user restriction: Tags for a resource r are first ranked according to the relative frequency and than are reduced to the set of tags Tu