In this stage, we collected the social tags that are potentially relevant for describing the input bookmarked document, based on a set of related bookmarks. We assigned a weight to each tag capturing the strength of its contribution to the bookmark description. However, we realised that this measure is not sufficient for tag recommendation purposes, and that global metrics of the folksonomy graph, such as tag popularities and tag correlations, have to be taken into consideration.

4.4 Building the global social tag co-occurrence sub-graph

In the fourth stage (label 4 in Figure 1), we interconnect the social tags obtained in the previous stage through the co-occurrence values of each pair of tags. The co-occurrence of two tags $t_i$ and $t_j$ is usually defined in terms of the number of resources (bookmarks) that have been tagged with both $t_i$ and $t_j$. In this work, we make use of the asymmetric co-occurrence metric:

$$co(t_i, t_j) = \frac{\#\{n : t_i \in tags(b_n) \wedge t_j \in tags(b_n)\}}{\#\{n : t_i \in tags(b_n)\}},$$

which assigns different values to $co(t_i, t_j)$ and $co(t_j, t_i)$ by dividing the number of resources tagged with both tags by the number of resources tagged with the first one, $t_i$.

Computing the co-occurrence values for each pair of tags existing in a training dataset, we build a global graph where the vertices correspond to the available tags, and the edges link tags that co-occur within at least one resource. This graph is directed and weighted: each pair of co-occurring tags is linked by two edges whose weights are the asymmetric co-occurrence values of the tags.

We propose to exploit this global graph to interconnect the tags obtained in the previous stage, and to extract the ones that are most related to the input bookmark. Specifically, we create a sub-graph whose vertices are the above tags, and whose edges are the ones these tags have in the global co-occurrence graph.
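The construction of the global graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cooccurrence_graph`, the dictionary-based edge representation and the toy bookmarks are assumptions made for the example. Each directed edge (t_i, t_j) is weighted by the number of bookmarks tagged with both tags divided by the number of bookmarks tagged with t_i.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_graph(bookmark_tags):
    """Return {(t_i, t_j): co(t_i, t_j)} for every ordered pair of co-occurring tags.

    bookmark_tags maps a bookmark identifier to the list of tags assigned to it
    (hypothetical input format chosen for this sketch).
    """
    tag_count = defaultdict(int)   # bookmarks tagged with t_i
    pair_count = defaultdict(int)  # bookmarks tagged with both t_i and t_j
    for tags in bookmark_tags.values():
        unique = set(tags)  # a tag counts once per bookmark
        for t in unique:
            tag_count[t] += 1
        for ti, tj in permutations(unique, 2):
            pair_count[(ti, tj)] += 1
    # Asymmetric weight: shared bookmarks divided by bookmarks of the source tag.
    return {(ti, tj): n / tag_count[ti] for (ti, tj), n in pair_count.items()}


# Toy data, invented for illustration only.
toy = {
    "b1": ["recommender", "collaborative"],
    "b2": ["recommender", "semanticweb"],
    "b3": ["recommender"],
    "b4": ["collaborative", "recommender"],
}
graph = cooccurrence_graph(toy)
```

In the toy data, "recommender" appears in four bookmarks and shares two of them with "collaborative", which appears in two bookmarks overall. The edge from "recommender" to "collaborative" therefore weighs 0.5, while the reverse edge weighs 1.0, which illustrates the asymmetry of the metric.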
From this sub-graph, we remove those edges whose co-occurrence values $co(t_i, t_j)$ are lower than the average co-occurrence value of the sub-graph:

$$avg\_co(n) = \frac{\sum_{i,j} co(t_i, t_j)}{\#\{(i, j) : co(t_i, t_j) > 0\}},$$

where $t_i$ and $t_j$ range over the pairs of social tags related to the input bookmark $b_n$. By removing these edges, we aim to isolate (and later discard) "noise" tags that appear less frequently in bookmark annotations.

We hypothesise that the vertices of the generated sub-graph that are most "strongly" connected with the rest of the vertices correspond to tags that should be recommended, assuming that high vertex centralities are associated with the most informative or representative vertices. In this context, it is important to note that related tags with high weights $v_n$ do not necessarily have to be the ones with the highest vertex centralities in the co-occurrence sub-graph. We hypothesise that a combination
of both measures, local weights representing the bookmark content topics and global co-occurrences taking into account collaborative popularities, is an appropriate strategy for tag recommendation.

Figure 2 shows the resulting co-occurrence graph associated with the tags retrieved for the example input bookmark. The tags with the highest vertex in-degree seem to be good candidates to describe the contents of the bookmarked document.

Input bookmark: A Multilayer Ontology-based Hybrid Recommendation Model

Figure 2. Filtered tag co-occurrence graph associated with the example input bookmark. Edge weights and non-connected vertices are not shown. Two main clusters can be identified in the graph, which correspond to two research areas related to the bookmarked document: recommender systems and semantic web technologies.
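The sub-graph extraction, the average-based edge filtering and the in-degree computation described in this section can be sketched as below. The function names, the edge dictionary format and the toy weights are assumptions made for illustration; edges whose weight equals the average are kept, since the text only removes edges that are lower than the average.

```python
def filtered_subgraph(global_graph, related_tags):
    """Keep the edges among related_tags whose weight is not below the average."""
    related = set(related_tags)
    # Sub-graph: edges of the global graph whose two endpoints are related tags.
    sub = {
        (ti, tj): w
        for (ti, tj), w in global_graph.items()
        if ti in related and tj in related and w > 0
    }
    if not sub:
        return {}
    # Average over the edges with a positive co-occurrence value.
    avg_co = sum(sub.values()) / len(sub)
    return {edge: w for edge, w in sub.items() if w >= avg_co}


def in_degrees(edges, related_tags):
    """Count, for every related tag, the surviving edges that point to it."""
    degree = {t: 0 for t in related_tags}
    for _, tj in edges:
        degree[tj] += 1
    return degree


# Toy global graph, invented for illustration only.
global_graph = {
    ("recommender", "collaborative"): 0.5,
    ("collaborative", "recommender"): 1.0,
    ("recommender", "semanticweb"): 0.25,
    ("semanticweb", "recommender"): 1.0,
    ("semanticweb", "ontology"): 0.5,
    ("ontology", "semanticweb"): 0.75,
}
related = ["recommender", "collaborative", "semanticweb"]
kept = filtered_subgraph(global_graph, related)
degrees = in_degrees(kept, related)
```

In this toy case the average weight of the four sub-graph edges is 0.6875, so only the two edges pointing to "recommender" survive, and that tag ends up with in-degree 2 while the others have in-degree 0.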
The goal of this stage was to establish global relations between the social tags that are potentially useful for describing the input bookmark. By exploiting these relations, we aimed to take into account tag popularity and tag co-occurrence aspects, and expected to identify the most informative tags to be recommended.

4.5 Recommending social tags

In the fifth stage (label 5 in Figure 1), we select and recommend a subset of the related tags from previous stages. The selection criterion we propose is based on three aspects: the tag frequency in bookmarks similar to the input bookmark (stage 3), the tag co-occurrence graph centrality (stage 4), and a personalisation strategy that prioritises those tags that are related to the input bookmark and belong to the set of tags already used by the user to whom the recommendations are directed.

For each tag $t$, the first two aspects are combined as follows:

$$c_n(t) = in\_degree_n(t) \cdot v_n(t),$$

where $in\_degree_n(t)$ is the number of edges whose destination is the vertex of tag $t$ in the co-occurrence sub-graph built in stage 4 for the input bookmark $b_n$, and $v_n(t)$ is the weight assigned to $t$ in stage 3.

In order to penalise too generic tags, we conduct a TF-IDF based reformulation of the centralities $c_n(t)$:

$$r_n(t) = c_n(t) \cdot \log\left(\frac{N}{\#\{m : t \in tags(b_m)\}}\right),$$

where $N$ is the total number of bookmarks in the repository.

Finally, to take into account information about the user's tagging activity, we increase the $r_n(t)$ values of those tags that have already been used by the user:

$$p_{n,u}(t) = r_n(t) \cdot \left(1 + p_u(t)\right),$$

where $p_u(t)$ is the normalised preference of user $u$ for tag $t$:

$$p_u(t) = \begin{cases} \dfrac{f_{u,t}}{\max_{t' \in tags(u)} f_{u,t'}} & \text{if } t \in tags(u) \\ 0 & \text{otherwise,} \end{cases}$$

$f_{u,t}$ being the number of times tag $t$ has been used by user $u$.

The tags with the highest preference values $p_{n,u}(t)$ constitute the set of final recommendations. Both the TF-IDF and the personalisation based mechanisms were evaluated in isolation and in conjunction with the baseline approach $c_n(t)$, improving its results.
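The three-step scoring of this stage (centrality times local weight, TF-IDF style penalisation, personalisation boost) can be sketched as follows. The function name `recommend`, its parameters, the cut-off `k` and the toy numbers are hypothetical. The natural logarithm is also an assumption, since the text does not state the base of the logarithm.

```python
import math


def recommend(in_degree, weight, tag_doc_freq, n_bookmarks, user_tag_freq, k=5):
    """Rank candidate tags by the personalised score and return the top k.

    in_degree     -- in-degree of each candidate tag in the filtered sub-graph
    weight        -- stage 3 weight of each candidate tag
    tag_doc_freq  -- number of bookmarks annotated with each tag
    n_bookmarks   -- total number of bookmarks in the repository
    user_tag_freq -- number of times the target user has used each tag
    """
    max_f = max(user_tag_freq.values(), default=0)
    scores = {}
    for t, deg in in_degree.items():
        c = deg * weight.get(t, 0.0)                     # centrality x local weight
        r = c * math.log(n_bookmarks / tag_doc_freq[t])  # penalise generic tags (natural log assumed)
        p_u = user_tag_freq.get(t, 0) / max_f if max_f else 0.0
        scores[t] = r * (1.0 + p_u)                      # boost tags the user already uses
    # Sort by decreasing score; ties are broken alphabetically (an arbitrary choice).
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Toy values, invented for illustration only.
in_degree = {"recommender": 2, "collaborative": 1, "web": 3}
weight = {"recommender": 0.5, "collaborative": 0.5, "web": 0.5}
tag_doc_freq = {"recommender": 10, "collaborative": 10, "web": 100}
user_tag_freq = {"collaborative": 2, "semanticweb": 4}
top, scores = recommend(in_degree, weight, tag_doc_freq, 100, user_tag_freq)
```

In the toy values, "web" has the highest in-degree but annotates every bookmark, so its logarithmic factor is zero and it drops to the bottom of the ranking. "collaborative" is boosted by a factor of 1.5 because the user has already used it half as often as their most frequent tag.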
Table 5 shows the final sorted list of tags recommended for the example input bookmark: recommender, collaborative, filtering, semanticweb, personalization. It is important to note that these tags are not the same as the top tags obtained in stage 3 (see Table 4). In that case, all the tags (recommender, recommendation, collaborative, filtering, collaborativefiltering) were biased towards vocabulary about "recommender systems", and no diversity in the suggested tags was provided.

Table 5. Final tag recommendations for the example input bookmark.

Input bookmark: A Multilayer Ontology-based Hybrid Recommendation Model
Tag 1   recommender
Tag 2   collaborative
Tag 3   filtering
Tag 4   semanticweb
Tag 5   personalization

In the fifth and last stage, we ranked the social tags extracted from the bookmarks similar to the input one. For that purpose, we combined tag co-occurrence graph centrality, tag frequency, and tag-based personalisation metrics. With an illustrative example, we showed that this strategy seems to offer more diversity in the recommendations than simply selecting the tags that were most frequently assigned to similar bookmarks.

5 Experiments

5.1 Tasks

As part of the ECML PKDD 2009 Discovery Challenge, two experimental tasks have been designed to evaluate the tag recommendations. Both of them use the same training dataset, a snapshot of the BibSonomy system until December 31st 2008, but different test datasets:

• Task 1. The test data contains bookmarks whose user, resource or tags are not contained in the training data.
• Task 2. The test data contains bookmarks whose user, resource and tags are all contained in the training data.

5.2 Datasets

Table 6 shows the statistics of the training and test datasets used in the experiments. Tag assignments (user-tag-resource) are abbreviated as tas.
Table 6. ECML PKDD 2009 Discovery Challenge dataset.

                               Web pages   Scientific publications   All bookmarks
Training        users               2679                      1790            3617
                resources         263004                    158924          421928
                tags               56424                     50855           93756
                tas               916469                    484635         1401104
                tas/resource        3.48                      3.05            3.32
Test (task 1)   users                891                      1045            1591
                resources          16898                     26104           43002
                tags               14395                     24393           34051
                tas                64460                     99603          164063
                tas/resource        3.81                      3.82            3.82
Test (task 2)   users                 91                        81             136
                resources            431                       347             778
                tags                 587                       397             862
                tas                 1465                      1139            3382
                tas/resource        3.40                      3.28            4.35

5.3 Evaluation metrics

As evaluation metric, we use the average F-measure, computed over all the bookmarks in the test dataset as follows:

$$F(\widehat{tags}(u, b)) = \frac{2 \cdot precision(\widehat{tags}(u, b)) \cdot recall(\widehat{tags}(u, b))}{precision(\widehat{tags}(u, b)) + recall(\widehat{tags}(u, b))}$$

where:

$$recall(\widehat{tags}(u, b)) = \frac{|tags(u, b) \cap \widehat{tags}(u, b)|}{|tags(u, b)|}$$

$$precision(\widehat{tags}(u, b)) = \frac{|tags(u, b) \cap \widehat{tags}(u, b)|}{|\widehat{tags}(u, b)|}$$

being $tags(u, b)$ the set of tags assigned to bookmark $b$ by user $u$, and $\widehat{tags}(u, b)$ the set of tags predicted by the tag recommender for bookmark $b$ and user $u$.

For each bookmark in the test dataset, we compute the F-measure by comparing the recommended tags against the tags the user originally assigned to the bookmark. The comparison is done ignoring the case of the tags and removing all characters which are neither letters nor numbers.
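The evaluation metric can be sketched as follows. This is not the official challenge evaluation script: the function names are hypothetical, and the tag normalisation relies on Python's `str.lower` and `str.isalnum`, whose Unicode notion of letters and digits is assumed to match the "neither letters nor numbers" rule. Returning 0 when either tag set is empty or when there is no overlap is also an assumption made to avoid division by zero.

```python
def normalise(tag):
    """Lower-case a tag and drop every character that is not a letter or a digit."""
    return "".join(ch for ch in tag.lower() if ch.isalnum())


def f_measure(assigned, recommended):
    """F-measure of the recommended tags against the tags the user assigned."""
    truth = {normalise(t) for t in assigned} - {""}
    pred = {normalise(t) for t in recommended} - {""}
    hits = len(truth & pred)
    if hits == 0:  # also covers an empty truth or prediction set
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def average_f_measure(pairs):
    """Average the F-measure over (assigned, recommended) pairs, one per test bookmark."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(f_measure(a, r) for a, r in pairs) / len(pairs)


# Toy example, invented for illustration only.
assigned = ["Semantic-Web", "recommender"]
recommended = ["recommender", "collaborative", "filtering", "semanticweb", "personalization"]
score = f_measure(assigned, recommended)
```

In the toy example both assigned tags are matched after normalisation, so recall is 1 and precision is 2/5, which gives an F-measure of 4/7.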