PKDD 09, tag recommendation has become the exclusive task. However, the performance of tag recommendation is not yet good enough for wide use; more research is needed, and progress is essential for the practical use of tag recommendation in commercial systems. In this paper, a supervised ranking model is applied to tackle the tag recommendation problem, and good results are achieved on the test data. The rest of the paper is organized as follows: Section 2 reviews previous work on tag recommendation. Section 3 describes the supervised ranking model. Section 4 presents our experiment settings, the experiment procedure, and our analysis of the results on the recovered 08's dataset. The model's performance on the 09's dataset is presented in Section 5. Section 6 summarizes our work.

2 Previous Work

Much research has been done on tag recommendation, and most of it can be categorized into two types: rule-based and classification-based.

The rule-based approach is used by many researchers. Lipczak [1] proposed a three-step tag recommendation system: basic tags are extracted from the resource title; next, the set of potential recommendations is extended by related tags proposed by a lexicon based on co-occurrences of tags within a resource's posts; finally, tags are filtered by the user's personomy, the set of tags previously used by the user. Tatu et al. [2] used document and user models derived from the textual content associated with URLs and publications by social bookmarking tool users; this textual information includes a URL's title, a user's description of a document, or a BibTeX field associated with a scientific publication. They used natural language understanding techniques to produce tag recommendations, such as extraction of concepts and extraction of conflated tags, which group tags into semantically related groups.

However, rule-based approaches require a great deal of expert experience and manual work, and their generalization is limited.

The classification-based approach is also used for the tag recommendation task. Katakis et al. [3] modeled the automated tag suggestion problem as a multilabel text classification task. Heymann et al. [4] predicted tags based on page text, anchor text, surrounding hosts, and other tags applied to the URL. They found an entropy-based metric which captures the generality of a particular tag and informs an analysis of how well that tag can be predicted. They also found that tag-based association rules can produce very high-precision predictions and give deeper insight into the relationships between tags. Their results have implications both for the study of tagging systems as potential information retrieval tools and for the design of such systems. However, classification does not offer a good solution to the tag prediction problem: first, the tag space is fixed, so resources can only be assigned existing tags; second, the number of tags can be very large, making traditional classification models rather inefficient.
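The multilabel framing used by classification-based systems (one binary yes/no decision per tag) can be sketched as follows. This is a minimal illustration, not the cited systems: the toy posts, tags, and the simple per-tag perceptron learner are all hypothetical stand-ins.

```python
# Sketch of tag suggestion as multilabel text classification:
# one binary classifier per tag (one-vs-rest). Toy data and the plain
# perceptron are illustrative stand-ins for the cited approaches.

def bag_of_words(text, vocab):
    """Binary bag-of-words feature vector over a fixed vocabulary."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def train_perceptron(X, y, epochs=20):
    """Plain perceptron for one tag's binary yes/no decision."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            if pred != yi:
                delta = yi - pred  # +1 or -1
                w = [wj + delta * xj for wj, xj in zip(w, xi)]
                b += delta
    return w, b

# Hypothetical posts paired with their gold tag sets.
posts = [
    ("python machine learning tutorial", {"programming", "ml"}),
    ("deep learning with python", {"programming", "ml"}),
    ("french cooking recipes", {"cooking"}),
    ("easy pasta recipes at home", {"cooking"}),
]
vocab = sorted({w for text, _ in posts for w in text.split()})
tags = sorted({t for _, ts in posts for t in ts})

X = [bag_of_words(text, vocab) for text, _ in posts]
models = {}
for tag in tags:  # one-vs-rest: a separate binary model per tag
    y = [1 if tag in ts else 0 for _, ts in posts]
    models[tag] = train_perceptron(X, y)

def suggest_tags(text):
    """Return every tag whose binary classifier fires on the text."""
    x = bag_of_words(text, vocab)
    out = set()
    for tag, (w, b) in models.items():
        if sum(wj * xj for wj, xj in zip(w, x)) + b > 0:
            out.add(tag)
    return out

print(sorted(suggest_tags("machine learning in python")))  # → ['ml', 'programming']
```

Because every distinct tag needs its own classifier over a fixed tag space, the cost of this framing grows with the number of tags, which is exactly the inefficiency noted above.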
Collaborative filtering is a commonly used technique for user-oriented tasks, and many researchers have tried collaborative filtering for tag recommendation. Gilad Mishne [5] used a collaborative approach to automated tag assignment for weblog posts. Robert Jaschke et al. [6] evaluated and compared user-based collaborative filtering and a graph-based recommender; the results show that both methods beat a non-personalized baseline, and the graph-based recommender in particular outperforms existing methods considerably. Adriana Budura et al. [7] used neighborhood-based tag recommendation, which makes use of content similarity. These approaches select candidate tags with simple, hand-designed scores. In this paper, by contrast, a machine learning method is used: a ranking model is learned automatically, the candidate tags are ranked, and the top-ranked tags are suggested as recommendations.

3 Supervised Ranking Model for Tag Recommendation

3.1 Problem Statement

The tag recommendation problem can be described as follows: for a given post P whose user is U and resource is R, a set of tags is suggested for the post. Here we denote a post as P, a tag as T, a resource as R, and a user as U. A possible and most natural way to solve the tag recommendation problem is as follows: first, a set of candidate tags is selected for the post; then the tags most likely to be the post's tags are chosen as recommendations. The commonly used approaches for choosing the tags are rule-based and classification-based methods, but both have defects: the rule-based approach relies on expert experience and manual effort to set up the rules and tune the parameters; the classification-based approach is restricted to a fixed tag space and is inefficient when treated as a multi-label problem. In this paper, tag recommendation is converted into a problem of ranking candidate tags. A ranking model is constructed to ensure that tags which are most likely to be a post's tags rank higher than tags which are not. A supervised learning model is used to construct a ranking model satisfying this constraint. The Ranking-SVM model is the most frequently used supervised ranking model and has proven successful, so it is adopted as our supervised ranking model in the experiments. All the candidate tags for one post are grouped as a ranking group, and the top-ranked candidate tags are selected as recommended tags.

3.2 Introduction to Ranking SVM

Here we briefly describe the Ranking Support Vector Machine (Ranking SVM) model for tag recommendation. Assume that X ∈ ℝ^m is the input feature space, representing the features of a candidate tag given a user and resource, where m denotes the number of features. Y = {0, 1} is the output rank space, represented by the labels, and 1
represents that the tag is labeled by the user, and 0 means it is not. A pair (x, y) ∈ X × Y denotes the feature vector and label of a training instance. Given a training set with tags T = {t1, t2, ..., tn}, each tag ti has an associated pair {x, y}, so the whole training set can be formulated as S = {(xi, yi) | i = 1, ..., N}, where N is the number of all tags.

In Ranking SVM [8], the ranking model f is a linear function represented by ⟨w, x⟩, where w is the weight vector and ⟨·,·⟩ denotes the inner product. In RSVM we construct a new training set S′ from the original training set S: for every pair with yi ≠ yj in S, we construct (xi − xj, zij) and add it to S′, where zij = +1 if yi ≻ yj, and −1 otherwise. Here ≻ denotes the preference relationship; for example, y = 1 is preferred to y = 0. For notational consistency, we write S′ as {(x¹i − x²i, zi) | i = 1, ..., D}. The final model is formalized as the following quadratic programming problem:

    min_{w, ξ}  (1/2C) ‖w‖² + Σ_{i=1}^{D} ξi
    s.t.  ξi ≥ 0,  zi ⟨w, x¹i − x²i⟩ ≥ 1 − ξi        (1)

Problem (1) can be solved with existing quadratic programming methods. Figure 1 shows an example of the ranking SVM model.

Fig. 1. Example of ranking SVM model

The ranking SVM model converts the ranking problem into a binary classification problem: each object to be ranked is compared with all other objects in the same ranking group. For n objects, the model makes C(n,2) = n(n−1)/2 comparisons and then outputs the ranking result. This is its advantage over a classification model: a classification model ignores the other candidate tags, while the ranking model takes their existence into consideration.

3.3 Ranking Process

For any post Pij in the test dataset, we denote the collection of all candidate tags for post Pij as CT{Pij}, and CTk (k = 1, 2, ..., n) as the k-th candidate tag for the post, so CT{Pij} = {CT1, CT2, ..., CTn}. The ranking model ranks the candidate tags as {CT′1, CT′2, ..., CT′n} from top to bottom; the top-k tags are then selected as the predicted tags of post Pij. Table 1 shows the steps to rank the candidate tags.

Table 1. Algorithm for ranking the candidate tags
Input: candidate tags {CT1, CT2, ..., CTn}
Output: top-k tags {CT′1, CT′2, ..., CT′k}
1. Extract features x = {xi} (i = 1, 2, ..., n) for the sequence of candidate tags CT{Pij} = {CT1, CT2, ..., CTn}.
2. Rank the candidates with the learned ranking model, giving {CT′1, CT′2, ..., CT′n}.
3. Select the top-k tags {CT′1, CT′2, ..., CT′k} as recommended tags.

Also, the number of recommended tags affects the performance of the system. For example, if the actual number of tags for the post with content id=123456 is 3, precision suffers when 4 tags are recommended to the user, so a proper number of tags to recommend must be found. The number used in our experiments is half the number of all candidate tags, capped at 5; that is, we recommend at most 5 tags.

3.4 Training Process

For every post in the training dataset, the candidate tags CT{Pij} of each post Pij are extracted. They are then grouped by post, and features are extracted for each candidate from the post content. Candidates CTk that appear in the post's actual tag set T{Pij} are labeled '1'; the rest are labeled '0'. We then use the SVM-light tool to train a ranking-SVM model. When predicting the tags of posts in the test dataset, the model learned on the training dataset is applied to rank the candidate tags, and the top-ranked tags are selected as recommendations.

4 Experiments on the Recovered 08's Dataset

4.1 Experiment Settings

2008's dataset recovery. In order to compare our experiments' performance with that of the 08's teams, we tried to obtain the 08's dataset (both training and test data) and test our model's performance on the recovered dataset.
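The pipeline of Sections 3.2 and 3.3 can be sketched as follows. This is a minimal illustration on hypothetical feature vectors: a perceptron over pairwise differences stands in for the SVM-light QP solver of problem (1), and the selection rule above (half the candidates, at most 5) picks the recommended tags.

```python
# Sketch of the ranking pipeline: learn w from pairwise differences
# (a perceptron stand-in for the Ranking-SVM QP; the real system uses
# SVM-light), then rank a post's candidate tags and keep the top k,
# where k = half the number of candidates, capped at 5.
# All feature vectors and tag names below are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_pairwise(groups, epochs=50):
    """groups: list of [(feature_vector, label)] per post; label 1 = real tag.
    Builds (x_i - x_j, z_ij) pairs within each group and fits w so that
    z * <w, x_i - x_j> > 0, mimicking constraint (1) without slack."""
    dim = len(groups[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for group in groups:
            for xi, yi in group:
                for xj, yj in group:
                    if yi > yj:  # xi preferred over xj (z = +1)
                        diff = [a - b for a, b in zip(xi, xj)]
                        if dot(w, diff) <= 0:  # violated pair: update w
                            w = [wk + d for wk, d in zip(w, diff)]
    return w

def recommend(w, candidates):
    """candidates: list of (tag_name, feature_vector). Returns top-k tags."""
    ranked = sorted(candidates, key=lambda c: dot(w, c[1]), reverse=True)
    k = min((len(candidates) + 1) // 2, 5)  # half the candidates, at most 5
    return [name for name, _ in ranked[:k]]

# Toy training groups: features could be e.g. (tf in title, tag popularity).
train_groups = [
    [((3.0, 1.0), 1), ((0.5, 2.0), 0), ((0.2, 0.1), 0)],
    [((2.0, 0.5), 1), ((0.3, 1.5), 0)],
]
w = train_pairwise(train_groups)

candidates = [("web", (0.1, 1.8)), ("svm", (2.5, 0.6)),
              ("ranking", (1.8, 0.4)), ("misc", (0.2, 0.2))]
print(recommend(w, candidates))  # → ['svm', 'ranking']
```

In the real system the features come from the post content and the groups are built per post from the training data; here both are toy values chosen only to make the sketch runnable.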
Though the 08's test data can be downloaded from the web, we found that user IDs had been changed between the datasets. However, the content id field in the 08's test data is consistent with the 09's data, so we tried to recover the 08's dataset from the 09's dataset using the content id field and the date/time field. The 08's real training
data and test data are subsets of the 09's data, so it is possible to recover the 08's data from the 09's data. After examining the 08's real test data, we found that all posts in the 08's test data fall between Mar. 31, 2008 and May 15, 2009, so we use the posts from this period in the 09's training data as the recovered 08's test data, and the posts before Mar. 31, 2008 as our recovered 08's training data. There are still slight differences between our recovered data and the 08's real data. We assume these differences do not seriously affect performance, so our results are comparable with the 08's results. Some statistics of our recovered 08's dataset follow: Table 2 shows the statistics of posts in the recovered dataset, and Table 3 shows the statistics of posts according to whether their user and resource appear in the recovered training data. In the remainder of Section 4, the training data refers to the recovered training data and the test data to the recovered test data.

Table 2. Statistics of posts on the recovered 08's dataset
Posts in recovered training data: 234,134 (BOOKMARK: 184,655; BIBTEX: 49,479)
Posts in recovered test data: 63,192 (BOOKMARK: 20,647; BIBTEX: 42,545)

Table 3. Statistics of posts according to their user and resource status
Users in recovered test data that appear in recovered training data: 265
Users in recovered test data that do not appear in recovered training data: 225
Resources in recovered test data that appear in recovered training data: 1,230
Resources in recovered test data that do not appear in recovered training data: 61,970

Data format description. The dataset used in the experiments is released by ECML. The data consists of three tables: the TAS table, the BOOKMARK table, and the BIBTEX table. Table 4 describes the fields of the three tables; only the fields used in the experiments are listed.

Data preprocessing. First, all terms are converted to lowercase. Then stop words such as "a", "the", "is", "an" are removed, since such terms are unlikely to be tags of the post. Finally, punctuation marks such as ':' and ',' are removed, and LaTeX symbols such as '{' and '}' are also removed using regular expressions. Table 5 shows example results of the data preprocessing.

4.2 Post Division

It can be observed from the data distribution that some users of posts exist in the training data (54%) and some do not exist in the training data (46%). Also