Human-competitive tagging using automatic keyphrase extraction

Olena Medelyan, Eibe Frank, Ian H. Witten
Computer Science Department, University of Waikato
{olena,eibe,ihw}@cs.waikato.ac.nz

Abstract

This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, “Maui”, that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.

1 Introduction

Tagging is the process of labeling web resources based on their content. Each label, or tag, corresponds to a topic in a given document. Unlike metadata assigned by authors, or by professional indexers in libraries, tags are assigned by end users for organizing and sharing information that is of interest to them. The organic system of tags assigned by all users of a given web platform is called a folksonomy.

In contrast to traditional taxonomies painstakingly constructed by experts, a user can add any tags to a folksonomy. This leads to the greatest downside of tagging, inconsistency, which originates in the synonymy and polysemy of human language, as well as in the varying degrees of specificity used by taggers (Golder and Huberman, 2006). In traditional libraries, consistency is the primary evaluation criterion of indexing (Rolling, 1981). Much work has been done on describing the statistical properties of folksonomies, such as tag distribution and co-occurrences (Halpin et al., 2007; Sigurbjörnsson et al., 2008; Sood et al., 2007), but to our knowledge there has been none on assessing the actual quality of tags. How well do human taggers perform? How consistent are they with each other?

One potential solution to inconsistency in folksonomies is to use suggestion tools that automatically compute tags for new documents (e.g. Mishne, 2006; Sood et al., 2007; Heymann et al., 2008). Interestingly, the blooming research on automatic tagging has so far not been connected to work on keyphrase extraction (e.g. Frank et al., 1999; Turney, 2003; Hulth, 2004), which can be used as a tool for the same task (note: we use tag and keyphrase as synonyms). Instead of simple heuristics based on term frequencies and co-occurrence of tags, keyphrase extraction methods apply machine learning to determine typical distributions of properties common to manually assigned phrases, and can include analysis of semantic relations between candidate tags (Turney, 2003). How well do state-of-the-art keyphrase extraction systems perform compared to simple tagging techniques? How consistent are they with human taggers? These are questions we address in this paper.

Until now, keyphrase extraction methods have primarily been evaluated using a single set of keyphrases for each document, thereby largely ignoring the subjective nature of the task. Collaboratively tagged documents, on the other hand, offer multiple tag assignments by independent users, a unique basis for evaluation that we capitalize upon in this paper.
The experiments reported in this paper fill these gaps in the research on automatic tagging and keyphrase extraction. First, we analyze tagging consistency on the CiteULike.org platform for organizing academic citations. Methods traditionally used for the evaluation of professional indexing will provide insight into the quality of this folksonomy. Next, we extract a high quality corpus from CiteULike, containing documents that have been tagged consistently by the best human taggers.
Following that, our goal is to build a system that matches the performance of these taggers. We first apply an existing approach proposed by Brooks and Montanez (2006) and compare it to the keyphrase extraction algorithm Kea (Frank et al., 1999). Next we create a new algorithm, called Maui, that enhances Kea’s successful machine learning framework with semantic knowledge retrieved from Wikipedia, new features, and a new classification model. We evaluate Maui using tag sets assigned to the same documents by several users and show that it is as consistent with CiteULike users as they are with each other.

Most of the computation required for automatic tagging with this method can be performed offline. In practice, it can be used as a tag suggestion tool that provides users with tags describing the main topics of newly added documents, which can then be corrected or enhanced by personal tags if required. This will improve consistency in the folksonomy without compromising its flexibility.

2 Collaboratively-tagged Data

CiteULike.org is a bookmarking service that resembles the popular del.icio.us, but concentrates on scholarly papers. Rather than replicating the full text of tagged papers it simply points to them on the web (e.g. PubMed, CiteSeer, ScienceDirect, Amazon) or in journals (e.g. HighWire, Nature). This avoids violating copyright but means that the full text of articles is not necessarily available. When entering new resources, users are encouraged to assign tags describing their content or reflecting their own grouping of the information. However, the system does not suggest tags. Moreover, users do not see other users’ tags and are thus not biased in their tag choices.

2.1 Extracting a high quality tagged corpus

The CiteULike data set is freely available and contains information about which documents were tagged with what tags by which users (although identities are not provided). CiteULike’s 22,300 users have tagged 713,600 documents with 2.4M “tag assignments”—single applications of a tag by a user to a document. The two most popular tags, bibtex-import and no-tag, indicate an information source and a missing tag respectively. Most of the remainder describe particular concepts relevant to the documents. We exclude non-content tags from our experiments, e.g. personal tags like to-read or todo. Note that spam entries have been eliminated from the data set.

Because CiteULike taggers are not professional indexers, high quality of the assigned topics cannot be guaranteed. In fact, manual assessment of users’ tags by human evaluators shows precision of 59% (Mishne, 2006) and 49% (Sood et al., 2006). However, why should the opinion of human evaluators be valued more than the opinion of taggers? We propose an alternative way of determining ground truth, using an automatic approach to identify reliable tags: we concentrate on a subset of CiteULike containing documents that have been indexed with at least three tags on which at least two users have agreed.

In order to measure the tagging consistency between the users, and then compare it to the algorithm’s consistency, we need taggers who have tagged documents that some others had tagged. We say that two users are “co-taggers” if they have both tagged at least one common document. As well as restricting the document set, we only include taggers who have at least two co-taggers.

Figure 1 shows the proportions of CiteULike documents that are discarded in order to produce our high quality data set.
The final set contains only 2,100 documents (0.3% of the original). Unfortunately, many of these are unavailable for download—for example, books at Amazon.com and ArXiv.org references cannot be crawled. We further restrict attention to two sources: HighWire and Nature, both of which provide easily accessible PDFs of the full text.

The result is a set of 180 documents indexed by 332 taggers. A total of 4,638 tags were assigned by all taggers to documents in this set; however, the number of tags on which at least two users agreed is significantly smaller, namely 946. Still, this results in accurate tag sets that contain an average of five tags per document.

Figure 1. Quality control of CiteULike data
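The corpus filtering described above can be expressed as a few set operations over tag assignments. The snippet below is a minimal sketch, not the procedure actually used for the paper: the (doc_id, user_id, tag) input schema, the lowercasing-only tag normalization, and the way co-taggers are counted over the retained documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def filter_high_quality(assignments):
    """assignments: iterable of (doc_id, user_id, tag) triples (hypothetical schema).
    Keeps documents with at least three tags that two or more users agreed on,
    restricted to taggers who have at least two co-taggers."""
    doc_tag_users = defaultdict(lambda: defaultdict(set))  # doc -> tag -> users
    doc_users = defaultdict(set)                           # doc -> users
    for doc, user, tag in assignments:
        doc_tag_users[doc][tag.lower()].add(user)          # crude normalization
        doc_users[doc].add(user)

    # Step 1: documents with at least three "agreed" tags (>= 2 users each)
    good_docs = {doc for doc, tags in doc_tag_users.items()
                 if sum(1 for users in tags.values() if len(users) >= 2) >= 3}

    # Step 2: co-tagger relation over the retained documents
    co_taggers = defaultdict(set)
    for doc in good_docs:
        for u, v in combinations(doc_users[doc], 2):
            co_taggers[u].add(v)
            co_taggers[v].add(u)
    good_users = {u for u, peers in co_taggers.items() if len(peers) >= 2}

    # Keep documents that still have at least one qualifying tagger
    return {doc for doc in good_docs if doc_users[doc] & good_users}
```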
Note that traditionally much smaller data sets are used to assess consistency of human indexers, because such sets need to be created specifically for the experiment. Collaborative tagging platforms like CiteULike can be mined for large collections of this kind in natural settings.

Most documents in the extracted set relate to the area of bioinformatics. To give an example, a document entitled Initial sequencing and comparative analysis of the mouse genome was tagged by eight users with a total of 22 tags. Four of them agreed on the tag mouse, but one used the broader term rodents. Three agreed on the tag genome, but one added genome paper, and another used the more specific comparative genomics. There are also cases where tags are written together, e.g. genomepaper, or with a prefix, e.g. key genome, or in a different grammatical form: sequence vs. sequencing. This example shows that many inconsistencies in tags are not caused by personalized tag choices, as Chirita et al. (2007) suggest, but rather stem from the lack of guidelines and uniform tag suggestions that a bookmarking service could provide.

2.2 Measuring tagging consistency

Traditional indexers aim for consistency, on the basis that this will enhance document retrieval (Leonard, 1975). Consistency is measured using experiments in which several people index the same documents—usually a small set of a few dozen documents. It is computed for pairs of indexers, by formulae such as Rolling’s (1981):

Consistency(I1, I2) = 2C / (A + B),

where C is the number of tags (index terms) indexers I1 and I2 have in common, and A and B are the sizes of their tag sets, respectively.

In our experiments, before computing the number of terms in common, we stem each tag with the Porter (1980) stemmer. For example, the overlap C between the tag sets {complex systems, network, small world} and {theoretical, small world, networks, dynamics} consists of the two tags {network, small world}, and the consistency is 2×2/(3+4) = 0.57.

To compute the overall consistency of a particular indexer, this figure is averaged over all documents and co-indexers. There were no cases where the same user reassigned tags to the same articles, so computing intra-tagger consistency, although interesting, was not possible.

To our knowledge, traditional indexing consistency metrics have not yet been applied to collaboratively tagged data. However, experiments on determining tagging quality do follow the same idea. For example, Xu et al. (2006) define an authority metric that assigns high scores to those users who match other users’ choices on the same documents, in order to eliminate spammers.

2.3 Consistency of CiteULike taggers

In the collection of 180 documents tagged by 332 users described in Section 2.1, each tagger has 18 co-taggers on average, ranging from 2 to 129, and has indexed 1 to 25 documents. For each user we compute the consistency with all other users who tagged the same document. Consistency is then averaged across documents. We found that the distribution of per-user consistency resembles a power law, with a few users achieving high consistency values and a long tail of inconsistent taggers.
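As a concrete illustration of the consistency measure from Section 2.2, the sketch below computes Rolling's formula over stemmed tag sets and reproduces the worked example above. It assumes NLTK's PorterStemmer as a stand-in for the Porter (1980) stemmer and stems multi-word tags word by word.

```python
from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def normalize(tag):
    # Stem each word of a (possibly multi-word) tag.
    return " ".join(stemmer.stem(w) for w in tag.lower().split())

def rolling_consistency(tags_a, tags_b):
    """Rolling's (1981) inter-indexer consistency: 2C / (A + B)."""
    a = {normalize(t) for t in tags_a}
    b = {normalize(t) for t in tags_b}
    c = len(a & b)
    return 2 * c / (len(a) + len(b))

# Worked example from Section 2.2: overlap {network, small world} gives ~0.57
print(rolling_consistency(
    ["complex systems", "network", "small world"],
    ["theoretical", "small world", "networks", "dynamics"]))
```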
tagger   co-taggers   documents   consistency (%)
1        1            5           71.4
2        1            5           71.4
3        6            5           57.9
4        6            6           51.0
5        11           12          50.4
6        2            5           50.1
7        4            6           48.3
8        8            8           47.1
9        13           16          45.4
10       12           8           44.4
11       7            6           43.5
12       7            6           41.7
13       8            5           40.9
14       7            6           39.7
15       9            13          38.8
16       4            5           38.4
17       12           9           37.3
18       4            14          36.1
19       9            8           35.9
20       10           11          33.7
21       7            6           33.1
22       6            5           33.0
23       7            10          32.1
24       11           16          31.7
25       8            13          30.6
26       6            8           30.6
27       9            6           29.8
28       10           12          29.0
29       8            6           28.8
30       9            10          27.9
31       10           8           26.7
32       8            7           26.3
33       10           5           25.6
34       8            7           21.0
35       9            9           18.3
36       3            6           7.9
average  7.5          8.1         37.7

Table 1. Consistency of the most prolific and most consistent taggers
The maximum consistency in this group is 92.3% and the average is 18.5%. The average consistency of the most prolific 70 indexers—those who have indexed at least five documents—is in the same range, namely 18.4%. The consistency of traditional approaches to free indexing is reported to be between 4% and 67%, with an average of 27%, depending on what aids are used (Leininger, 2000).

It is instructive to consider the group of best taggers. We define these as the ones who (a) exhibit greater than average consistency with all others, and (b) are sufficiently prolific, i.e. have tagged at least five documents. There are 36 such taggers; Table 1 lists their consistency within this group. The average consistency they achieve as a group is 37.7%, which is similar to the average consistency of professionals (Leininger, 2000).

The above consistency analysis provides insight into the tagging quality of the best CiteULike users, based on HighWire and Nature articles. For the purposes of this paper, it shows how the tagging community can be restricted to a best-performing group of taggers by measuring their consistency. This is helpful for testing the performance of automatic tagging (Section 4.4).

3 Automatic tagging with Maui

Maui is a general algorithm for automatic topical indexing based on the Kea system (Frank et al., 1999); it is open-source and available for download at http://maui-indexer.googlecode.com. It works in two stages: candidate selection and machine-learning-based filtering. In this paper, we apply it to automatic tagging. In the candidate selection stage, Maui first determines textual sequences defined by orthographic boundaries and splits these sequences into tokens. Then all n-grams up to a maximum length of 3 words that do not begin or end with a stopword are extracted as candidate tags. To reduce the number of candidates, all those that appear only once are discarded. This speeds up the training and the extraction process without impacting the results. In the filtering stage several features are computed for each candidate, which are then input to a machine learning model to obtain the probability that the candidate is indeed a tag.

Maui’s architecture resembles that of many other supervised keyphrase extraction systems (Turney, 2000; Hulth, 2004; Medelyan et al., 2008). However, this architecture has not previously been applied to the task of automatic tagging.

3.1 Features indicating significance

We now describe the features used in the classification model to determine whether a phrase is likely to be a tag. We begin with three baseline features used in Kea (Frank et al., 1999), and extend the set with three features that have been found useful in previous work. We also add three new features that have not been evaluated before: spread, semantic relatedness and inverse Wikipedia linkage. All Wikipedia-based features are computed using the WikipediaMiner toolkit (http://wikipedia-miner.sourceforge.net/).

1. TF×IDF combines the frequency of a phrase in a particular document with its inverse occurrence frequency in general use (Salton and McGill, 1983). This score is high for rare phrases that appear frequently in a document and therefore are more likely to be significant.

2. Position of the first occurrence is computed as the relative distance of the first occurrence of the candidate tag from the beginning of the document.
Candidates with very high or very low values are likely to be tags, because they appear either in the opening document parts such as title, abstract, table of contents, and introduction, or in the document’s final sections such as conclusion and reference lists.

3. Keyphraseness quantifies how often a candidate phrase appears as a tag in the training corpus. Automatic tagging approaches utilize the same information: Mishne (2006) and Sood et al. (2006) automatically suggest tags previously assigned to similar documents. However, in Maui (as in Kea) this feature is just one component of the overall model. Thus if a candidate never appears as a keyphrase in the training corpus, it can still be extracted if its other feature values are significant enough.

4. Phrase length is measured in words. Generally speaking, the longer the phrase, the more specific it is. Training captures and quantifies the specificity preference in a given training corpus.

5. Node degree quantifies the semantic relatedness of a candidate tag to other candidates. Turney (2003) computes semantic relatedness using search engine statistics. Instead, following Medelyan et al. (2008), we utilize Wikipedia hyperlinks for this task. We first map each candidate phrase to its most common Wikipedia page. For example, the word Jaguar appears as a link anchor in Wikipedia 927 times. In 466 cases it links to the article Jaguar cars, thus the commonness of this mapping is 0.5. In 203 cases it links to the animal description, a commonness of 0.22.
We compute the node degree of the corresponding Wikipedia article as the number of hyperlinks that connect it to other Wikipedia pages that have been identified for other candidate tags from the same document. A document that describes a particular topic will cover many related concepts, so a high node degree—which indicates strong connectivity to other phrases in the same document—means that a candidate is more likely to be significant.

6. Wikipedia-based keyphraseness is the likelihood of a phrase being a link in the Wikipedia corpus. It divides the number of Wikipedia pages in which the phrase appears in the anchor text of a link by the total number of Wikipedia pages containing it. We multiply this number by the phrase’s document frequency.

The new features proposed in this paper are the following:

7. Spread of a phrase is the distance between its first and last occurrences in a document. Both values are computed relative to the length of the document (see feature 2). High values help to identify phrases that are mentioned both at the beginning and at the end of a document.

8. Semantic relatedness of a phrase has already been captured as the node degree (see feature 5). However, recent research allows us to compute semantic relatedness with better techniques than mere hyperlink counts. Milne and Witten (2008) propose an efficient Wikipedia-based approach that is nearly as accurate as human subjects at quantifying the relationship between two given concepts. Given a set of candidate phrases, we determine the most likely Wikipedia articles that describe them (as explained in feature 5), and then determine the total relatedness of a given phrase to all other candidates. The higher the value, the more likely the phrase is to be a tag.

9. Inverse Wikipedia linkage is another feature that utilizes Wikipedia as a source of language usage statistics. Here, again given the most likely Wikipedia article for a given phrase, we count the number of other Wikipedia articles that link to it and normalize this value as in inverse document frequency:

log2( N / linksTo(AP) ),

where linksTo(AP) is the number of incoming links to the article AP representing the candidate phrase P, and N is the total number of links in our Wikipedia snapshot (52M). This feature highlights those phrases that refer to concepts commonly used to describe other concepts.
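To make the candidate selection stage and the purely document-based features concrete, here is a minimal sketch covering TF×IDF, position of first occurrence, phrase length and spread. It is illustrative rather than Maui's actual implementation: the stopword list is a toy one, doc_freq and num_docs stand in for corpus statistics from the training data, the TF×IDF smoothing is an assumption, and the Wikipedia-based features (5, 6, 8 and 9) are omitted.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "with"}  # toy list

def candidates(text, max_len=3):
    """Candidate selection as described above: n-grams of up to 3 tokens that do not
    begin or end with a stopword; candidates occurring only once are discarded."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    positions = {}  # first and last token offsets of each candidate
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            counts[phrase] += 1
            first, last = positions.get(phrase, (i, i))
            positions[phrase] = (min(first, i), max(last, i))
    return {p: (c, positions[p]) for p, c in counts.items() if c > 1}

def document_features(text, doc_freq, num_docs):
    """Document-based features only; doc_freq and num_docs are assumed to come
    from the training corpus."""
    length = max(len(re.findall(r"[a-z0-9]+", text.lower())), 1)
    feats = {}
    for phrase, (tf, (first, last)) in candidates(text).items():
        idf = math.log((num_docs + 1) / (doc_freq.get(phrase, 0) + 1))  # smoothed
        feats[phrase] = {
            "tfidf": (tf / length) * idf,
            "first_occurrence": first / length,
            "spread": (last - first) / length,
            "phrase_length": len(phrase.split()),
        }
    return feats
```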
3.2 Machine learning in Maui

In order to build the model, we use the subset of the CiteULike collection described in Section 2.1. For each document we know a set of tags that at least two users have agreed on. This is used as ground truth for building the model. For each training document, candidate phrases (i.e. n-grams) are identified and their feature values are calculated as described above.

Each candidate is then marked as a positive or negative example, depending on whether users have assigned it as a tag to the corresponding document. The machine-learning model is constructed automatically from these labeled training examples using the WEKA machine learning workbench. Kea (Frank et al., 1999) uses the Naïve Bayes classifier, which implicitly assumes that the features are independent of each other given the classification. However, Kea uses only two or three features, whereas Maui combines nine features amongst which there are many obvious relationships, e.g. first occurrence and spread, or node degree and semantic relatedness. Consequently, we also consider bagged decision trees, which can model attribute interactions and do not require parameter tuning to yield good results. Bagging learns an ensemble of classifiers and uses them in combination, thereby often achieving significantly better results than the individual classifiers (Breiman, 1996). Different trees are generated by sampling from the original dataset with replacement. Like Naïve Bayes, bagged trees yield probability estimates that can be used to rank candidates.

To select tags from a new document, Maui determines candidate phrases and their feature values, and then applies the classifier built during training. This classifier determines the probability that a candidate is a tag based on relative frequencies observed from the training data.

4 Evaluation

Here we describe the data used in the experiments and the results obtained, addressing the following questions:

1. How does a state-of-the-art keyphrase extraction method perform on collaboratively tagged data, compared to a baseline automatic tagging method?
2. What is the performance of Maui with old and new features?
3. How consistent are Maui’s tags compared to those assigned by human taggers?
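Returning to the filtering stage of Section 3.2, the sketch below trains a bagged-decision-tree classifier on labeled candidate feature vectors and ranks a new document's candidates by the predicted probability of being a tag. scikit-learn is used here only as a stand-in for the WEKA workbench, and the feature matrices and candidate names are randomly generated placeholders for the nine features and candidates of Section 3.1.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: one row of nine feature values per candidate phrase,
# labeled 1 if users assigned the candidate as a tag, otherwise 0.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 9)
y_train = (rng.rand(200) < 0.2).astype(int)

# Bagged decision trees, standing in for the WEKA setup described in Section 3.2.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Rank the candidates of a new document by the probability of being a tag.
new_candidates = ["mouse", "genome", "comparative genomics"]   # illustrative only
X_new = rng.rand(len(new_candidates), 9)                       # placeholder features
probs = model.predict_proba(X_new)[:, 1]
ranking = sorted(zip(new_candidates, probs), key=lambda p: -p[1])
print(ranking[:2])   # the top-ranked candidates become the suggested tags
```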