Margaret E.I. Kipp (mkipp@uwo.ca)
Faculty of Information and Media Studies, University of Western Ontario, London, Ontario

Tagging Practices on Research Oriented Social Bookmarking Sites

Abstract: This paper examines the tagging practices evident on CiteULike, a research oriented social bookmarking site for journal articles. Tagging practices were examined using standard informetric measures for analysis of bibliographic information and term use. Additionally, tags were compared to author keywords and descriptors assigned to the same article.

Résumé : Cette communication examine les pratiques d’étiquetage par mots-clés qui sont utilisés sur CiteULike, un service d’étiquetage social, pour les articles de périodiques. Ces pratiques de marquage ont été examinées en utilisant les mesures informétriques habituellement utilisées pour l’analyse d’information bibliographique et d’utilisation de mots-clés. En outre, les étiquettes ont été comparées aux mots-clés utilisés par les auteurs et aux descripteurs attribués à ces mêmes articles.

1. Introduction

The ability to quickly locate relevant information is becoming increasingly important as more information becomes available digitally. Much of this information is unsorted and retrieval relies on free text search, user created hyperlinks and a large dose of serendipity.

Information organisation is a core area of library and information science dealing directly with the ability to increase the relevance of information retrieval by increasing the ability to at once collocate and distinguish material. In a digital world, one of the important tasks of library and information science is to reduce the difficulty inherent in searching large document spaces for information. A classification system using terms and keywords appropriate to the context of the intended user can help make the difference between a usable document space and a space in which it is difficult to navigate and find the information sought.

Universal hierarchical classification systems and subject specific taxonomies have a long history, but the design and application of these systems have largely been left to professional intermediaries such as librarians. As the amount of information available for user search increases and users begin to demand increasingly specialised information in search, these systems are often found to be at once too generic and too specific for user needs. Full text search can provide fine grained access to information, but it does so at the expense of precision, because documents on the same topic use differing terminology.

User tagging and folksonomies created in a distributed fashion through social bookmarking sites have been suggested as a potential solution to these problems (Mathes 2004; Hammond et al. 2005) since user tagging could provide the additional access points at less cost. However, this relies on many assumptions, such as the assumption that user tagging provides a similar or better search context to free text searching or intermediary assigned index terms.

This study builds on a previous study (Kipp 2006) examining the emerging phenomenon of social bookmarking or tagging in comparison to existing classificatory structures from traditional cataloguing and classification research. A sample of articles from the field of library and information science was examined for contextual differences in keyword usage between users of social bookmarking sites and authors and intermediaries (cataloguers or indexers). That study found many similarities and some intriguing differences in context, specifically in the realm of personal information management.

Users tagging articles on social bookmarking tools tend to use terms such as 'toread' and 'todo' to indicate their interest in further use or study of an item. (Kipp 2006) A study of del.icio.us found that approximately 16% of tags in the sample were time and task related tags having a personal information management edge. (Kipp and Campbell 2006)

Additional differences included the fact that "intermediaries considered geographic location to be an important part of the description of the aboutness of an article, authors and users tended to assume it was somewhat less important than the other contexts of the articles." (Kipp 2006) Many tags were related to terms in the formal thesaurus from which the descriptors were drawn, but were not formally in the thesaurus. In some cases this was due to new or emerging terminology, in others to material being used in related but different areas of a field (e.g. information seeking versus information retrieval). (Kipp 2006)

The current study expands upon the findings from this earlier study using a larger collection of articles from the field of biology tagged by users of CiteULike (http://CiteULike.org/), a social bookmarking site which is specialised for academic articles.
The chosen journals were restricted to journals known to request author assigned keywords and to journals indexed in Pubmed, which provides intermediary assigned controlled vocabulary for searchers. Thus, each article in the study has three sets of keywords assigned by three different classes of metadata creators. As in the previous study, the data will be analysed using thesaural comparisons for depth of specificity at various levels as well as statistically for term usage and frequency.

Analysis of this new data set from a different field will help to strengthen the conclusions of the earlier study by showing that users in different fields also provide useful sets of tags. This study has implications for the design of systems for accessing, indexing and searching document spaces.

2. Social Bookmarking Tools

Social bookmarking sites have become increasingly popular since their inception. Sites such as del.icio.us report over a million users with additional users signing up every day. (http://blog.del.icio.us/blog/2006/09/million.html) Interest is increasing in academic circles. In particular, researchers from library science and computer science examine the growth of an Internet phenomenon with potential applications to both fields. (Voss 2007; Kipp 2006; Kipp and Campbell 2006; Hammond et al. 2005)

One of the most interesting aspects of social bookmarking sites is the phenomenon of social tagging that has grown along with them as users are encouraged to provide a few key terms they consider most useful in categorising the item they are bookmarking.

Tagging, which began on social bookmarking sites like del.icio.us, allowed users to store their bookmarks (favourite URLs) in a publicly accessible fashion and associate these bookmarks with a series of descriptive tags the user thought might be helpful in aiding
the process of finding the URL again. Early adopters found that the automatic clustering of bookmarked URLs by their associated tags led to the discovery of other useful URLs on similar topics. (Shirky 2005)

The number of sites utilising user tagging as a form of information organisation is increasing and tagging is beginning to be integrated into web sites with more traditional hierarchical organisational systems such as on-line book stores (e.g. Amazon.com).

CiteULike (http://CiteULike.org/) is a social bookmarking service specialised for use by academics who wish to bookmark academic articles for later retrieval. CiteULike was created by Richard Cameron in November 2004. (http://www.CiteULike.org/faq/all.adp)

Figure 1: Screenshot of CiteULike

Similar to the more commonly known del.icio.us, CiteULike allows users to assign an arbitrary number of tags to the articles in their library. Users may search by tag to relocate articles in their own library, as well as in the libraries of other users. User and overall tag clouds allow users to see commonly used or popular tags for an article or for the entire tool.

Since CiteULike tags are often associated with journal articles, it is possible to collect author keywords and descriptors for many of the articles. Thus, a comparison can be made between user tags, author keywords and intermediary descriptors attached to a single article.

3. Related Studies

Bowker and Star (1999) suggest that classification is a basic practice of all humans. (Bowker and Star 1999) Traditional classification methods have tended to rely on trained indexers, cataloguers or taxonomists to organise and describe information. While other groups have been involved in creating keywords or index terms (for example, journal article authors who are asked to provide a certain number of keywords with their submitted articles), these keywords generally have a small circulation and are not widely used.
Such small scale indexing is common but generally covers a narrow range of topics and is specific to the article. Additionally, such keywords are often derived from the work itself and may or may not have wide circulation outside a small subset of the field. Collaborative tagging systems such as CiteULike allow users to publicly participate in the classification of journal articles.
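
The per-article comparison between user tags, author keywords and intermediary descriptors can be illustrated with a minimal Python sketch. It is not the script used in this study: the function names, the normalisation rule (lowercasing and collapsing whitespace) and the example terms are all assumptions made for illustration.

```python
def normalise(term):
    """Lowercase a term and collapse whitespace so simple variants compare equal."""
    return " ".join(term.lower().split())


def compare_term_sets(tags, keywords, descriptors):
    """Return overlaps between the three term sets assigned to one article."""
    t = {normalise(x) for x in tags}
    k = {normalise(x) for x in keywords}
    d = {normalise(x) for x in descriptors}
    return {
        "all_three": t & k & d,          # terms all three groups agree on
        "tags_and_keywords": t & k,      # users agreeing with authors
        "tags_and_descriptors": t & d,   # users agreeing with indexers
        "tags_only": t - k - d,          # terms unique to users
    }


# Invented example data, for illustration only.
example = compare_term_sets(
    tags=["protein folding", "toread", "Simulation"],
    keywords=["Protein Folding", "molecular dynamics", "simulation"],
    descriptors=["Protein Folding", "Computer Simulation"],
)
```

Exact string overlap of this kind only captures literal agreement; near matches such as "simulation" and "computer simulation" are missed and would need a thesaurus-based comparison.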

To discover if tags can truly provide a useful replacement or enhancement for controlled vocabularies, it is important to examine whether or not they provide a similar contextual dimension to the existing classification systems. While it seems unlikely that untrained users will produce a full featured classification system similar to the traditional library systems, it is possible to examine the tags they do assign to see how they compare to the descriptors assigned by a trained indexer and to keywords assigned by authors.

Adam Mathes (2004) notes that there are three major groups that are commonly involved in the classification of documents. These groups are authors, intermediaries and users. (Mathes 2004) While intermediary index terms (often subject headings) have been widely promulgated, author keywords and user terminology have tended to be relatively local. In fact, author keywords have received relatively little attention in the literature. (Kipp 2006; Ansari 2005; Voorbij 1998) While intermediaries have been indexing documents for some time, the development of large scale user created collections of tagged documents is new.

This leads one to ask whether user categories are indeed different from subject headings or author keywords and, if so, how they differ. Are there differences in context, type, or some other semantic relationship? If so, it could be quite important to examine the differences between these categories and the reasons that they do not appear in traditional classification systems. Perhaps these categories are considered to be too short term, too user centric or too subjective to be included? Terms such as @toread and cool, after all, do not describe the aboutness of a document and would seem to be of little use in the organisation and retrieval of information. Yet, they are an important part of the phenomenon of tagging.
(Kipp 2007) These short term and highly specific tags suggest important differences between user classification systems and author or intermediary classification systems.

Descriptive statistics can be used to make a basic comparison of the indexing practices of each of the three groups involved in the classification of journal articles (users of a document, authors of a document, and intermediaries or indexers of a document). Additionally, a comparison can be made at the level of the assigned metadata itself. Tags can be examined to see how well they fit the aboutness of the document and to see how closely they match the existing descriptors and author keywords already assigned to the documents.

A few studies have made comparisons of different types of keywords. Voorbij (1998) studied the correspondence between words in the titles of monographs in the humanities and social sciences and librarian assigned descriptors existing in the online public access catalogue of the National Library of the Netherlands. His study used the different relationships in a thesaurus as an indication of closeness of match, beginning with an exact (or almost exact) match, continuing to synonyms, narrower terms, broader terms, related terms, relationships not formally in the thesaurus, and terms which did not appear in the title at all. (Voorbij 1998, 468) A similar study by Ansari (2005) examined the degree of exact and partial match between title keywords and the assigned descriptors of medical theses in Farsi. She found that the degree of match was greater than 70 per cent. (Ansari 2005, 414)

Both studies suggest that title keyword searching alone and controlled vocabulary searching alone lead to failure to find some articles. However, there is very little research in this area. Consequently, this study continues to examine the question of convergence between tags, keywords and descriptors by exploring the tagging phenomenon as it is growing at CiteULike.
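
The thesaural scale of match described above (exact match, synonym, narrower term, broader term, related term) can be sketched in Python as follows. This is an illustrative sketch only and does not reproduce any script from Voorbij's study or this one: the toy thesaurus, its entries and the function name `classify_match` are hypothetical. The last two levels of the scale (relationships not formally in the thesaurus, and no match at all) require human judgement, so the sketch collapses them into a single fallback category.

```python
# Hypothetical toy thesaurus. Each descriptor lists its synonyms, narrower,
# broader and related terms. A real thesaurus would be loaded from its
# published data rather than written by hand.
THESAURUS = {
    "proteins": {
        "synonyms": {"protein"},
        "narrower": {"membrane proteins"},
        "broader": {"macromolecules"},
        "related": {"peptides"},
    },
}

# Order follows the closeness-of-match scale: closest relationship first.
RELATIONS = ("synonyms", "narrower", "broader", "related")


def classify_match(tag, descriptor, thesaurus=THESAURUS):
    """Classify how closely a tag matches a descriptor using the thesaurus."""
    tag = tag.strip().lower()
    descriptor = descriptor.strip().lower()
    if tag == descriptor:
        return "exact"
    entry = thesaurus.get(descriptor, {})
    for relation in RELATIONS:
        if tag in entry.get(relation, set()):
            return relation
    # Informal relationships and non-matches cannot be told apart here.
    return "no formal relationship"
```

For example, `classify_match("Protein", "Proteins")` returns `"synonyms"` with this toy data, while a personal tag such as `"toread"` falls through to `"no formal relationship"`.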

This study posed the following research question:

● To what extent do term usage patterns of user tags, author keywords and intermediary descriptors suggest a similar context between users, authors and intermediaries?

4. Methodology

This study builds on previous work (Kipp 2006) which examined three forms of index term creation originating from three different groups: users of a document, authors of a document and intermediaries or indexers of a document.

In Kipp (2006) it was found that while users often did use terms which were directly from the thesaurus used to assign descriptors to the articles, terms were also often similar or related terms which were not formally linked in the thesaurus. The most prominent example was the use of information retrieval versus information seeking (related but distinct areas of research). Additionally, users tended to include personal information management terminology such as 'toread' in their tag sets, but were less likely to include geographic information. (Kipp 2006) While the findings from the preliminary study showed that there were differences in the way users, authors and intermediaries classified documents, the size of the data set (165 articles) made it difficult to generalise these findings to larger data sets from other fields. A larger data set, from a different field, which showed similar patterns of term usage and thesaural matches would strengthen conclusions from the earlier study.

Tag data for the current study was collected from CiteULike between January 12, 2007 and January 24, 2007 via a Python script (CiteULike.py). Author keywords and descriptors were collected from on-line journal databases and Pubmed respectively using additional Python scripts.

Journals selected for this study were chosen because they: a) are biology related, b) require authors to submit keywords for their articles and c) are indexed in Pubmed using Medical Subject Headings (MeSH).
Two journals were selected for this study: Proteins and Journal of Molecular Biology. All articles from these journals which had been tagged on CiteULike by at least one user were collected. To ensure that all articles from these journals were collected, the Python script was designed to collect under all common variants of their names (e.g. J. Mol. Biol. for Journal of Molecular Biology). (These results were parsed to exclude currently untagged articles. To aid in the location of new articles, CiteULike also provides listings for articles from selected journals that have not yet been tagged.)

Data collected included title, journal name, volume, issue, page numbers, author names, abstract where available, and URLs providing access to the article or its abstract. URLs were collected for each article and automatically separated into categories as potential sources of keywords or descriptors. Digital Object Identifiers (DOIs - http://www.doi.org/) were selected by preference as a source of author keywords for journal articles and Pubmed URLs were used to locate descriptors (in this case MeSH indexing terms).

All articles were then located in Pubmed and on publicly available abstract pages from on-line journal database sites using the URLs collected from CiteULike. Where possible, Pubmed URLs and DOI URLs were used directly; otherwise a series of scripts was used