The benefit of using Tag-Based Profiles Claudiu s firan Wolfgang NejdI Raluca pain L3S Research Center L3S Research cente L3S Research Center Appelstrasse 9a appelstrasse a 60167 Hanover, Germany 30167 Hanover, Germany 30167 Hanover, Germany firan l3S. de nejdI@L3S.de pain @L3S.de Abstract Foaming the Music or Yahoo! Music are a few of the currently online available music recommendation systems Collaborative tagging, i.e. the process of assigning These systems employ different approaches for recom- metadata in the form of keywords to shared content by many mending music tracks to their users, ranging from content users, has emerged as an important way to provide informa- based and collaborative filtering techniques to hybrid meth- tion about resources on the Web and elsewhere. Such key- ods. While clearly useful to their users, these more conven- words(tags)are used to enable the organization of infor- tional recommendation techniques still suffer from a num mation within personal information spaces, such as photo ber of problems: in the case of collaborative filtering,mu- collections, but can also be shared, allowing browsing and sical pieces with no ratings cannot be recommended be- earching with the help of tags attached by other users to cause recommendations are based on actual user ratings information resources from the Web. Recent research has Besides, artist variety in recommended pieces can be poor, shown that such tag distributions stabilize over time and can making these recommendations less satisfactory than they be used to improve search on the Web. In this paper we are could be. Recommending tracks that are similar to users interested in another aspect, namely how they characterize favorites in terms of content induces unreliability in model- the user and enable personalized recommendations. Using ing users' preferences and besides, content similarity does data from a frequently used music search portal, Last. fm, not necessarily reflect preferences. Hybrid recommendation we analyze tag usage and statistics and investigate the use techniques combine the advantages of the two approaches of tag-based user profiles in contrast to conventional user and are thus better. However, to our knowledge, the only profiles based on song and track usage. We specify recom- usable music recommender system using hybrid techniques mendation algorithms based on tag user profiles, and ex- is Foafing the Music, which relies heavily on FOAF profiles plore how collaborative filtering recommendations based created by the users-not an easy task for non-expert users on these tag profiles are different from recommendations Last, but not least, though many of these community sites based on song/track profiles. Finally, we describe a new allow tagging, these tags are not used for recommendation search-based method, which uses tags to recommend songs or any form of advanced search. interesting to a user, yielding substantially improved results The rest of this paper is structured as follows: Sectio The paper finishes with a discussion of some future work to 2 presents a short motivating scenario, in Section 3 we dis further improve tag-based search and recommendation in cuss existing approaches to music recommendations. We continue with the description of the Last. fm dataset, which we used in our experiments(Section 4). In Section 5 we define track-based and respectively tag-based profiles, dis 1. Introductio cuss how they differ from each other and present appropri ate algorithms for creating these user profiles. Using the algorithms described in Section 5, we then propose 7 differ More and more companies start offering personalized ent music recommendation methods(Section 6)based on services toward their users and online music recommender systems are one prominent example. Pandora, Last. fm, shttp://foafing-the-music.iua.upf.edu/ http://music.yahoo.com/ 'Method of making automatic predictions about the interests of a user http://www.last.fml by collecting preference information from many user
The Benefit of Using Tag-Based Profiles Claudiu S. Firan L3S Research Center Appelstrasse 9a 30167 Hanover, Germany firan@L3S.de Wolfgang Nejdl L3S Research Center Appelstrasse 9a 30167 Hanover, Germany nejdl@L3S.de Raluca Paiu L3S Research Center Appelstrasse 9a 30167 Hanover, Germany paiu@L3S.de Abstract Collaborative tagging, i.e. the process of assigning metadata in the form of keywords to shared content by many users, has emerged as an important way to provide information about resources on the Web and elsewhere. Such keywords (tags) are used to enable the organization of information within personal information spaces, such as photo collections, but can also be shared, allowing browsing and searching with the help of tags attached by other users to information resources from the Web. Recent research has shown that such tag distributions stabilize over time and can be used to improve search on the Web. In this paper we are interested in another aspect, namely how they characterize the user and enable personalized recommendations. Using data from a frequently used music search portal, Last.fm, we analyze tag usage and statistics and investigate the use of tag-based user profiles in contrast to conventional user profiles based on song and track usage. We specify recommendation algorithms based on tag user profiles, and explore how collaborative filtering recommendations based on these tag profiles are different from recommendations based on song/track profiles. Finally, we describe a new search-based method, which uses tags to recommend songs interesting to a user, yielding substantially improved results. The paper finishes with a discussion of some future work to further improve tag-based search and recommendation in community web sites. 1. Introduction More and more companies start offering personalized services toward their users and online music recommender systems are one prominent example. Pandora1 , Last.fm2 , 1http://www.pandora.com/ 2http://www.last.fm/ Foafing the Music3 or Yahoo! Music4 are a few of the currently online available music recommendation systems. These systems employ different approaches for recommending music tracks to their users, ranging from content based and collaborative filtering techniques to hybrid methods. While clearly useful to their users, these more conventional recommendation techniques still suffer from a number of problems: in the case of collaborative filtering5 , musical pieces with no ratings cannot be recommended because recommendations are based on actual user ratings. Besides, artist variety in recommended pieces can be poor, making these recommendations less satisfactory than they could be. Recommending tracks that are similar to users’ favorites in terms of content induces unreliability in modeling users’ preferences and besides, content similarity does not necessarily reflect preferences. Hybrid recommendation techniques combine the advantages of the two approaches and are thus better. However, to our knowledge, the only usable music recommender system using hybrid techniques is Foafing the Music, which relies heavily on FOAF profiles created by the users - not an easy task for non-expert users. Last, but not least, though many of these community sites allow tagging, these tags are not used for recommendation or any form of advanced search. The rest of this paper is structured as follows: Section 2 presents a short motivating scenario, in Section 3 we discuss existing approaches to music recommendations. We continue with the description of the Last.fm dataset, which we used in our experiments (Section 4). In Section 5 we define track-based and respectively tag-based profiles, discuss how they differ from each other and present appropriate algorithms for creating these user profiles. Using the algorithms described in Section 5, we then propose 7 different music recommendation methods (Section 6) based on 3http://foafing-the-music.iua.upf.edu/ 4http://music.yahoo.com/ 5Method of making automatic predictions about the interests of a user by collecting preference information from many users
collaborative filtering and search. We evaluated our algo- Site Summary(RSS) vocabularies for recommending mu- rithms through a user study, the experimental setup and the sic to a user, depending on her musical tastes. Music in- results we obtained are summarized in Section 7. Sectior formation, such as new album releases, related news artists. 8 presents our conclusions and several ideas to be further or available audio pieces, is gathered from RSS feeds from investigated the Web. whereas foaf documents are used to define user preferences (i.e, for building the user profiles). An 2.A Motivating Scenario other hybrid music recommendation method is presented in 6, which simultaneously considers user ratings and con- tent similarity and is based on a three-way aspect model, As motivation for our work, let us look at the following so that it can directly represent substantial(unobservable) simple example: Tom is a programmer in a big company user preferences as a set of latent variables introduced in nd while writing code he likes to listen to music. His pre- a Bayesian network. Probabilistic relations over users, rat- ferred music is"electronic and related kinds of music. On ings, and contents are statistically estimated. Our approach his desktop Tom has stored quite a few MP3 files with his differs from these previous ones by the fact that the user pro- favorite music, but as he has been listening to them over file is inferred automatically from his desktop music data and over again, he would really like a change. He therefore without any additional manual effort from the user's side decides to use an online music recommender system. He and that we rely on tag information instead of track-based pens the corresponding web page in his browser, but be- fore being able to listen to some new music. he needs to pro- profiles ide personal information like age, location and gender. He A totally different approach for producing music recom- mendations is presented in [4]. Their method is applied also has to indicate one or more of his favorite artists. After to an interactive music system that generates playlists fit- finishing this long registration process he can finally start ting the preferences indicated by the user. For automati using the system, but to his bad surprise, the first track he cally generating music playlists, their approach uses a local gets recommended is a piece he already has on his desktop. search procedure in the solution space, based on simulated (hopefully) will get better as the system learns his prefer space moving from one solution to a neighboring solution, compares their quality and stops when an optimal solution ences. However, this intensive interaction with the online is found music recommender system in the initial phase is quite an- 5] describes a system that queries web search engines noying for Tom, and certainly his group manager will not for pages related to artists, downloads the pages, extracts be too happy if he sees Tom switching back and forth be- text and natural language features, and analyzes these fea- tween his working environment and the web browser for tures in order to produce textual summary descriptions of rating music pieces. What Tom needs is a system not suf- each artist. These descriptions are used to compute simi- fering from this cold start problem, a system that is able to infer his profile without much interaction, but still with the larity between artists and can be further used for produc ng recommendations. However, the paper does not present ability to provide good music recommendations any evaluation experiments regarding the quality of the rec ommendations received by using this technique. Also, our 3 Related work approach differs from the one presented in [5] by the fact that they are searching the web for finding similar artists Most available music recommender systems are based on whereas we rely on user tags(in particular from the Last.fm collaborative filtering methods; i.e., they recommend mu- web site) for grouping tracks and artists and for building up sic to a user by considering some other users' ratings for the same music pieces. This technique is quite widely uti- The most similar approach to ours is [1, which uses col- lized,including music shopping services like Amazon?' laboratively created data from the Web for making recom- mendations. However, their goal is generating personalized ommendation method suffers from the cold start problem, tag recommendations for users of social bookmarking sites as described in Section 2 such as Delicious. Techniques for recommending tags do a better recommendation technique is described in [2] already exist and are based on the popularity of tags among The paper gives an overview of the Foafing the Music sys- all users, on time usage information, or on simple heuristics tem, which uses the Friend-of-a-Friend (FOAF) and rich extract keywords from the URl being tagged. Their ap proach complements these techniques and is based on rec- Throughout the paper with"desktop"we mean any personal computer. ommending tags from URLs that are similar to the one in http://www.amazon.com
collaborative filtering and search. We evaluated our algorithms through a user study, the experimental setup and the results we obtained are summarized in Section 7. Section 8 presents our conclusions and several ideas to be further investigated. 2. A Motivating Scenario As motivation for our work, let us look at the following simple example: Tom is a programmer in a big company and while writing code he likes to listen to music. His preferred music is “electronic” and related kinds of music. On his desktop6 Tom has stored quite a few MP3 files with his favorite music, but as he has been listening to them over and over again, he would really like a change. He therefore decides to use an online music recommender system. He opens the corresponding web page in his browser, but before being able to listen to some new music, he needs to provide personal information like age, location and gender. He also has to indicate one or more of his favorite artists. After finishing this long registration process he can finally start using the system, but to his bad surprise, the first track he gets recommended is a piece he already has on his desktop. Tom has the possibility to rate the music this system recommends to him, such that over time the recommendations (hopefully) will get better as the system learns his preferences. However, this intensive interaction with the online music recommender system in the initial phase is quite annoying for Tom, and certainly his group manager will not be too happy if he sees Tom switching back and forth between his working environment and the web browser for rating music pieces. What Tom needs is a system not suffering from this cold start problem, a system that is able to infer his profile without much interaction, but still with the ability to provide good music recommendations. 3. Related Work Most available music recommender systems are based on collaborative filtering methods; i.e., they recommend music to a user by considering some other users’ ratings for the same music pieces. This technique is quite widely utilized, including music shopping services like Amazon7 or iTunes8 , and has proven to be effective. However, this recommendation method suffers from the cold start problem, as described in Section 2. A better recommendation technique is described in [2]. The paper gives an overview of the Foafing the Music system, which uses the Friend-of-a-Friend (FOAF) and Rich 6Throughout the paper with “desktop” we mean any personal computer. 7http://www.amazon.com/ 8http://www.apple.com/itunes/ Site Summary (RSS) vocabularies for recommending music to a user, depending on her musical tastes. Music information, such as new album releases, related news artists, or available audio pieces, is gathered from RSS feeds from the Web, whereas FOAF documents are used to define user preferences (i.e., for building the user profiles). Another hybrid music recommendation method is presented in [6], which simultaneously considers user ratings and content similarity and is based on a three-way aspect model, so that it can directly represent substantial (unobservable) user preferences as a set of latent variables introduced in a Bayesian network. Probabilistic relations over users, ratings, and contents are statistically estimated. Our approach differs from these previous ones by the fact that the user pro- file is inferred automatically from his desktop music data without any additional manual effort from the user’s side, and that we rely on tag information instead of track-based profiles. A totally different approach for producing music recommendations is presented in [4]. Their method is applied to an interactive music system that generates playlists fitting the preferences indicated by the user. For automatically generating music playlists, their approach uses a local search procedure in the solution space, based on simulated annealing: the algorithm iteratively searches the solution space moving from one solution to a neighboring solution, compares their quality and stops when an optimal solution is found. [5] describes a system that queries web search engines for pages related to artists, downloads the pages, extracts text and natural language features, and analyzes these features in order to produce textual summary descriptions of each artist. These descriptions are used to compute similarity between artists and can be further used for producing recommendations. However, the paper does not present any evaluation experiments regarding the quality of the recommendations received by using this technique. Also, our approach differs from the one presented in [5] by the fact that they are searching the web for finding similar artists, whereas we rely on user tags (in particular from the Last.fm web site) for grouping tracks and artists and for building up user profiles. The most similar approach to ours is [1], which uses collaboratively created data from the Web for making recommendations. However, their goal is generating personalized tag recommendations for users of social bookmarking sites such as Delicious9 . Techniques for recommending tags do already exist and are based on the popularity of tags among all users, on time usage information, or on simple heuristics to extract keywords from the URL being tagged. Their approach complements these techniques and is based on recommending tags from URLs that are similar to the one in 9http://del.icio.us/
question according to two variants of cosine similarity met rics. We use tags as well, but use them for recommending nteresting songs based on tag-based profiles 4. Last. fm Functionalities and Usage Data 62976 perimental allout 9 indie rock As basis for our experiments we have used a crawl of 10 classic rock the last. fm website. Claiming to be the"social music rev- 11 female vocalists184,47836instrumental olution, Last. fm is a great place to congregate and share musical tastes, giving to its users access to thousands of 754 tracks from all musical genres. Before using the system 857140indie pop users need to create an account and specify their preferred hardcore thrash metal The interaction with the service can be don ower metal 73,921 through the web interface, via the embedded Flash player, or through the Last. fm player which the users need to down- 9Os essive metal730 load and install locally on the desktop. By listening to tracks and rating them(there are three options: Love this track, 24 black metal eggan 25 progressive rock Skip this track or Ban this track), user profiles are created. Based on these profiles, the service produces a number of Table 1. Number of times the most popular 5 personalized features, such as finding out about artists that tags were used on Last. fm the user likes, other people with similar tastes, appropriate charts, or events in the neighborhood. Based on the music the user likes. Last. fm connects her to other users that have similar tastes(her neighbors). However, for using this tea- tion regarding the number of times each tag has been used about one week before their neighbors appear number of users which have used the tag as well as lists of the list of neighbors is updated, based on what the user has similar tags together with their similarity scores. Based on been listening to during that week the data we have crawled we can observe that the usage of Other social connections can be created on Last. fm the tags follows largely a power law distribution, except for through friendship relations: the user can add friends the sharp drop at the end( Figure 1): after the top first 76 through an appropriate button, and accept friendship re most used tags(representing 0.36% of the tags) the usage quests to make the friendship"mutual". As with neighbors, number falls down to 5%o of the usage number correspond they can listen to their friends'radio stations. Users with ing to the most used tag("rock"). At the time of our crawl similar interests can get together also by joining groups Music statistics are generated for groups as well as for in- dividual users. Anyone can create these groups and encour- 100,000 age people to join. It is also possible for a group founder to 0,000 restrict access to the group and its associated forum Another interesting functionality offered by Last. fm which we will focus on in this paper is the possibility of tt马 agging: users can assign as many tags as they want to any track, album or artist in their profile. With tagging, they can find their items better, can view everyone or choose to play a tag-featured radio(music that has been tagged with some specific tag) For the purpose of our investigations, we crawled an ex Figure 1. Tag Distribut tensive subset of the Last. fm website, namely pages corre- acks and user profiles. All infor mation extracted was indexed and stored into a Lucene lo tags have been used by the users of the Last. fm system for a total number of 18, 735, 549 times. The top 50 most used Tags. We could identify a number of 21, 177 different tags are summarized in Table 1 tags, which are used on Last. fm for tagging tracks, artists or An analysis of the top 100 most used tags showed tha approximately 60%o of the tags represent genre descriptions http://ucene.apacheorg while the rest of 40% is shared among tags describing per-
question according to two variants of cosine similarity metrics. We use tags as well, but use them for recommending interesting songs based on tag-based profiles. 4. Last.fm Functionalities and Usage Data As basis for our experiments we have used a crawl of the Last.fm website. Claiming to be the “social music revolution”, Last.fm is a great place to congregate and share musical tastes, giving to its users access to thousands of tracks from all musical genres. Before using the system, users need to create an account and specify their preferred music genre. The interaction with the service can be done through the web interface, via the embedded Flash player, or through the Last.fm player which the users need to download and install locally on the desktop. By listening to tracks and rating them (there are three options: Love this track, Skip this track or Ban this track), user profiles are created. Based on these profiles, the service produces a number of personalized features, such as finding out about artists that the user likes, other people with similar tastes, appropriate charts, or events in the neighborhood. Based on the music the user likes, Last.fm connects her to other users that have similar tastes (her neighbors). However, for using this feature, users need to listen to at least 5 artists and wait for about one week before their neighbors appear. Every week, the list of neighbors is updated, based on what the user has been listening to during that week. Other social connections can be created on Last.fm through friendship relations: the user can add friends through an appropriate button, and accept friendship requests to make the friendship “mutual”. As with neighbors, they can listen to their friends’ radio stations. Users with similar interests can get together also by joining groups. Music statistics are generated for groups as well as for individual users. Anyone can create these groups and encourage people to join. It is also possible for a group founder to restrict access to the group and its associated forum. Another interesting functionality offered by Last.fm which we will focus on in this paper is the possibility of tagging: users can assign as many tags as they want to any track, album or artist in their profile. With tagging, they can find their items better, can view everyone else’s tags, or choose to play a tag-featured radio (music that has been tagged with some specific tag). For the purpose of our investigations, we crawled an extensive subset of the Last.fm website, namely pages corresponding to tags, music tracks and user profiles. All information extracted was indexed and stored into a Lucene10 index. Tags. We could identify a number of 21,177 different tags, which are used on Last.fm for tagging tracks, artists or 10http://lucene.apache.org/ Nr. Tag Freq. Nr. Tag Freq. 1 rock 866,040 26 hard rock 114,871 2 indie 488,944 27 industrial 110,207 3 seen live 478,098 28 punk rock 108,388 4 alternative 453,081 29 heavy metal 107,851 5 electronic 346,832 30 japanese 106,091 6 metal 312,221 31 experimental 104,122 7 pop 299,499 32 favorites 102,619 8 punk 247,665 33 chillout 98,663 9 indie rock 206,876 34 rap 97,106 10 classic rock 206,654 35 soundtrack 90,875 11 female vocalists 184,478 36 instrumental 87,022 12 alternative rock 180,320 37 hip hop 81,780 13 jazz 174,782 38 british 76,549 14 electronica 167,989 39 trance 76,460 15 hip-hop 158,571 40 indie pop 75,997 16 singer-songwriter 139,041 41 soul 75,692 17 folk 137,598 42 metalcore 74,989 18 hardcore 133,682 43 thrash metal 74,786 19 emo 130,831 44 power metal 73,921 20 death metal 129,766 45 progressive metal 73,800 21 dance 127,282 46 90s 72,279 22 80s 123,183 47 blues 68,931 23 ambient 119,680 48 acoustic 65,627 24 black metal 119,468 49 reggae 65,355 25 progressive rock 116,036 50 post-rock 64,095 Table 1. Number of times the most popular 50 tags were used on Last.fm. albums. For each of these tags we have extracted information regarding the number of times each tag has been used, number of users which have used the tag, as well as lists of similar tags together with their similarity scores. Based on the data we have crawled we can observe that the usage of the tags follows largely a power law distribution, except for the sharp drop at the end (Figure 1): after the top first 76 most used tags (representing 0.36% of the tags) the usage number falls down to 5% of the usage number corresponding to the most used tag (“rock”). At the time of our crawl Figure 1. Tag Distribution tags have been used by the users of the Last.fm system for a total number of 18,735,549 times. The top 50 most used tags are summarized in Table 1. An analysis of the top 100 most used tags showed that approximately 60% of the tags represent genre descriptions, while the rest of 40% is shared among tags describing per-
sonal impressions(e.g,"seen live"), artists(e.g,"female vocalists”), time period(eg“80s”), country of prove 10000 nience, soundtrack, tempo, or instruments Tracks. From the Last. fm music database we crawled 1.000 a total number of 317. 058 tracks and their associated at tributes, including track and artist name, number of times oteE2z the track has been played on Last. fm, albums featuring the track, users'comments and tags which have been used for tagging the tracks and the corresponding tags'scores(based on the number of times each tag has been used for taggin 1,00010000100000 each track). Figure 2 presents the variation of the track popularity over the collection of tracks, showing a similar distribution as in Figure 1, except for the power law curve Figure 3. User- Tag Distribution decreasing slower. 100000 收°tE 1 1,00010000100.000 User number Figure 4. User- Track Distribution ure 2. Track Distribution Users. We extracted details about 289 654 registered those tags that the users applied themselves to other track users on Last. fm. For these users the extracted information or artists. Although most music recommendation methods includes user ID, gender, age, location, registration date, IIs- use track-based algorithms to present the user with new tened number of tracks, list of friends and neighbors, overall teresting tracks, given the increasing tendency toward tag- top listened tracks and list of used tags. From this number ging all types of multimedia files on the Internet we wanted to investigate how these tags could be used for recommen to at least 50 tracks and who have not used at least 10 dif- dations. We also wanted to avoid extensive manual ranking. ferent tags, and were thus left with 12, 193 unique users for so opted to construct user profiles from locally stored MP3 our experiments. For these 12, 193 users the distribution of files. This usually works very well, as most users have quite the number of users and the total number of times each user much more than needed for a music recommender system made use of tags, by the time of our crawling Similarly, the relation user to total number of listened to provide satisfactory results tracks is shown in Figure 4. Note that both figures 3 and 4 For the rest of the paper we will distinguish between pro- are plotted for the set of users having listened to at least 50 files created for Last. fm users and for Non-Last fm users tracks and having used at least 10 different tags -therefore They differ in terms of the source of information which the sudden cutoff is used for building up the profiles: for Last. fm users the starting point is represented by their web pages on Last. fm, whereas in the case of the Non -Last fm users we start from 5. Tag-Based Profiles vs. Track-Based Profiles the information available on their desktops. For this type of users we extract metadata about each track existing on the Tags are more and more widely used, which makes it desktop, and match artist and track name(extracted from easier to extract them for all type oratio ename or ID3 tag) against the last fm music database including music tracks. We chose to extract available This, as we will show in section 5.2 provides all data nec- man annotations(tags) from Last. fm tracks or directly essary to create comprehensive user profiles that accurately
sonal impressions (e.g., “seen live”), artists (e.g., “female vocalists”), time period (e.g., “80s”), country of provenience, soundtrack, tempo, or instruments. Tracks. From the Last.fm music database we crawled a total number of 317,058 tracks and their associated attributes, including track and artist name, number of times the track has been played on Last.fm, albums featuring the track, users’ comments and tags which have been used for tagging the tracks and the corresponding tags’ scores (based on the number of times each tag has been used for tagging each track). Figure 2 presents the variation of the track popularity over the collection of tracks, showing a similar distribution as in Figure 1, except for the power law curve decreasing slower. Figure 2. Track Distribution Users. We extracted details about 289,654 registered users on Last.fm. For these users the extracted information includes user ID, gender, age, location, registration date, listened number of tracks, list of friends and neighbors, overall top listened tracks and list of used tags. From this number of users, we filtered out all users who have not yet listened to at least 50 tracks and who have not used at least 10 different tags, and were thus left with 12,193 unique users for our experiments. For these 12,193 users the distribution of tags is plotted in Figure 3, showing the correlation between the number of users and the total number of times each user made use of tags, by the time of our crawling. Similarly, the relation user to total number of listened tracks is shown in Figure 4. Note that both figures 3 and 4 are plotted for the set of users having listened to at least 50 tracks and having used at least 10 different tags – therefore the sudden cutoff. 5. Tag-Based Profiles vs. Track-Based Profiles Tags are more and more widely used, which makes it easier to extract them for all types of information objects, including music tracks. We chose to extract available human annotations (tags) from Last.fm tracks or directly use Figure 3. User - Tag Distribution Figure 4. User - Track Distribution those tags that the users applied themselves to other tracks or artists. Although most music recommendation methods use track-based algorithms to present the user with new interesting tracks, given the increasing tendency toward tagging all types of multimedia files on the Internet we wanted to investigate how these tags could be used for recommendations. We also wanted to avoid extensive manual ranking, so opted to construct user profiles from locally stored MP3 files. This usually works very well, as most users have quite a few music files in MP3 format on their desktop, usually much more than needed for a music recommender system to provide satisfactory results. For the rest of the paper we will distinguish between pro- files created for Last.fm users and for Non-Last.fm users. They differ in terms of the source of information which is used for building up the profiles: for Last.fm users the starting point is represented by their web pages on Last.fm, whereas in the case of the Non-Last.fm users we start from the information available on their desktops. For this type of users we extract metadata about each track existing on the desktop, and match artist and track name (extracted from filename or ID3 tag) against the Last.fm music database. This, as we will show in section 5.2 provides all data necessary to create comprehensive user profiles that accurately
reflect users' music preferences. We will use the following matched against the Last. fm database and only tracks with a TFxIDF score above 0.9 are kept for further processing ITF(TG)=Inverse Tag Frequency for Tag TG This pre-processing step is described below P(TR, U)=Preference of User U for Track TR Ale. 5.1.2.1: Get list of tracks P(TG, U)=Preference of User U for Tag TG 1: For each track(MP3) TR-U-Listened= Number of times User U has listened to Track TR Extract artist name and track name from filename as s/ Extract artist name and track name from ID3 tag as S2 Number of times Track T R was overall listened on earch with S on Last. fm index TG_Used For-TR= Number of times Track TR was tagged with Tag TG by tching query with Lucene TFxlDF score >0.9 dd tracks lt to the user's list of tracks utr TG. Used. Overall Number of times Tag TG was used overall 8: Return UTR TG-Used By- U= Number of times User U has used Tag TG Tracks Containing TG=Number of tracks on the User's Desktop that Once the list of tracks for a Non -Last. fm user is cre- ated, the track-profile is realized in a similar manner as for a Last. fm user: algorithm 5.1.1.1 is applied on the list of We distinguish between Last. fm and Non-Last fm users us- tracks with the only difference that the preference scores ng either Last. fm or Non-Last fim indices in the correspond- for the tracks are now a function of the overall number of ing formulas times tracks have been listened on last fm 5.1. Track-based Profiles Alg 5.1.2.2: Track-based profile for Non-Last fm user 1: Create list of tracks UTR applying Alg. 5.1.2.1 Track-based profiles are defined as collections of music 2: Apply Alg 5.1.1.1 on list of tracks UTR tracks with associated preference scores, describing users usical tastes. as follows: with P=P(TR, UNon-Last. fm)=log(TR.OverallListened) Profile Tracks TRi, P>ITR: =user's track, P=P(TR&, U)) 5.2. Tag-based Profiles 5.1.1 Track-based profiles for last. fm users Tag-based user profiles are defined as collections of In the case of Last. fm users the profiles are inferred from tags together with corresponding scores representing the the users web pages available on the Last. m site. Their user's interest in each of these tags. The formal definition is ollection of tracks includes all tracks the users have been given below. listening to inside the system. Their associated scores are a Profile Tags(U function of the number of times users have listened to these [< TGi, P>ITGi -user's tag Pi= p(TGi, U)H music tracks. The algorithm for creating the track-based profiles for this type of users is described below Again, we distinguish between profiles created for Last. fm and non-Last fm users. For this type of profiles, the list of Alg 5.1.1.1: Track-based profile for Last. fm user tags can be extracted either from the users' list of tracks 1: For (tags which have been used to tag the tracks) or directly from the tags the users have used themselves The different 4: Return user's track-based profile profile variants of tag-based profiles for Last. fm and Non-Last fm with P=P(TR, ULastf m)=log(TR,ULastfm Listened users are described in detail in Sections 5.2.1 and 5.2.2 5.2.1 Tag-based profiles for Last. fm users 5.1.2 Track-based profiles for Non- Last. fm users For Last. fm users, the first type of tag-based profiles can be created starting from the list of tracks the users have been For Non-Last fm users the only available source of personal information is represented by their desktops. We first e I The TExIDF score(term frequency-inverse document freque tract explicit metadata such as artist and track name either weight often used in information retrieval and text mining for evaluating from the filename or from the ID3 tags(if any)of the mu- importance increases proportionally to the number of times a word appears ic files existing on the desktop. This information is ther in the document but is offset by the frequency of the word in the corpus
reflect users’ music preferences. We will use the following notations: IT F (T G) = Inverse Tag Frequency for Tag T G p(T R, U) = Preference of User U for Track T R p(T G, U) = Preference of User U for Tag T G T R U Listened = Number of times User U has listened to Track T R T R Overall Listened = Number of times Track T R was overall listened on Last.fm T G UsedF or T R = Number of times Track T R was tagged with Tag T G by all users T G Used Overall = Number of times Tag T G was used overall T G UsedBy U = Number of times User U has used Tag T G T racks Containing T G = Number of tracks on the User’s Desktop that were tagged with Tag T G We distinguish between Last.fm and Non-Last.fm users using either Last.fm or Non-Last.fm indices in the corresponding formulas. 5.1. Track-based Profiles Track-based profiles are defined as collections of music tracks with associated preference scores, describing users’ musical tastes, as follows: P rof ile T racks(U) = {< T Ri, Pi > | T Ri =user’s track, Pi = p(T Ri, U)} 5.1.1 Track-based profiles for Last.fm users In the case of Last.fm users the profiles are inferred from the users’ web pages available on the Last.fm site. Their collection of tracks includes all tracks the users have been listening to inside the system. Their associated scores are a function of the number of times users have listened to these music tracks. The algorithm for creating the track-based profiles for this type of users is described below: Alg. 5.1.1.1: Track-based profile for Last.fm user 1: For each track T R in user’s tracks list UT R 2: Compute track’s score P 3: Add pair < T R, P > to user profile 4: Return user’s track-based profile with P = p(T R, ULast.fm) = log(T R ULast.fm Listened) 5.1.2 Track-based profiles for Non-Last.fm users For Non-Last.fm users the only available source of personal information is represented by their desktops. We first extract explicit metadata such as artist and track name either from the filename or from the ID3 tags (if any) of the music files existing on the desktop. This information is then matched against the Last.fm database and only tracks with a TFxIDF11 score above 0.9 are kept for further processing. This pre-processing step is described below: Alg. 5.1.2.1: Get list of tracks 1: For each track (MP3) on user desktop 2: Extract artist name and track name from filename as S1 3: Extract artist name and track name from ID3 tag as S2 4: Combine S1 and S2 into S 5: Search with S on Last.fm index 6: Retrieve tracks LT matching query with Lucene TFxIDF score >0.9 7: Add tracks LT to the user’s list of tracks UT R 8: Return UT R Once the list of tracks for a Non-Last.fm user is created, the track-profile is realized in a similar manner as for a Last.fm user: algorithm 5.1.1.1 is applied on the list of tracks with the only difference that the preference scores for the tracks are now a function of the overall number of times tracks have been listened on Last.fm. Alg. 5.1.2.2: Track-based profile for Non-Last.fm user 1: Create list of tracks UT R applying Alg. 5.1.2.1 2: Apply Alg. 5.1.1.1 on list of tracks UT R with P = p(T R, UNon−Last.fm) = log(T R Overall Listened) 5.2. Tag-based Profiles Tag-based user profiles are defined as collections of tags together with corresponding scores representing the user’s interest in each of these tags. The formal definition is given below: P rof ile T ags(U) = {< T Gi, Pi > | T Gi =user’s tag, Pi = p(T Gi, U)} Again, we distinguish between profiles created for Last.fm and non-Last.fm users. For this type of profiles, the list of tags can be extracted either from the users’ list of tracks (tags which have been used to tag the tracks) or directly from the tags the users have used themselves. The different variants of tag-based profiles for Last.fm and Non-Last.fm users are described in detail in Sections 5.2.1 and 5.2.2. 5.2.1 Tag-based profiles for Last.fm users For Last.fm users, the first type of tag-based profiles can be created starting from the list of tracks the users have been 11The TFxIDF score (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining for evaluating how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus