Interweaving Public User Profiles on the Web
F. Abel et al.

Service    # crawled profiles   Crawled profile attributes
Facebook   3080                 nickname, first/last/full name, photo, email (hash), homepage, locale settings, affiliations
LinkedIn   3606                 nickname, first/last/full name, about, homepage, location, interests, education, affiliations, industry
Twitter    1538                 nickname, full name, photo, homepage, blog, location
Flickr     2490                 nickname, full name, photo, email, location
Google     15947                nickname, full name, photo, about, homepage, blog, location
(a) Profile attributes

[Fig. 1(b), Completing service profiles: bar chart of the completeness of profiles (0 to 1) per service (# considered profile attributes): Twitter (6), Google (7), Flickr (5), LinkedIn (10), Facebook (9). Series: profile information available in the individual service; profile information available after enrichment with aggregated profile.]

Fig. 1. Service profiles: (a) number of public profiles as well as the profile attributes that were crawled from the different services and (b) completing service profiles with aggregated profile data. Only the 338 users who have an account at each of the listed services are considered.

information gathered from the other services, the completeness increases to more than 98% (see Fig. 1(b)): profile fields that are often left blank, such as location and homepage, can be obtained from the social networking sites. Moreover, even the rather complete Facebook and LinkedIn profiles can benefit from profile aggregation: LinkedIn profiles can, on average, be improved by 7%, even though LinkedIn provides three attributes (interests, education and industry) that are not in the public profiles of the other services (cf. Fig. 2(a)).

In summary, profile aggregation results in an extensive user profile that reveals more information than the profiles at the individual services. Moreover, aggregation can be used to fill in missing attributes at the individual services.
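The enrichment step described above can be illustrated with a minimal sketch. It assumes that a profile is a plain dictionary from attribute name to value and that completeness is the fraction of considered attributes with a non-empty value; the function names, the sample values, and the "first non-empty value wins" merge rule are assumptions made for illustration, not the procedure used in the study.

```python
from typing import Dict, Sequence

Profile = Dict[str, str]


def completeness(profile: Profile, attributes: Sequence[str]) -> float:
    """Fraction of the considered attributes that have a non-empty value."""
    if not attributes:
        return 0.0
    filled = sum(1 for attr in attributes if profile.get(attr))
    return filled / len(attributes)


def enrich(target: Profile, others: Sequence[Profile],
           attributes: Sequence[str]) -> Profile:
    """Fill blank attributes of `target` with the first non-empty value
    found in the other service profiles (assumed merge rule)."""
    enriched = dict(target)
    for attr in attributes:
        if enriched.get(attr):
            continue  # already filled in the target service
        for other in others:
            if other.get(attr):
                enriched[attr] = other[attr]
                break
    return enriched


# The six Twitter attributes listed in Fig. 1(a); the values are made up.
TWITTER_ATTRS = ["nickname", "full name", "photo", "homepage", "blog", "location"]
twitter = {"nickname": "jdoe", "full name": "Jane Doe", "photo": "a.png"}
linkedin = {"nickname": "jdoe", "homepage": "http://example.org", "location": "Hamburg"}

enriched = enrich(twitter, [linkedin], TWITTER_ATTRS)
before = completeness(twitter, TWITTER_ATTRS)
after = completeness(enriched, TWITTER_ATTRS)
```

In this toy case the Twitter profile goes from three to five filled attributes out of six, because homepage and location are taken from the LinkedIn profile while blog stays empty.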
3.3 FOAF and vCard Generation

In most Web 2.0 services, user profiles are primarily intended to be presented to other end-users. However, it is also possible to use the profile data to generate FOAF [3] profiles or vCard [14] entries that can be fed into applications such as Outlook, Thunderbird or FOAF Explorer. Figure 2(a) lists the attributes each service can contribute to fill in a FOAF or vCard profile, if the corresponding fields are filled out by the user. Figure 2(b) shows to what degree the real service profiles of the 338 considered users can actually be used to fill in the corresponding attributes with adequate values.

Using the aggregated profile data of the users, it is possible to generate FOAF profiles and vCard entries to an average degree of more than 84% and 88%, respectively (the corresponding attributes are listed in Fig. 2(a)). Google, Flickr and Twitter profiles provide much less information that can be used to fill in the FOAF and vCard details. Although Facebook and LinkedIn both provide seven attributes that can potentially be applied to generate the vCard profile, it is interesting to see that the actual LinkedIn user profiles are more valuable and produce vCard entries with an average completeness of 45%; using Facebook as
a data source this is only 34%. In summary, the aggregated profiles are thus a far better source of information to generate FOAF/vCard entries than the service-specific profiles.

[Fig. 2(a), Services and available attributes: table that marks each attribute (nickname, first name, last name, full name, profile photo, about, email, homepage, blog, location, locale settings, interests, education, affiliations, industry) against the columns vCard, FOAF, Fa, L, T, Fl, G.]

[Fig. 2(b), Completing FOAF/vCard profiles: bar chart of the completeness of FOAF/vCard profiles (0 to 1) per service (# attributes applicable to FOAF/vCard): Twitter (4/5), Flickr (4/5), Google (4/5), Facebook (6/7), LinkedIn (8/6), Aggregated (11/11). Series: completeness of vCard profiles; completeness of FOAF profiles.]

Fig. 2. FOAF/vCard profile generation: (a) services and attributes available in the public profiles of Facebook (Fa), LinkedIn (L), Twitter (T), Flickr (Fl), and Google (G) that can be applied to fill in a FOAF profile or a vCard entry and (b) completing FOAF and vCard profiles with the actual user profiles

3.4 Synopsis

Our analysis of the user profiles distributed across the different services points out several advantages of profile aggregation and motivates the intertwining of profiles on the Web. With respect to the key questions raised at the beginning of the section, the main outcomes can be summarized as follows.

1. Users fill in their public profiles at social networking services (Facebook, LinkedIn) more extensively than profiles at social media services (Flickr, Twitter), which can possibly be explained by differences in the purpose of the different systems.
2. Profile aggregation provides multi-faceted profiles that reveal significantly more information about the users than individual service profiles can provide.
3. The aggregated user profile can be used to enrich incomplete profiles in individual services, to make them more complete.
4. Service-specific profiles as well as the aggregated profiles can be applied to generate FOAF profiles and vCard entries. The aggregated profile represents the most useful profile, as it completes the FOAF profiles and vCard entries to 84% and 88%, respectively.

As user profiles distributed on the Web describe different facets of the user, profile aggregation brings some advantages: users do not have to fill in their profiles over and over again; applications can make use of more and richer facets/attributes of the user (e.g. for personalization purposes). However, our analysis also shows the risk of intertwining user profiles. For example, users who deliberately leave out some fields when filling in their Twitter profile might not be aware that the corresponding information can be gathered from other sources.
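To make the vCard generation summarized in outcome 4 concrete, the following is a minimal sketch that serializes an aggregated profile as a vCard 3.0 entry and reports how many of the considered fields are filled. The mapping from profile attributes to vCard properties, the function names, and the sample profile are assumptions for illustration; Fig. 2(a) lists which attributes are applicable but the exact mapping is not given in the text.

```python
from typing import Dict

# Assumed mapping from aggregated profile attributes to vCard 3.0 properties.
VCARD_PROPERTIES = {
    "full name": "FN",
    "nickname": "NICKNAME",
    "email": "EMAIL",
    "homepage": "URL",
    "affiliations": "ORG",
    "about": "NOTE",
}


def escape_text(value: str) -> str:
    """Escape backslash, comma, semicolon and newline (simplified: the same
    escaping is applied to every property value)."""
    for char in ("\\", ",", ";"):
        value = value.replace(char, "\\" + char)
    return value.replace("\n", "\\n")


def to_vcard(profile: Dict[str, str]) -> str:
    """Serialize the non-empty attributes of a profile as a vCard 3.0 entry."""
    lines = ["BEGIN:VCARD", "VERSION:3.0"]
    first = profile.get("first name", "")
    last = profile.get("last name", "")
    if first or last:
        # N is structured as family name; given name; and three further parts.
        lines.append("N:{};{};;;".format(escape_text(last), escape_text(first)))
    for attribute, prop in VCARD_PROPERTIES.items():
        value = profile.get(attribute)
        if value:
            lines.append("{}:{}".format(prop, escape_text(value)))
    lines.append("END:VCARD")
    return "\r\n".join(lines) + "\r\n"


def vcard_completeness(profile: Dict[str, str]) -> float:
    """Share of the considered vCard-relevant attributes that are filled."""
    considered = list(VCARD_PROPERTIES) + ["first name", "last name"]
    filled = sum(1 for attribute in considered if profile.get(attribute))
    return filled / len(considered)


# Made-up aggregated profile for illustration.
aggregated = {
    "first name": "Jane",
    "last name": "Doe",
    "full name": "Jane Doe",
    "nickname": "jdoe",
    "homepage": "http://example.org",
    "affiliations": "Example University",
}
card = to_vcard(aggregated)
```

Here six of the eight considered attributes are filled, so the sketch reports a completeness of 0.75; the percentages in the paper are computed over the attributes of Fig. 2(a), which this simplified mapping only approximates.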
Table 1. Tagging statistics of the 139 users who have an account at Flickr, StumbleUpon, and Delicious

                           Flickr   StumbleUpon   Delicious   Overall
tag assignments              3781         12747       61884     78412
distinct tags                 691          2345       11760     13212
tag assignments per user     27.2         91.71      445.21    564.12
distinct tags per user       5.22         44.42      165.83     71.82

4 User Activity Data on the Web

Most social media systems enable users to organize content with tags (freely chosen keywords). The tagging activities of a user form a valuable source of information for determining the interests of a user [12, 13]. In our analysis we examine the nature of the tag-based profiles in different systems. Again, we investigate the benefits of aggregating profile data and answer the following questions.

1. What kind of tag-based profiles do individual users have in the different systems?
2. Does the aggregation of tag-based user profiles reveal more information about the users than the profiles available in some specific service?
3. Is it possible to predict tag-based profiles in a system, based on profile data gathered from another system?

4.1 Individual Tagging Behavior across Different Systems

From the 116032 users, 139 users were randomly selected who linked their Flickr, StumbleUpon, and Delicious accounts. Table 1 lists the corresponding tagging statistics. For these users, we crawled 78412 tag assignments that were performed on the 200 latest images (Flickr) or bookmarks (Delicious and StumbleUpon). Overall, users tagged more actively in Delicious than in the other systems: more than 75% of the tagging activities originate from Delicious, 16.3% from StumbleUpon and 5% from Flickr. The usage frequency of the distinct tags shows a typical power-law distribution in all three systems, as well as in the aggregated set of tag assignments: while some tags are used very often, the majority of tags are used rarely or even just once.
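The shares quoted above follow directly from the tag-assignment counts in Table 1. The short sketch below recomputes them; the variable names are illustrative, and only the counts and the number of users are taken from the table.

```python
# Tag assignments per system and number of users, as given in Table 1.
tag_assignments = {"Flickr": 3781, "StumbleUpon": 12747, "Delicious": 61884}
users = 139

total = sum(tag_assignments.values())

# Percentage of all tag assignments that originate from each system.
shares = {system: 100 * count / total for system, count in tag_assignments.items()}

# Average number of tag assignments per user in each system and overall.
per_user = {system: count / users for system, count in tag_assignments.items()}
per_user_overall = total / users

for system in tag_assignments:
    print(f"{system}: {shares[system]:.1f}% of assignments, "
          f"{per_user[system]:.2f} per user")
```

Rounded to one decimal, the shares are 78.9% for Delicious, 16.3% for StumbleUpon and 4.8% for Flickr, which matches the "more than 75%", "16.3%" and "5%" stated in the text.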
On average, each user provided 564.12 tag assignments across the different systems. The user activity distribution corresponds to a Gaussian distribution: 26.6% of the users have fewer than 200 tag assignments, 10.1% have more than 1000 and 63.3% have between 200 and 1000 tag assignments. Interestingly, people who actively tagged in one system do not necessarily perform many tag assignments in another system. For example, none of the top 5% taggers in Flickr or StumbleUpon is also among the top 10% taggers in Delicious. This observation of unbalanced tagging behavior across different systems again reveals possible
advantages of profile aggregation for current tagging systems: given a sparse tag-based user profile, the consideration of profiles produced in other systems might be used to tackle sparsity problems.

[Fig. 3(a), Type of tags in the systems: bar chart (0% to 40%) over the categories other, communication, action, artifact, person, group, location, cognition for Flickr, Delicious, and StumbleUpon.]

[Fig. 3(b), Type of overlapping tags: bar chart (0% to 40%) over the same categories for Flickr & Delicious, Flickr & StumbleUpon, and StumbleUpon & Delicious.]

Fig. 3. Tag usage characterized with WordNet categories: (a) type of tags users apply in the different systems and (b) type of tags individual users apply in two different systems

4.2 Commonalities and Differences in Tagging Activities

In order to analyze commonalities and differences of the users’ tag-based profiles in the different systems, we mapped tags to WordNet categories and considered only those 65% of the tags for which such a mapping exists. Figure 3(a) shows that the types of tags in StumbleUpon and Delicious are quite similar, except for cognition tags (e.g., research, thinking), which are used more often in StumbleUpon than in Delicious. For both systems, most of the tags (21.9% in StumbleUpon and 18.3% in Delicious) belong to the category communication (e.g., hypertext, web). By contrast, only 4.4% of the Flickr tags refer to the field of communication; the largest share of Flickr tags (25.2%) denotes locations (e.g., Hamburg, tuscany). Action (e.g., walking), people (e.g., me), and group tags (e.g., community) as well as words referring to some artifact (e.g., bike) occur in all three systems with similar frequency. However, the concrete tags seem to be different. For example, while artifacts in Delicious refer to things like “tool” or “mobile device”, the artifact tags in Flickr describe things like “church” or “painting”.

This observation is supported by Fig. 3(b), which shows the average overlap of the individual category-specific tag profiles. On average, each user applied only 0.9% of the Flickr artifact tags also in Delicious. For Flickr and Delicious, action tags account for the largest fraction of overlapping tags. It is interesting to see that the overlap of location tags between Flickr and StumbleUpon is 31.1%, even though location tags are used very seldom in StumbleUpon (3.3%, as depicted in Fig. 3(a)). This means that if someone utilizes a location tag in StumbleUpon, it is likely that she will also use the same tag in Flickr.

Knowledge of the different (aggregated) tagging facets of a user opens the door for interesting applications. For example, a system could exploit
StumbleUpon tags referring to locations to recommend Flickr pictures even if the user’s Flickr profile is empty. In Sect. 4.4 we will present an approach that takes advantage of the faceted tag-based profiles for predicting tagging behavior.

[Fig. 4(a), Overlap of tag-based profiles: bar chart of the average overlap of tag-based profiles (0% to 30%) per service comparison: (A) StumbleUpon and (B) Delicious, (A) StumbleUpon and (B) Flickr, (A) Delicious and (B) Flickr. Series: Overlap (divided by size of smaller tag cloud); Overlap A in B (divided by size of tag cloud in service A); Overlap B in A (divided by size of tag cloud in service B).]

[Fig. 4(b), Entropy and self-information: bar chart of entropy / self-information (in bits, 0 to 9) for the tag-based profiles in Flickr, StumbleUpon, Delicious, and the aggregated profile (Flickr & StumbleUpon & Delicious). Series: entropy; self-information.]

Fig. 4. Aggregation of tag-based profiles: (a) average overlap and (b) entropy and self-information of service-specific profiles in comparison to the aggregated profiles

4.3 Aggregation of Tagging Activities

To analyze the benefits of aggregating tag-based profiles in more detail we measure the information gain, entropy and overlap of the individual profiles. Figure 4(a) describes the average overlap with respect to three different metrics: given two tag-based profiles $A$ and $B$, the overlap is (1) $\mathit{overlap} = \frac{|A \cap B|}{\min(|A|, |B|)}$, (2) $\mathit{overlap}_{A \text{ in } B} = \frac{|A \cap B|}{|A|}$, or (3) $\mathit{overlap}_{B \text{ in } A} = \frac{|A \cap B|}{|B|}$. For example, $\mathit{overlap}_{A \text{ in } B}$ denotes the percentage of tags in $A$ that also occur in $B$.

The overlap of the tag-based profiles produced in Delicious and StumbleUpon is significantly higher than the overlap of service combinations that include Flickr. However, on average, a user still just applies 6.8% of her Delicious tags also in StumbleUpon, which is approximately as high as the percentage of tags a StumbleUpon user also applies in Flickr. Overall, the tag-based user profiles do not overlap strongly.
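The three overlap metrics can be written down directly on sets of tags. The sketch below assumes that a tag-based profile is reduced to its set of distinct tags, as in the definitions above; the function names and the sample tag sets are illustrative, and returning 0.0 for an empty profile is a choice made here rather than something specified in the text.

```python
from typing import Set


def overlap(a: Set[str], b: Set[str]) -> float:
    """Shared tags divided by the size of the smaller tag cloud."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def overlap_a_in_b(a: Set[str], b: Set[str]) -> float:
    """Fraction of the tags in A that also occur in B."""
    return len(a & b) / len(a) if a else 0.0


def overlap_b_in_a(a: Set[str], b: Set[str]) -> float:
    """Fraction of the tags in B that also occur in A."""
    return len(a & b) / len(b) if b else 0.0


# Made-up tag clouds of one user in two services.
delicious = {"web", "hypertext", "research", "tool", "python"}
stumbleupon = {"web", "research", "travel"}

print(overlap(delicious, stumbleupon))
print(overlap_a_in_b(delicious, stumbleupon))
print(overlap_b_in_a(delicious, stumbleupon))
```

With two shared tags, the sketch yields 2/3 for the symmetric overlap, 2/5 for A in B and 2/3 for B in A, which illustrates why the asymmetric variants can differ considerably when the two tag clouds have different sizes.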
Hence, users reveal different facets of their profiles in the different services.

Figure 4(b) compares the averaged entropy and self-information of the tag-based profiles obtained from the different services with the aggregated profile. The entropy of a tag-based profile $T$, which consists of a set of tags $t$, is computed as follows.

\[ \mathit{entropy}(T) = \sum_{t \in T} p(t) \cdot \text{self-information}(t) \qquad (1) \]

In Equation 1, $p(t)$ denotes the probability that the tag $t$ was utilized by the corresponding user and $\text{self-information}(t) = -\log(p(t))$. In Fig. 4(b), we summarize self-information by building the average of the mean self-information of the users’ tag-based profiles. Among the service-specific profiles, the tag-based profiles in Delicious, which also have the largest size, bear the highest entropy and average self-information. By aggregating the tag-based profiles, self-information