Planetary-Scale Views on a Large Instant-Messaging Network

Jure Leskovec∗
Carnegie Mellon University
jure@cs.cmu.edu

Eric Horvitz
Microsoft Research
horvitz@microsoft.com

∗ Jure Leskovec performed this research during an internship at Microsoft Research.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21–25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

ABSTRACT
We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numbers of people, rather than the actions and characteristics of individuals. The dataset contains summary properties of 30 billion conversations among 240 million people. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We report on multiple aspects of the dataset and synthesized graph. We find that the graph is well connected and robust to node removal. We investigate on a planetary scale the oft-cited report that people are separated by “six degrees of separation” and find that the average path length among Messenger users is 6.6. We also find that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than conversations with the same gender.

Categories and Subject Descriptors: H.2.8 Database Management: Database applications – Data mining
General Terms: Measurement; Experimentation.
Keywords: Social networks; Communication networks; User demographics; Large data; Online communication.

1. INTRODUCTION
Large-scale web services provide unprecedented opportunities to capture and analyze behavioral data on a planetary scale. We discuss findings drawn from aggregations of anonymized data representing one month (June 2006) of high-level communication activities of people using the Microsoft Messenger instant-messaging (IM) network. We neither had nor sought access to the content of messages. Rather, we consider structural properties of a communication graph and study how structure and communication relate to user demographic attributes, such as gender, age, and location. The data set provides a unique lens for studying patterns of human behavior on a wide scale.

We explore a dataset of 30 billion conversations generated by 240 million distinct users over one month. We found that approximately 90 million distinct Messenger accounts were accessed each day and that these users produced about 1 billion conversations, with approximately 7 billion exchanged messages per day. 180 million of the 240 million active accounts had at least one conversation during the observation period. We found that 99% of the conversations occurred between 2 people, and the rest with greater numbers of participants. To our knowledge, our investigation represents the largest and most comprehensive study to date of presence and communications in an IM system. A recent report [6] estimated that approximately 12 billion instant messages are sent each day. Given the estimate and the growth of IM, we estimate that we captured approximately half of the world’s IM communication during the observation period.

We created an undirected communication network from the data where each user is represented by a node and an edge is placed between users if they exchanged at least one message during the month of observation. The network represents accounts that were active during June 2006.
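As an illustration of the edge rule just described, the following minimal Python sketch builds an undirected graph from pairwise records. The record layout `(user_a, user_b, messages_exchanged)` and the function names are assumptions made for this example; they are not the format of the Messenger logs, and the sketch ignores the scale of the real data.

```python
from collections import defaultdict


def build_communication_graph(pair_records):
    """Return an undirected graph as {user: set_of_neighbors}.

    pair_records: iterable of (user_a, user_b, messages_exchanged) tuples.
    This layout is hypothetical and chosen only for the example.
    """
    graph = defaultdict(set)
    for user_a, user_b, n_messages in pair_records:
        # An edge requires at least one exchanged message; skip self-pairs.
        if user_a == user_b or n_messages < 1:
            continue
        graph[user_a].add(user_b)
        graph[user_b].add(user_a)
    return dict(graph)


def count_edges(graph):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


records = [
    ("u1", "u2", 5),
    ("u2", "u1", 3),  # same pair again: still a single undirected edge
    ("u2", "u3", 1),
    ("u3", "u4", 0),  # no message exchanged: no edge, u4 stays out
]
graph = build_communication_graph(records)
print(len(graph), count_edges(graph))
```

Only users with at least one exchanged message become nodes in this sketch, which mirrors the distinction the text draws between active accounts and all accounts.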
In summary, the communication graph has 180 million nodes, representing users who participated in at least one conversation, and 1.3 billion undirected edges among active users, where an edge indicates that a pair of people communicated. We note that this graph should be distinguished from a buddy graph, where two people are connected if they appear on each other’s contact lists. The buddy graph for the data contains 240 million nodes and 9.1 billion edges. On average, each account has approximately 50 buddies on a contact list.

To highlight several of our key findings, we discovered that the communication network is well connected, with 99.9% of the nodes belonging to the largest connected component. We evaluated the oft-cited finding by Travers and Milgram that any two people are linked to one another on average via a chain with “6-degrees-of-separation” [17]. We found that the average shortest path length in the Messenger network is 6.6 (median 6), which is half a link more than the path length measured in the classic study. However, we also found that longer paths exist in the graph, with lengths up to 29. We observed that the network is well clustered, with a clustering coefficient [19] that decays with exponent −0.37. This decay is significantly lower than the value we had expected given prior research [11]. We found strong homophily [9, 12] among users; people have more conversations and converse for longer durations with people who are similar to themselves. We find the strongest homophily for the language used, followed by conversants’ geographic locations, and then age. We found that homophily does not hold for gender; people tend to converse more frequently and with longer durations with the opposite gender. We also examined the relation between communication and distance, and found that the number of conversations tends to decrease with increasing geographical distance between conversants. However, communication links spanning longer distances tend to carry more and longer conversations.

2. INSTANT MESSAGING
The use of IM has become widespread in personal and business communications. IM clients allow users fast, near-synchronous communication, placing IM between synchronous communication mediums, such as real-time voice interactions, and asynchronous communication mediums like email [18]. IM users exchange short text messages with one or more users from their list of contacts, who have to be online and logged into the IM system at the time of interaction. As conversations and the messages exchanged within them are usually very short, it has been observed that users employ informal language, loose grammar, numerous abbreviations, and minimal punctuation [10]. Contact lists are commonly referred to as buddy lists and users on the lists are referred to as buddies.

2.1 Research on Instant Messaging
Several studies on smaller datasets are related to this work. Avrahami and Hudson [3] explored communication characteristics of 16 IM users. Similarly, Shi et al. [13] analyzed IM contact lists submitted by users to a public website and explored a static contact network of 140,000 people. Recently, Xiao et al. [20] investigated IM traffic characteristics within a large organization with 400 users of Messenger. Our study differs from the latter study in that we analyze the full Messenger population over a one-month period, capturing the interaction of user demographic attributes, communication patterns, and network structure.
2.2 Data description
To construct the Microsoft Instant Messenger communication dataset, we combined three different sources of data: (1) user demographic information, (2) time- and user-stamped events describing the presence of a particular user, and (3) communication session logs, where, for all participants, the number of exchanged messages and the periods of time spent participating in sessions are recorded.

We use the terms session and conversation interchangeably to refer to an IM interaction among two or more people. Although the Messenger system limits the number of people communicating at the same time to 20, people can enter and leave a conversation over time, so large sessions can be long, with many different people participating. We observed some very long sessions with more than 50 participants joining over time.

All of our data was anonymized; we had no access to personally identifiable information. Also, we had no access to the text of the messages exchanged or any other information that could be used to uniquely identify users. We focused on analyzing high-level characteristics and patterns that emerge from the collective dynamics of 240 million people, rather than the actions and characteristics of individuals. The analyzed data can be split into three parts: presence data, communication data, and user demographic information:

• Presence events: These include login, logout, first ever login, add, remove and block a buddy, add unregistered buddy (invite new user), and change of status (busy, away, be-right-back, idle, etc.). Events are user and time stamped.
• Communication: For each user participating in the session, the log contains the following tuple: session id, user id, time joined the session, time left the session, number of messages sent, number of messages received.
• User data: For each user, the following self-reported information is stored: age, gender, location (country, ZIP), language, and IP address. We use the IP address to decode the geographical coordinates, which we then use to position users on the globe and to calculate distances.

Figure 1: Distribution of the number of events per user. (a) Number of logins per user (fitted exponent γ = 3.6; marked spikes: login every 20 minutes, login every 15 seconds). (b) Number of buddies added per user (fitted exponent γ = 2.2). Axes: number of events per user vs. count.

We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parsed the data in parallel, collapsed the session join/leave events into sets of conversations, and saved the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day.

A special challenge was to account for missing and dropped events, and session “id recycling” across different IM servers in a server farm. As part of this process, we closed a session 48 hours after the last leave session event. We closed sessions automatically if only one user was left in the conversation.

3. USAGE & POPULATION STATISTICS
We shall first review several statistics drawn from aggregations of users and their communication activities.

3.1 Levels of activity
Over the observation period, 242,720,596 users logged into Messenger and 179,792,538 of these users were actively engaged in conversations by sending or receiving at least one IM message. Over the month of observation, 17,510,905 new accounts were activated. As a representative day, on June 1, 2006, there were almost 1 billion (982,005,323) different sessions (conversations among any number of people), with more than 7 billion IM messages sent. Approximately 93 million users logged in, and 64 million different users engaged in conversations on that day. Approximately 1.5 million new users that were not registered within Microsoft Messenger were invited to join on that particular day.
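The session-closing rules stated in Section 2.2 (close a session once only one user is left, or 48 hours after the last leave event, and tolerate recycled session ids) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the event layout `(time, session_id, user_id, kind)`, the names `OpenSession` and `collapse_events`, and the choice to end a timed-out session at its last leave event are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

TIMEOUT_SECONDS = 48 * 3600  # 48 hours after the last leave event


@dataclass
class OpenSession:
    start: float
    last_event: float
    last_leave: Optional[float] = None
    active: set = field(default_factory=set)        # users currently in the session
    participants: set = field(default_factory=set)  # everyone who ever joined


def collapse_events(events):
    """Collapse join/leave events into (session_id, start, end, participants).

    events: iterable of (time, session_id, user_id, kind), kind in {"join", "leave"}.
    The layout is an assumption made for this example.
    """
    open_sessions = {}
    conversations = []

    def close(session_id, end_time):
        session = open_sessions.pop(session_id)
        if len(session.participants) >= 2:  # a conversation needs two or more people
            conversations.append(
                (session_id, session.start, end_time, frozenset(session.participants))
            )

    for t, session_id, user_id, kind in sorted(events):
        session = open_sessions.get(session_id)
        # Timeout rule: treat a stale session as closed, so a reused id starts afresh.
        if (session is not None and session.last_leave is not None
                and t - session.last_leave > TIMEOUT_SECONDS):
            close(session_id, session.last_leave)
            session = None
        if kind == "join":
            if session is None:
                session = open_sessions[session_id] = OpenSession(start=t, last_event=t)
            session.active.add(user_id)
            session.participants.add(user_id)
            session.last_event = t
        elif kind == "leave" and session is not None and user_id in session.active:
            session.active.discard(user_id)
            session.last_leave = session.last_event = t
            if len(session.active) <= 1:  # only one user left: close automatically
                close(session_id, t)
        # Leave events for unknown sessions (dropped joins) are ignored.

    for session_id in list(open_sessions):  # flush sessions still open at the end
        close(session_id, open_sessions[session_id].last_event)
    return conversations


events = [
    (0, "s1", "a", "join"),
    (5, "s1", "b", "join"),
    (65, "s1", "a", "leave"),    # only b is left, so the session closes here
    (70, "s1", "b", "leave"),    # ignored: the session is already closed
    (1000, "s1", "c", "join"),   # recycled id starts a new conversation
    (1010, "s1", "d", "join"),
    (1100, "s1", "d", "leave"),
]
conversations = collapse_events(events)
```

In this toy log the id `s1` yields two separate conversations, which is the behavior the id-recycling remark calls for.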
Figure 2: (a) Distribution of the number of people participating in a conversation (fit ∝ x^−3.5). (b) Distribution of the durations of conversations (fit ∝ x^−3.67). The spread of durations can be described by a power-law distribution. Axes: number of users per session vs. count (a); conversation duration vs. count (b).

Figure 3: (a) Distribution of login duration (fit: 9.7e5 x^−1.77, R² = 1.00). (b) Duration of times when people are not logged into the system, i.e., times between logout and login (fit: 6.9e5 x^−1.34, R² = 1.00).

We consider event distributions on a per-user basis in Figure 1. The number of logins per user, displayed in Figure 1(a), follows a heavy-tailed distribution with exponent 3.6. We note spikes in logins at 20 minute and 15 second intervals, which correspond to an auto-login function of the IM client. As shown in Figure 1(b), many users fill up their contact lists rather quickly. The spike at 600 buddies undoubtedly reflects the maximal allowed length of contact lists.

Figure 2(a) displays the number of users per session. In Messenger, multiple people can participate in conversations. We observe a peak at 20 users, the limit on the number of people who can participate simultaneously in a conversation. Figure 2(b) shows the distribution over the session durations, which can be modeled by a power-law distribution with exponent 3.6.

Next, we examine the distribution of the durations of periods of time when people are logged on to the system. Let $(ti_j, to_j)$ denote a time-ordered ($ti_j < to_j < ti_{j+1}$) sequence of online and offline times of a user, where $ti_j$ is the time of the $j$th login and $to_j$ is the corresponding logout time. Figure 3(a) plots the distribution of $to_j - ti_j$ over all $j$ and over all users.
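To make the notation concrete, here is a minimal Python sketch of how the two quantities plotted in Figure 3 could be derived for one user from a time-ordered list of (login, logout) pairs. The function name and the list-of-pairs input are assumptions made for this example.

```python
def online_offline_durations(sessions):
    """Compute per-user online durations and offline gaps.

    sessions: list of (login_time, logout_time) pairs for one user, ordered so
    that login_j < logout_j < login_{j+1}. The layout is hypothetical.
    """
    # Online duration of the j-th login: to_j - ti_j.
    online = [logout - login for login, logout in sessions]
    # Offline gap between consecutive logins: ti_{j+1} - to_j.
    offline = [
        next_login - logout
        for (_, logout), (next_login, _) in zip(sessions, sessions[1:])
    ]
    return online, offline


user_sessions = [(0, 10), (25, 40), (100, 130)]
online, offline = online_offline_durations(user_sessions)
print(online, offline)
```

Pooling these lists over all users gives the samples whose distributions are shown in the two panels.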
Similarly, Figure 3(b) shows the distribution of the periods of time when users are logged off, i.e., $ti_{j+1} - to_j$ over all $j$ and over all users. Fitting the data to power-law distributions reveals exponents of 1.77 and 1.3, respectively. The data show that durations of being online tend to be shorter and decay faster than durations that users are offline. We also notice periodic effects of login durations of 12, 24, and 48 hours, reflecting daily periodicities. We observe similar periodicities for logout durations at multiples of 24 hours.

3.2 Demographic characteristics of the users
We compared the demographic characteristics of the Messenger population with 2005 world census data and found differences between the statistics for age and gender. The visualization of this comparison displayed in Figure 4 shows that users with reported ages in the 15–35 span of years are strongly overrepresented in the active Messenger population. Focusing on the differences by gender, females are overrepresented for the 10–14 age interval. For male users, we see overall matches with the world population for age spans 10–14 and 35–39; for women users, we see a match for ages in the span of 30–34. We note that 6.5% of the population did not submit an age when creating their Messenger accounts.

Figure 4: World and Messenger user population age pyramid. Ages 15–30 are overrepresented in the Messenger population. Axes: proportion of the population (female and male) vs. age in five-year bins from 0–4 to 100+; series: world population and MSN population.

Figure 5: Temporal characteristics of conversations. (a) Average conversation duration per user, in minutes (fit: 1.5e11 x^−3.70, R² = 0.99); (b) time between conversations of users, in minutes (fit: 3.9e9 x^−1.53, R² = 0.99; marked at 1 day, 2 days, 3 days).

4. COMMUNICATION CHARACTERISTICS
We now focus on characteristics and patterns of communications. We limit the analysis to conversations between two participants, which account for 99% of all conversations.

We first examine the distributions over conversation durations and times between conversations. Let user $u$ have $C$ conversations in the observation period. Then, for every conversation $i$ of user $u$ we create a tuple $(ts_{u,i}, te_{u,i}, m_{u,i})$, where $ts_{u,i}$ denotes the start time of the conversation, $te_{u,i}$ is the end time of the conversation, and $m_{u,i}$ is the number of exchanged messages between the two users. We order the conversations by their start time ($ts_{u,i} < ts_{u,i+1}$). Then, for every user $u$, we calculate the average conversation duration $\bar{d}(u) = \frac{1}{C} \sum_{i} (te_{u,i} - ts_{u,i})$, where the sum goes over all of $u$’s conversations. Figure 5(a) shows the distribution of $\bar{d}(u)$ over all the users $u$. We find that the conversation length can be described by a heavy-tailed distribution with exponent −3.7 and a mode of 4 minutes.

Figure 5(b) shows the intervals between consecutive conversations of a user. We plot the distribution of $ts_{u,i+1} - ts_{u,i}$, where $ts_{u,i+1}$ and $ts_{u,i}$ denote start times of two consecutive conversations of user $u$. The power-law exponent of the distribution over intervals is −1.5. This result is similar to the temporal distribution for other kinds of human communication activities, e.g., waiting times of emails and letters before a reply is generated [4]. The exponent can be explained by a priority-queue model where tasks of different
priorities arrive and wait until all tasks with higher priority are addressed. This model generates a task waiting time distribution described by a power-law with exponent −1.5.

Figure 6: Communication characteristics of users by reported age. (a) Number of conversations; (b) conversation duration; (c) messages per conversation; (d) messages per unit time. We plot age vs. age and the color (z-axis) represents the intensity of communication.

5. COMMUNICATION DEMOGRAPHICS

Next we examine the interplay of communication and user demographic attributes, i.e., how geography, location, age, and gender influence observed communication patterns.

5.1 Communication by age

We sought to understand how communication among people changes with the reported ages of participating users. Figures 6(a)-(d) use a heat-map visualization to communicate properties for different age–age pairs. The rows and columns represent the ages of both parties participating, and the color at each age–age cell captures the logarithm of the value for the pairing. The color spectrum extends from blue (low value) through green, yellow, and on to red (the highest value). Because of potential misreporting at very low and high ages, we concentrate on users with self-reported ages that fall between 10 and 60 years.

Let a tuple $(a_i, b_i, d_i, m_i)$ denote the ith conversation in the entire dataset that occurred among users of ages $a_i$ and $b_i$. The conversation had a duration of $d_i$ seconds during which $m_i$ messages were exchanged. Let $C_{a,b} = \{(a_i, b_i, d_i, m_i) : a_i = a \wedge b_i = b\}$ denote a set of all conversations between users of ages a and b, respectively.

Figure 6(a) shows the number of conversations among people of different ages.
For every pair of ages (a, b) the color indicates the size of set $C_{a,b}$, i.e., the number of different conversations between users of ages a and b. We note that, as the notion of a conversation is symmetric, the plots are symmetric. Most conversations occur between people of ages 10 to 20. The diagonal trend indicates that people tend to talk to people of similar age. This is true especially for age groups between 10 and 30 years. We shall explore this observation in more detail in Section 6.

Figure 6(b) displays a heat map for the average conversation duration, computed as $\frac{1}{|C_{a,b}|} \sum_{i \in C_{a,b}} d_i$. We note that older people tend to have longer conversations. We observe a similar phenomenon when plotting the average number of exchanged messages per conversation, computed as $\frac{1}{|C_{a,b}|} \sum_{i \in C_{a,b}} m_i$, displayed in Figure 6(c). Again, we find that older people exchange more messages, and we observe a dip for ages 25–45 and a slight peak for ages 15–25. Figure 6(d) displays the number of exchanged messages per unit time; for each age pair, (a, b), we measure $\frac{1}{|C_{a,b}|} \sum_{i \in C_{a,b}} \frac{m_i}{d_i}$. Here, we see that younger people have faster-paced dialogs, while older people exchange messages at a slower pace.

Table 1: Cross-gender communication, based on all two-person conversations during June 2006. (a) Percentage of conversations among users of different self-reported gender; (b) average conversation length in seconds; (c) number of exchanged messages per conversation; (d) number of exchanged messages per minute of conversation. U = unknown (unreported), F = female, M = male; only the upper triangle of each symmetric table is listed.

(a)       U      F      M
    U   1.3    3.6    3.7
    F         21.3   49.9
    M                20.2

(b)       U      F      M
    U   277    301    277
    F          275    304
    M                 252

(c)       U      F      M
    U   5.7    7.1    6.7
    F          6.6    7.6
    M                 5.9

(d)       U      F      M
    U  1.25   1.42   1.38
    F         1.43   1.50
    M                1.42

We note that the younger population (ages 10–35) are strongly biased towards communicating with people of a similar age (diagonal trend in Figure 6(a)), and that users who report being of ages 35 years and above tend to communicate more evenly across ages (rectangular pattern in Fig. 6(a)).
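The four per-cell quantities plotted in Figure 6 can be illustrated with a minimal sketch, assuming conversations are available as plain `(a, b, d, m)` tuples as defined above. The function name `age_pair_stats`, the dictionary keys, and the choice to skip conversations with non-positive duration are assumptions made for this illustration and are not taken from the paper.

```python
from collections import defaultdict


def age_pair_stats(conversations):
    """Aggregate (a, b, d, m) tuples into per-age-pair statistics.

    a, b: reported ages of the two participants
    d:    conversation duration in seconds
    m:    number of exchanged messages
    """
    # Per cell: [count, sum of d, sum of m, sum of m/d]
    acc = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for a, b, d, m in conversations:
        if d <= 0:
            # Assumption: m/d is undefined for zero-length conversations,
            # so this sketch skips them entirely.
            continue
        # A conversation is symmetric, so it is counted in both (a, b) and
        # (b, a); the set avoids double counting when a == b.
        for key in {(a, b), (b, a)}:
            cell = acc[key]
            cell[0] += 1
            cell[1] += d
            cell[2] += m
            cell[3] += m / d
    return {
        key: {
            "count": n,                 # |C_{a,b}|, Figure 6(a)
            "mean_duration": sd / n,    # Figure 6(b)
            "mean_messages": sm / n,    # Figure 6(c)
            "mean_rate": sr / n,        # Figure 6(d), messages per second
        }
        for key, (n, sd, sm, sr) in acc.items()
    }
```

Note that `mean_rate` averages the per-conversation ratios $m_i / d_i$, matching the formula for Figure 6(d), rather than dividing total messages by total duration. The heat maps additionally show the logarithm of each value, which this sketch leaves to the plotting step.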
Moreover, older people have conversations of the longest durations, with a “valley” in the duration of conversations for users of ages 25–35. Such a dip may represent shorter, faster-paced and more intensive conversations associated with work-related communications, versus more extended, slower, and longer interactions associated with social discourse.

5.2 Communication by gender

We report on analyses of properties of pairwise communications as a function of the self-reported gender of users in conversations in Table 1. Let $C_{g,h} = \{(g_i, h_i, d_i, m_i) : g_i = g \wedge h_i = h\}$ denote a set of conversations where the two participating users are of genders g and h. Note that g and h each take 3 possible values: female, male, and unknown (unreported).

Table 1(a) relays $|C_{g,h}|$, as a percentage of all conversations, for combinations of genders g and h. The table shows that approximately 50% of conversations occur between male and female and 40% of the conversations occur among users of the same gender (20% for each). A small number of conversations occur between people who did not reveal their gender.

Similarly, Table 1(b) shows the average conversation length in seconds, broken down by the gender of the conversants, computed as $\frac{1}{|C_{g,h}|} \sum_{i \in C_{g,h}} d_i$. We find that male–male conversations tend to be shortest, lasting approximately 4 minutes. Female–female conversations last 4.5 minutes on the average. Female–male conversations have the longest durations, taking more than 5 minutes on average. Beyond taking place over longer periods of time, more messages are exchanged in female–male conversations. Table 1(c) lists
values for $\frac{1}{|C_{g,h}|} \sum_{i \in C_{g,h}} m_i$ and shows that, in female–male conversations, 7.6 messages are exchanged per conversation on the average as opposed to 6.6 and 5.9 for female–female and male–male, respectively. Table 1(d) shows the communication intensity computed as $\frac{1}{|C_{g,h}|} \sum_{i \in C_{g,h}} \frac{m_i}{d_i}$. The number of messages exchanged per minute of conversation for female–male conversations is higher, at 1.5 messages per minute, than for same-gender conversations, where the rate is approximately 1.43 messages per minute.

We examined the number of communication ties, where a tie is established between two people when they exchange at least one message during the observation period. We computed 300 million male–male ties, 255 million female–female ties, and 640 million cross-gender ties. The Messenger population consists of 100 million males and 80 million females by self report. These findings demonstrate that ties are not heavily gender biased; based on the population, random chance predicts 31% male–male, 20% female–female, and 49% female–male links. We observe 25% male–male, 21% female–female, and 54% cross-gender links, thus demonstrating a minor bias toward female–male links.

The results reported in Table 1 run counter to prior studies reporting that communication among individuals who resemble one another (same gender) occurs more often (see [9] and references therein). We identified significant heterophily, where people tend to communicate more with people of the opposite gender. However, we note that link heterogeneity was very close to the population value [8], i.e., the number of same- and cross-gender ties roughly corresponds to random chance. This shows there is no significant bias in linking for gender. However, we observe that cross-gender conversations tend to be longer and to include more messages, suggesting that more effort is devoted to conversations with the opposite sex.
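The random-chance percentages quoted above can be reproduced with a short sketch. It assumes, as one plausible reading of "random chance", that each endpoint of a tie is drawn independently in proportion to the gender shares of the population; the function names are hypothetical and chosen only for this illustration.

```python
def expected_tie_shares(n_male, n_female):
    """Expected tie shares if both endpoints are drawn independently
    in proportion to the population (an assumed null model)."""
    total = n_male + n_female
    p_m = n_male / total
    p_f = n_female / total
    return {
        "male-male": p_m * p_m,
        "female-female": p_f * p_f,
        # Either endpoint may be the male one, hence the factor of 2.
        "female-male": 2 * p_m * p_f,
    }


def observed_tie_shares(mm, ff, fm):
    """Observed tie shares from raw tie counts."""
    total = mm + ff + fm
    return {
        "male-male": mm / total,
        "female-female": ff / total,
        "female-male": fm / total,
    }


# Figures reported in the text (in millions).
expected = expected_tie_shares(n_male=100, n_female=80)
observed = observed_tie_shares(mm=300, ff=255, fm=640)

for kind in expected:
    print(f"{kind}: expected {expected[kind]:.0%}, observed {observed[kind]:.0%}")
```

Under this assumption the expected shares round to 31%, 20%, and 49%, and the observed shares to 25%, 21%, and 54%, matching the values in the text.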
5.3 World geography and communication

We now focus on the influence of geography and distance among participants on communications. Figure 7 shows the geographical locations of Messenger users. The general location of the user was obtained via reverse IP lookup. We plot all latitude/longitude positions linked to the position of servers where users log into the service. The color of each dot corresponds to the logarithm of the number of logins from the respective location, again using a spectrum of colors ranging from blue (low) through green and yellow to red (high). Although the maps are built solely by plotting these positions, a recognizable world map is generated. We find that North America, Europe, and Japan are very dense, with many users from those regions using Messenger. For the rest of the world, the population of Messenger users appears to reside largely in coastal regions.

We can condition the densities and behaviors of Messenger users on multiple geographical and socioeconomic variables and explore relationships between electronic communications and other attributes. As an example, we harnessed the United Nations gridded world population data to provide estimates of the number of people living in each cell. Given this data, and the data from Figure 7, we calculate the number of users per capita, displayed in Figure 8. Now we see a transformed picture where several sparsely populated regions stand out as having a high usage per capita. These regions include the center of the United States, Canada, Scandinavia, Ireland, Australia, and South Korea.

Figure 7: Number of users at a particular geographic location. Color of dots represents the number of users.

Figure 8: Number of Messenger users per capita. Color intensity corresponds to the number of users per capita in the cell of the grid.

Figure 9 shows a heat map that represents the intensities of Messenger communications on an international scale.
To create this map, we place the world map on a fine grid, where each cell of the grid contains the count of the number of conversations that pass through that point; for each conversation, we increase the count of all cells on the straight line between the geolocations of the pair of conversants. The color indicates the number of conversations crossing each point, providing a visualization of the key flows of communication. For example, Australia and New Zealand have communications flowing towards Europe and the United States. Similar flows hold for Japan. We see that Brazilian communications are weighted toward Europe and Asia. We can also explore the flows of transatlantic and US transcontinental communications.

5.4 Communication among countries

Communication among people within different countries also varies depending on the locations of conversants. We examine two such views. Figure 10(a) shows the top countries by the number of conversations between pairs of countries. We examined all pairs of countries with more than 10 million conversations per month. The width of edges in the figure is proportional to the logarithm of the number of conversations among the countries. We find that the United States and Spain appear to serve as hubs and that edges appear largely between historically or ethnically connected countries. As examples, Spain is connected with the Spanish-speaking countries in South America, Germany links to Turkey, Portugal to Brazil, and China to Korea.

Figure 10(b) displays a similar plot where we consider country pairs by the average duration of conversations. The