Biological Technological Informati Social 9目 8 器三刻含到目器到图冒图器目 a二 园正己别B园总却 9s9s三s抄÷5引s 含 aNe 。 自到自 g88 与装旨
182 M. E. J. NEWMAN Table 3.1 Basic statistics for a number of published networks. The properties measured are as follows: total number of vertices n; total number of edges m; mean degree z; mean vertex–vertex distance ; type of graph, directed or undirected; exponent α of degree distribution if the distribution follows a power law (or “–” if not; in/out-degree exponents are given for directed graphs); clustering coefficient C(1) from (3.3); clustering coefficient C(2) from (3.6); degree correlation coefficient r, section 3.6. The last column gives the citation for the network in the bibliography. Blank entries indicate unavailable data. Network Type n m z α C(1) C(2) r Ref(s). Social film actors undirected 449 913 25 516 482 113.43 3.48 2.3 0.20 0.78 0.208 [20, 415] company directors undirected 7 673 55 392 14.44 4.60 – 0.59 0.88 0.276 [105, 322] math coauthorship undirected 253 339 496 489 3.92 7.57 – 0.15 0.34 0.120 [107, 181] physics coauthorship undirected 52 909 245 300 9.27 6.19 – 0.45 0.56 0.363 [310, 312] biology coauthorship undirected 1 520251 11 803 064 15.53 4.92 – 0.088 0.60 0.127 [310, 312] telephone call graph undirected 47 000 000 80 000 000 3.16 2.1 [8, 9] email messages directed 59 912 86 300 1.44 4.95 1.5/2.0 0.16 [136] email address books directed 16 881 57 029 3.38 5.22 – 0.17 0.13 0.092 [320] student relationships undirected 573 477 1.66 16.01 – 0.005 0.001 −0.029 [45] sexual contacts undirected 2 810 3.2 [264, 265] InformationWWW nd.edu directed 269 504 1 497 135 5.55 11.27 2.1/2.4 0.11 0.29 −0.067 [14, 34] WWW Altavista directed 203 549 046 2 130 000 000 10.46 16.18 2.1/2.7 [74] citation network directed 783 339 6 716 198 8.57 3.0/– [350] Roget’s Thesaurus directed 1 0 22 5 10 3 4.99 4.87 – 0.13 0.15 0.157 [243] word co-occurrence undirected 460902 17 000 000 70.13 2.7 0.44 [119, 157] Technological Internet undirected 10697 31 992 5.98 3.31 2.5 0.035 0.39 −0.189 [86, 148] power grid undirected 4 941 6 594 2.67 18.99 – 0.10 0.080 −0.003 [415] train routes undirected 587 19 603 66.79 2.16 – 0.69 −0.033 [365] software packages directed 1 439 1 723 1.20 2.42 1.6/1.4 0.070 0.082 −0.016 [317] software classes directed 1 377 2 213 1.61 1.51 – 0.033 0.012 −0.119 [394] electronic circuits undirected 24 097 53 248 4.34 11.05 3.0 0.010 0.030 −0.154 [155] peer-to-peer network undirected 880 1 296 1.47 4.28 2.1 0.012 0.011 −0.366 [6, 353] Biologicalmetabolic network undirected 765 3 686 9.64 2.56 2.2 0.090 0.67 −0.240 [213] protein interactions undirected 2 115 2 240 2.12 6.80 2.4 0.072 0.071 −0.156 [211] marine food web directed 135 598 4.43 2.05 – 0.16 0.23 −0.263 [203] freshwater food web directed 92 997 10.84 1.90 – 0.40 0.48 −0.326 [271] neural network directed 307 2 359 7.68 3.97 – 0.18 0.28 −0.226 [415, 420]
THE STRUC TURE AND FUNCTION OF COMPLEX NETWORKS 183 ig. 3.1 Illustration of the definition of the clustering coefficient C, (3.3). This network has one iangle and eight connected triples, and therefore has a clustering coeficient of3xi=3 The individual vertices have local clustering coefficients, Eq.(3.5), of1,1,6,0, and O, fo a mean value, (3.6), ofC=I of legs of a journey for an air or train traveler, the time it takes for a disease to spread throughout a population, and so forth. The small-world effect also underlies some well-known parlor games, particularly the calculation of Erdos numbers [107 and Bacon numbers. 10 On the other hand, the small-world effect is also mathematically obvious. If the number of vertices within a distance r of a typical central vertex grows expo- nentially with r-and this is true of many networks, including the random graph 4.1)-then the value of e will in world effect" has thus taken on a more precise meaning: networks are said to show the small-world effect if the value of e scales logarithmically or slower with network size for fixed mean degree. Logarithmic scaling can be proved for a variety of network models 61, 63, 88, 127, 164] and has also been observed in various real-world networks [13, 311, 312. Some networks have mean vertex-vertex distances that in- crease slower than log n. Bollobas and Riordan [64 have shown that networks with power-law degree distributions(section 3. 3) have values of e that increase no faster than log n/loglog n(see also [164 ) and Cohen and Havlin [95 have given arguments that suggest that the actual variation may be slower even than this 3.2. Transitivity or Clustering. A clear deviation from the behavior of the ran- dom graph can be seen in the property of network transitivity, sometimes also called clustering, although the latter term also has another meaning in the study of net- orks(see section 3. 7) and so can be confusing. In many networks it is found that vertex A is connected to vertex B and vertex b to vertex c. then there is a heightened probability that vertex a will also be connected to vertex C. In the language of social networks, the friend of your friend is likely also to be your friend. In terms of network topology, transitivity means the presence of a heightened number of triangles in the network-sets of three vertices each of which is connected to each of the others. It can be quantified by defining a clustering coefficient C thus C 3x number of triangles in the network umber of connected triples of vertices where a"connected triple"means a single vertex with edges running to an unordered pair of others(see Figure 3.1) effect, C measures the fraction of triples that have their third edge filled in o complete the triangle. The factor of three in the numerator accounts for the fact that each triangle contributes to three triples and ensures that C lies in the range 10http://www.cs.virginiaedu/oracle/
THE STRUCTURE AND FUNCTION OF COMPLEX NETWORKS 183 Fig. 3.1 Illustration of the definition of the clustering coefficient C, (3.3). This network has one triangle and eight connected triples, and therefore has a clustering coefficient of 3× 1 8 = 3 8 . The individual vertices have local clustering coefficients, Eq. (3.5), of 1, 1, 1 6 , 0, and 0, for a mean value, (3.6), of C = 13 30 . of legs of a journey for an air or train traveler, the time it takes for a disease to spread throughout a population, and so forth. The small-world effect also underlies some well-known parlor games, particularly the calculation of Erd˝os numbers [107] and Bacon numbers.10 On the other hand, the small-world effect is also mathematically obvious. If the number of vertices within a distance r of a typical central vertex grows exponentially with r—and this is true of many networks, including the random graph (section 4.1)—then the value of will increase as log n. In recent years the term “small-world effect” has thus taken on a more precise meaning: networks are said to show the small-world effect if the value of scales logarithmically or slower with network size for fixed mean degree. Logarithmic scaling can be proved for a variety of network models [61, 63, 88, 127, 164] and has also been observed in various real-world networks [13, 311, 312]. Some networks have mean vertex–vertex distances that increase slower than log n. Bollob´as and Riordan [64] have shown that networks with power-law degree distributions (section 3.3) have values of that increase no faster than log n/ log log n (see also [164]), and Cohen and Havlin [95] have given arguments that suggest that the actual variation may be slower even than this. 3.2. Transitivity or Clustering. A clear deviation from the behavior of the random graph can be seen in the property of network transitivity, sometimes also called clustering, although the latter term also has another meaning in the study of networks (see section 3.7) and so can be confusing. In many networks it is found that if vertex A is connected to vertex B and vertex B to vertex C, then there is a heightened probability that vertex A will also be connected to vertex C. In the language of social networks, the friend of your friend is likely also to be your friend. In terms of network topology, transitivity means the presence of a heightened number of triangles in the network—sets of three vertices each of which is connected to each of the others. It can be quantified by defining a clustering coefficient C thus: (3.3) C = 3× number of triangles in the network number of connected triples of vertices, where a “connected triple” means a single vertex with edges running to an unordered pair of others (see Figure 3.1). In effect, C measures the fraction of triples that have their third edge filled in to complete the triangle. The factor of three in the numerator accounts for the fact that each triangle contributes to three triples and ensures that C lies in the range 10http://www.cs.virginia.edu/oracle/
0<C<I. In simple terms, C is the mean probability that two vertices that are network neighbors of the same other vertex will themselves be neighbors. It can also be written in the form ne network (3.4) C= 6x number of triangles number of paths of h two where a path of length two refers to a directed path starting from a specified vertex This definition shows that C is also the mean probability that the friend of your friend The definition of C given here has been widely used in the sociology literature, where it is referred to as the "fraction of transitive triples. "1 In the mathematical and physical literature it seems to have been first discussed by Barrat and Weigt [40 An alternative definition of the clustering coefficient, also widely used, has been given by Watts and Strogatz [415, who proposed defining a local value (35) C- number of triangles connected to vertex i number of triples centered on vertex i For vertices with degree 0 or 1, for which both numerator and denominator are zero we put Ci=0. Then the clustering coefficient for the whole network is the average This definition effectively reverses the order of the operations of taking the ratio of triangles to triples and of averaging over vertices--one here calculates the mean of the ratio, rather than the ratio of the means. It tends to weight the contributions of low-degree vertices more heavily, because such vertices have a small denominator in 3.5) and hence can give quite different results from(3.3). In Table 3.1 we give both measures for a number of networks(denoted C() and C()in the table). Normally our first definition (3. 3)is easier to calculate analytically, but (3.6)is easily calculated on a computer and has found wide use in numerical studies and data analysis. It important when reading (or writing) literature in this area to be clear about which definition of the clustering coefficient use. The difference between the two is The local clustering Ci above has been used quite widely in its own right in the sociological literature, where it is referred to as the"network density"[362. Its dependence on the degree ki of the central vertex i has been studied by Dorogovtsev, Goltsev, and Mendes[113 and Szabo, Alava, and Kertesz [ 388: both groups found that Ci falls off with ki approximately as k-I for certain models of scale-free networks section 3. 3. 1). Similar behavior has also been observed empirically in real-world networks 348, 349, 396 In general, regardless of which definition of the clustering coefficient is used, the alues tend to be considerably higher than for a random graph with a similar number of vertices and edges. Indeed, it is suspected that for many types of networks the probability that the friend of your friend is also your friend should tend to a nonzero limit as the network becomes large, so that C=O(1)as n,oo. 2 On the random For example, the standard network analysis program UCInet includes a function to calculate this quantity for any network. 12 An exception is scale-free networks with Ci N,, as described above. For such networks(3. 3) tends to zero as n-00, although( 3. 6)is still finite
184 M. E. J. NEWMAN 0 ≤ C ≤ 1. In simple terms, C is the mean probability that two vertices that are network neighbors of the same other vertex will themselves be neighbors. It can also be written in the form (3.4) C = 6× number of triangles in the network number of paths of length two , where a path of length two refers to a directed path starting from a specified vertex. This definition shows that C is also the mean probability that the friend of your friend is also your friend. The definition of C given here has been widely used in the sociology literature, where it is referred to as the “fraction of transitive triples.”11 In the mathematical and physical literature it seems to have been first discussed by Barrat and Weigt [40]. An alternative definition of the clustering coefficient, also widely used, has been given by Watts and Strogatz [415], who proposed defining a local value (3.5) Ci = number of triangles connected to vertex i number of triples centered on vertex i . For vertices with degree 0 or 1, for which both numerator and denominator are zero, we put Ci = 0. Then the clustering coefficient for the whole network is the average (3.6) C = 1 n i Ci. This definition effectively reverses the order of the operations of taking the ratio of triangles to triples and of averaging over vertices—one here calculates the mean of the ratio, rather than the ratio of the means. It tends to weight the contributions of low-degree vertices more heavily, because such vertices have a small denominator in (3.5) and hence can give quite different results from (3.3). In Table 3.1 we give both measures for a number of networks (denoted C(1) and C(2) in the table). Normally our first definition (3.3) is easier to calculate analytically, but (3.6) is easily calculated on a computer and has found wide use in numerical studies and data analysis. It is important when reading (or writing) literature in this area to be clear about which definition of the clustering coefficient is in use. The difference between the two is illustrated in Figure 3.1. The local clustering Ci above has been used quite widely in its own right in the sociological literature, where it is referred to as the “network density” [362]. Its dependence on the degree ki of the central vertex i has been studied by Dorogovtsev, Goltsev, and Mendes [113] and Szab´o, Alava, and Kert´esz [388]; both groups found that Ci falls off with ki approximately as k−1 i for certain models of scale-free networks (section 3.3.1). Similar behavior has also been observed empirically in real-world networks [348, 349, 396]. In general, regardless of which definition of the clustering coefficient is used, the values tend to be considerably higher than for a random graph with a similar number of vertices and edges. Indeed, it is suspected that for many types of networks the probability that the friend of your friend is also your friend should tend to a nonzero limit as the network becomes large, so that C = O(1) as n → ∞. 12 On the random 11For example, the standard network analysis program UCInet includes a function to calculate this quantity for any network. 12An exception is scale-free networks with Ci ∼ k−1 i , as described above. For such networks (3.3) tends to zero as n → ∞, although (3.6) is still finite
THE STRUC TURE AND FUNCTION OF COMPLEX NETWORKS graph, by contrast, C=O(n-)for large n(either definition of C)and hence the real-world and random graph values can be expected to differ by a factor of order This point is discussed further in section 4.1 The clustering coefficient measures the density of triangles in a network. An obvious generalization is to ask about the density of longer loops also: loops of length four and above. A number of authors have looked at such higher order clustering coefficients [ 54, 79, 165, 172, 316, although there is so far no clean theory, similar to a cumulant expansion, that separates the independent contributions of the various orders from one another. If more than one edge is permitted between a pair of vertices then there is also a lower order clustering coefficient that describes the density of loops of length two. This coefficient is particularly important in directed graphs where the two edges in question can point in opposite directions. The probability that two vertices in a directed network point to each other is called the reciprocity and is often measured in directed social networks 362, 408. It has been examined occasionally in other contexts too, such as the World Wide Web 3, 137 and email networks 320 3.3. Degree Distributions. Recall that the degree of a vertex in a network is the number of edges incident on (i.e, connected to)that vertex. We define Pk to be the fraction of vertices in the network that have degree k. Equivalently, Pk is the probability that a vertex chosen uniformly at random has degree k. A plot of Pk for any given network can be formed by making a histogram of the degrees of vertices This histogram is the degree distribution for the network. In a random graph of the type studied by Erdos and Renyi 141, 142, 143, each edge is present or absent with equal probability, and hence the degree distribution is, as mentioned earlier, binomial or Poisson in the limit of large graph size. Real-world networks are mostly found to be very unlike the random graph in their degree distributions. Far from having a Poisson distribution, the degrees of the vertices in most networks are highly right skewed, meaning that their distribution has a long right tail of values that are far above the mean Measuring this tail is somewhat tricky. Although in theory one just has to con- struct a histogram of the degrees, in practice one rarely has enough measurements to get good statistics in the tail, and direct histograms are thus usually rather noisy(see the histograms in[74, 148, 342, for example). There are two accepted ways to get around this problem. One is to constructed a histogram in which the bin sizes increase exponentially with degree. For example, the first few bins might cover degree ranges 1, 2-3, 4-7, 8-15, and so on. The number of samples in each bin is then divided by the width of the bin to normalize the measurement this method of constructing a histogram is often used when the histogram is to be plotted with a logarithmic degree scale, so that the widths of the bins will appear even. Because the bins get wider as ve get out into the tail, the problems with statistics are reduced, although they are still present to some extent as long as pk falls off faster than k-l, which it must if the distribution is to be integrable An alternative way of presenting degree data is to make a plot of the cumulative distribution function which is the probability that the degree is greater than or equal to k. Such a plot has the advantage that all the original data are represented. When we make a conventional histogram by binning, any differences between the values of data points that fall
THE STRUCTURE AND FUNCTION OF COMPLEX NETWORKS 185 graph, by contrast, C = O(n−1) for large n (either definition of C) and hence the real-world and random graph values can be expected to differ by a factor of order n. This point is discussed further in section 4.1. The clustering coefficient measures the density of triangles in a network. An obvious generalization is to ask about the density of longer loops also: loops of length four and above. A number of authors have looked at such higher order clustering coefficients [54, 79, 165, 172, 316], although there is so far no clean theory, similar to a cumulant expansion, that separates the independent contributions of the various orders from one another. If more than one edge is permitted between a pair of vertices, then there is also a lower order clustering coefficient that describes the density of loops of length two. This coefficient is particularly important in directed graphs where the two edges in question can point in opposite directions. The probability that two vertices in a directed network point to each other is called the reciprocity and is often measured in directed social networks [362, 408]. It has been examined occasionally in other contexts too, such as the World Wide Web [3, 137] and email networks [320]. 3.3. Degree Distributions. Recall that the degree of a vertex in a network is the number of edges incident on (i.e., connected to) that vertex. We define pk to be the fraction of vertices in the network that have degree k. Equivalently, pk is the probability that a vertex chosen uniformly at random has degree k. A plot of pk for any given network can be formed by making a histogram of the degrees of vertices. This histogram is the degree distribution for the network. In a random graph of the type studied by Erd˝os and R´enyi [141, 142, 143], each edge is present or absent with equal probability, and hence the degree distribution is, as mentioned earlier, binomial, or Poisson in the limit of large graph size. Real-world networks are mostly found to be very unlike the random graph in their degree distributions. Far from having a Poisson distribution, the degrees of the vertices in most networks are highly rightskewed, meaning that their distribution has a long right tail of values that are far above the mean. Measuring this tail is somewhat tricky. Although in theory one just has to construct a histogram of the degrees, in practice one rarely has enough measurements to get good statistics in the tail, and direct histograms are thus usually rather noisy (see the histograms in [74, 148, 342], for example). There are two accepted ways to get around this problem. One is to constructed a histogram in which the bin sizes increase exponentially with degree. For example, the first few bins might cover degree ranges 1, 2–3, 4–7, 8–15, and so on. The number of samples in each bin is then divided by the width of the bin to normalize the measurement. This method of constructing a histogram is often used when the histogram is to be plotted with a logarithmic degree scale, so that the widths of the bins will appear even. Because the bins get wider as we get out into the tail, the problems with statistics are reduced, although they are still present to some extent as long as pk falls off faster than k−1, which it must if the distribution is to be integrable. An alternative way of presenting degree data is to make a plot of the cumulative distribution function (3.7) Pk = ∞ k=k pk , which is the probability that the degree is greater than or equal to k. Such a plot has the advantage that all the original data are represented. When we make a conventional histogram by binning, any differences between the values of data points that fall in
the same bin are lost. The cumulative distribution function does not suffer from this problem. The cumulative distribution also reduces the noise in the tail. On the downside, the plot doesn't give a direct visualization of the degree distribution itself. and adjacent points on the plot are not statistically independent, making correct fits to the data tricky. In Figure 3.2 we show cumulative distributions of degree for a number of the networks described in section 2. As the figure shows, the distributions are indeed all right-skewed. Many of them follow power laws in their tails: Pk Nk-a for some constant exponent a. Note that such power-law distributions show up as power laws in the cumulative distributions also, but with exponent a-1 rather than a (38) Some of the other distributions have exponential tails: pk N e-k/*. These also give exponentials in the cumulative distribution, but with the same exponent B=∑队~∑c~e-k This makes power-law and exponential distributions particularly easy to spot experi- mentally, by plotting the corresponding cumulative distributions on logarithmic scales (for power laws) or semilogarithmic scales(for exponentials) For other types of networks degree distributions can be more complicated. For bipartite graphs, for instance( section 1.1), there are two degree distributions, one for each type of vertex. For directed graphs each vertex has both an in-degree and n out-degree, and the degree distribution therefore becomes a function Pik of two variables, representing the fraction of vertices that simultaneously have in-degree j and out-degree k. In empirical studies of directed graphs like the Web, researchers have usually given only the individual distributions of in- and out-degree[ 14, 34, 74 i. e, the distributions derived by summing Pik over one or other of its indices. This however, discards much of the information present in the joint distribution. It has been found that in- and out-degrees are quite strongly correlated in some networks 320 which suggests that there is more to be gleaned from the joint distribution than is 3.3.. Scale-Free Networks. Networks with power-law degree distributions have been the focus of a great deal of attention in the literature [13, 120, 386. They are sometimes referred to as scale-free networks 32, although it is only their degree distributions that are scale-free 3 one can and usually does have scales present in other network properties. The earliest published example of a scale-free network is probably Price's network of citations between scientific papers 342(see section 2.2) He quoted a value of a= 2.5 to 3 for the exponent of his network. In a later paper he quoted a more accurate figure of a= 3.04 343. He also found a power-law distribution for the out-degree of the network(number of bibliography entries in each paper), although later work has called this into question [395. More recently, power- law degree distributions have been observed in a host of other networks, including 13The term"scale-free" refers to any functional form f(r)that remains unchanged to within a multiplicative factor under a rescaling of the independent variable r. In effect this means power-law forms, since these are the only solutions to f(ar)=bf(r), and hence "power-law"and"scale-free" for
186 M. E. J. NEWMAN the same bin are lost. The cumulative distribution function does not suffer from this problem. The cumulative distribution also reduces the noise in the tail. On the downside, the plot doesn’t give a direct visualization of the degree distribution itself, and adjacent points on the plot are not statistically independent, making correct fits to the data tricky. In Figure 3.2 we show cumulative distributions of degree for a number of the networks described in section 2. As the figure shows, the distributions are indeed all right-skewed. Many of them follow power laws in their tails: pk ∼ k−α for some constant exponent α. Note that such power-law distributions show up as power laws in the cumulative distributions also, but with exponent α − 1 rather than α: (3.8) Pk ∼ ∞ k=k k−α ∼ k−(α−1). Some of the other distributions have exponential tails: pk ∼ e−k/κ. These also give exponentials in the cumulative distribution, but with the same exponent: (3.9) Pk = ∞ k=k pk ∼ ∞ k=k e−k /κ ∼ e−k/κ. This makes power-law and exponential distributions particularly easy to spot experimentally, by plotting the corresponding cumulative distributions on logarithmic scales (for power laws) or semilogarithmic scales (for exponentials). For other types of networks degree distributions can be more complicated. For bipartite graphs, for instance (section 1.1), there are two degree distributions, one for each type of vertex. For directed graphs each vertex has both an in-degree and an out-degree, and the degree distribution therefore becomes a function pjk of two variables, representing the fraction of vertices that simultaneously have in-degree j and out-degree k. In empirical studies of directed graphs like the Web, researchers have usually given only the individual distributions of in- and out-degree [14, 34, 74], i.e., the distributions derived by summing pjk over one or other of its indices. This, however, discards much of the information present in the joint distribution. It has been found that in- and out-degrees are quite strongly correlated in some networks [320], which suggests that there is more to be gleaned from the joint distribution than is normally appreciated. 3.3.1. Scale-Free Networks. Networks with power-law degree distributions have been the focus of a great deal of attention in the literature [13, 120, 386]. They are sometimes referred to as scale-free networks [32], although it is only their degree distributions that are scale-free;13 one can and usually does have scales present in other network properties. The earliest published example of a scale-free network is probably Price’s network of citations between scientific papers [342] (see section 2.2). He quoted a value of α = 2.5 to 3 for the exponent of his network. In a later paper he quoted a more accurate figure of α = 3.04 [343]. He also found a power-law distribution for the out-degree of the network (number of bibliography entries in each paper), although later work has called this into question [395]. More recently, powerlaw degree distributions have been observed in a host of other networks, including 13The term “scale-free” refers to any functional form f(x) that remains unchanged to within a multiplicative factor under a rescaling of the independent variable x. In effect this means power-law forms, since these are the only solutions to f(ax) = bf(x), and hence “power-law” and “scale-free” are, for our purposes, synonymous