to about n= 10000 vertices, with current(circa 2003) desktop computers. In some special cases one can do current in better. In particular, we note that the removal of an edge only affects the betweenness of other edges that fall the same component, and hence that we need only recalculate betweennesses in that component. Networks with strong community structure often break apart i parate components quite early in the progress of the al- gorithm, substantially reducing the amount of work that needs to be done on subsequent steps. Whether this re- sults in a change in the computational complexity of the current out algorithm for any commonly occurring classes of graphs is an open question, but it certainly gives a substantial FIG. 5: An example of the type of resistor network considered speed boost to many of the calculations described in this here, in which a unit resistance is placed on each edge and unit current flows into and out of the source and sink vertices Some networks are directed, i.e., their edges run in one direction only. The world wide web is an example: links in the web point in one direction only from one web use the absolute magnitude of the current flow as our page to another. One could imagine a generalization of betweenness score for each source/sink pair the shortest-path betweenness that allowed for directed The current flows in the network are governed by edges by counting only those paths that travel in the Kirchhoffs laws. To solve them we proceed as follows forward direction along edges. Such a calculation is a for each separate component of the graph. Let V be the trivial variation on the one described above. However voltage at vertex i, measured relative to any convenient we have found that in many cases it is better to ignore point. Then for all i we have the directed nature of a network in calculating commu nity structure. Often an edge acts simply as an indication Aii(Vi-Vi)=Sis-8it of a connection between two nodes . and its direction is unimportant. For example, in Ref. 25 we applied our where Aii is the ij element of the adjacency matrix of algorithm to a food web of predator-prey interactions the graph, i.e., Aij= l if i and j are connected by an etween marine species. Predator-prey interactions are edge and Ai;=0 otherwise. The left-hand side of Eq(1) clearly directed--one species may eat another, but it is represents the net current How out of vertex i along edges unlikely that the reverse is simultaneously true. However, of the network, and the right-hand side represents the as far as community structure is concerned, we want to source and sink. Defining ki= 2; Aij, which is the know only which species have interactions with which vertex degree, and creating a diagonal matrix D witI others. We find, therefore, that our algorithm applied these degrees on the diagonal Dii= ki, this equation can to the undirected version of the food web works well at be written in matrix form as(D-A).V=s, where the picking out the community structure, and no special al- source vector s has components Sec nple of our method applied to a directed graphi gorithm is needed for the directed case. We give anothe for i=s D fo 0 otherwise We cannot directly invert the matrix D-a to get the B. Resistor networks voltage vector V, because the matrix(which is just the graph Laplacian) is singular. This is equivalent to saying As examples of betweenness measures that take more that there is one undetermined degree of freedom corre- than just shortest paths into account, we proposed in sponding to the choice of reference potential for measur- Sec. II measures based on random walks and resistor net- ing the voltages. We can add any constant to a solution works. In fact, as we now show, when appropriately for the vertex voltages and get another solution--only fined these two measures are precisely the same. Here we the voltage differences matter. In choosing the reference derive the resistance measure first, since it turns out to potential, we fix this degree of freedom, leaving only n-1 be simpler; in the following section we derive the random more to be determined. In mathematical terms, once any walk measure and show that the two are equivalent n-1 of the equations in our matrix formulation are sat- Consider the network created by placing a unit resis- isfied, the remaining one is also automatically satisfied so ta ance on every edge of our network, a unit current source long as current is conserved in the network as a whole. at vertex s, and a unit current sink at vertext(see Fig. 5). i.e., so long as >i si=0, which is clearly true in this Clearly the current between s and t will flow primarily case along short paths, but some will flow along longer ones, Choosing any vertex v to be the reference point, there- roughly in inverse proportion to their length. We will fore, we remove the row and column corresponding
6 to about n = 10 000 vertices, with current (circa 2003) desktop computers. In some special cases one can do better. In particular, we note that the removal of an edge only affects the betweenness of other edges that fall in the same component, and hence that we need only recalculate betweennesses in that component. Networks with strong community structure often break apart into separate components quite early in the progress of the algorithm, substantially reducing the amount of work that needs to be done on subsequent steps. Whether this results in a change in the computational complexity of the algorithm for any commonly occurring classes of graphs is an open question, but it certainly gives a substantial speed boost to many of the calculations described in this paper. Some networks are directed, i.e., their edges run in one direction only. The world wide web is an example; links in the web point in one direction only from one web page to another. One could imagine a generalization of the shortest-path betweenness that allowed for directed edges by counting only those paths that travel in the forward direction along edges. Such a calculation is a trivial variation on the one described above. However, we have found that in many cases it is better to ignore the directed nature of a network in calculating community structure. Often an edge acts simply as an indication of a connection between two nodes, and its direction is unimportant. For example, in Ref. 25 we applied our algorithm to a food web of predator-prey interactions between marine species. Predator-prey interactions are clearly directed—one species may eat another, but it is unlikely that the reverse is simultaneously true. However, as far as community structure is concerned, we want to know only which species have interactions with which others. We find, therefore, that our algorithm applied to the undirected version of the food web works well at picking out the community structure, and no special algorithm is needed for the directed case. We give another example of our method applied to a directed graph in Sec. V D. B. Resistor networks As examples of betweenness measures that take more than just shortest paths into account, we proposed in Sec. II measures based on random walks and resistor networks. In fact, as we now show, when appropriately de- fined these two measures are precisely the same. Here we derive the resistance measure first, since it turns out to be simpler; in the following section we derive the random walk measure and show that the two are equivalent. Consider the network created by placing a unit resistance on every edge of our network, a unit current source at vertex s, and a unit current sink at vertex t (see Fig. 5). Clearly the current between s and t will flow primarily along short paths, but some will flow along longer ones, roughly in inverse proportion to their length. We will t s current in current out FIG. 5: An example of the type of resistor network considered here, in which a unit resistance is placed on each edge and unit current flows into and out of the source and sink vertices. use the absolute magnitude of the current flow as our betweenness score for each source/sink pair. The current flows in the network are governed by Kirchhoff’s laws. To solve them we proceed as follows for each separate component of the graph. Let Vi be the voltage at vertex i, measured relative to any convenient point. Then for all i we have X j Aij (Vi − Vj ) = δis − δit, (1) where Aij is the ij element of the adjacency matrix of the graph, i.e., Aij = 1 if i and j are connected by an edge and Aij = 0 otherwise. The left-hand side of Eq. (1) represents the net current flow out of vertex i along edges of the network, and the right-hand side represents the source and sink. Defining ki = P j Aij , which is the vertex degree, and creating a diagonal matrix D with these degrees on the diagonal Dii = ki , this equation can be written in matrix form as (D − A)· V = s, where the source vector s has components si = ( +1 for i = s −1 for i = t 0 otherwise. (2) We cannot directly invert the matrix D − A to get the voltage vector V, because the matrix (which is just the graph Laplacian) is singular. This is equivalent to saying that there is one undetermined degree of freedom corresponding to the choice of reference potential for measuring the voltages. We can add any constant to a solution for the vertex voltages and get another solution—only the voltage differences matter. In choosing the reference potential, we fix this degree of freedom, leaving only n−1 more to be determined. In mathematical terms, once any n − 1 of the equations in our matrix formulation are satisfied, the remaining one is also automatically satisfied so long as current is conserved in the network as a whole, i.e., so long as P i si = 0, which is clearly true in this case. Choosing any vertex v to be the reference point, therefore, we remove the row and column corresponding to
that vertex from D and a before inverting. Denoting by the is element of M, which we denote [Mlis. In the resulting (n-1)x(n-1)matrices D, and Au, we particular, walks end up at v and w with probabilities can then write Mles and Mws, and of those a fraction 1/ky and 1/kw respectively then pass along the edge(v, w)in one (3) direction or the other. Note that they may also have Calculation of the currents in the network thus involves fore reaching this point. )Summing over all n, the mean and taking the differences of pairs of columns to get tI number of times that a walk of any length traverses the voltage vector V for each possible source/sink pair.(The ge from v to w is k[(i -mi)vs, and similarly for voltage for the one missing vertex v is always zero, by alks that go from w to u. hypothesis. The absolute magnitudes of the differences To highlight the similarity with the current-How be- of voltages along each edge give us betweenness scores tweenness of Sec. III B, let us denote these two numbers for the given source and sink. Summing over all sources and Vw respectively. Then we can write and sinks, we then get our complete betweenness score The matrix inversion takes time o(n )in the worst (I-M)-1·s=(D2-A)-1·s,(4) case, while the subsequent calculation of betweennesses where the source vector s is the vector whose components takes time O(mn2), where as before m is the number of are all 0 except for a single 1 in the position corresponding edges and n the number of vertices in the graph. Thus, to the source vertex s the entire community structure algorithm, including the Now we define our random-walk betweenness for the recalculation step, will take o((n+m)mn2)time to com- edge(u, w)to be the absolute value of the difference of plete, or O(n")on a sparse graph. Although, as we will the two probabilities Vu and Vu, i. e, the net number of ee,the algorithm is good at finding community struc- times the walk passes along the edge in one direction ture, this poor performance makes it practical only for This seems a natural definition-it makes little sense to maller graphs; a few hundreds of vertices is the most accord an edge high betweenness simply because a walk hat we have been able to do. It is for this reason that went back and forth along it many times. It is the differ- we recommend using the shortest-path betweenness al- ence between the numbers of times the edge is traversed gorithm in most cases, which gives results about as good in either direction that matters [48] or better with considerably less effort But now we see that this method is very similar to the resistor network calculation of Sec. IIIB. In that calcu ation we also evaluated(Dt-At-I s for a suitable C. Random walks source vector and then took differences of the resulting numbers. The only difference is that in the current-flow quires us to calculate how often on average random walks Purely for the purposes of mathematical convenience, IF The random-walk betweenness described in Sec. Ii re- calculation we had a sink term in s as well as a source tarting at vertex s will pass down a particular edge from can add such a sink in the present case at the target ver vertex u to vertex w(or vice versa) before finding their tex t-this makes no difference to the solution for V since ay to a given target vertex t. To calculate this quantity the tth row has been removed from the equations anyway we proceed as follows for each separate component of the By doing this, however, we turn the equations into pre cisely the form of the current-Hlow calculation, and hence As before, let Aii be an element of the adjacency y ma- it becomes clear that the two measures are numerically trix such that Aij= l if vertices i and j are connected by identical, although their derivation is quite different. (It an edge and Aii=0 otherwise. Consider a random walk also immediately follows that we can remove any that on each step decides uniformly between the neigh- column and still get the same answer-it doesn bors of the current vertex j and takes a step to one of to be row and column t, although physically this hem. The number of neighbors is just the degree of the makes the most sense vertex k,=2iAij, and the probability for the transition from j to i is Aij/kj, which we can regard as an element of the matrix A.D. where d is the diagonal IV. QUANTIFYING THE STRENGTH OF matrix with d= k COMMUNITY STRUCTURE We are interested in walks that terminate when they reach the target t, so that t is an absorbing state. The As we show in Sec. V, our community structure algo- most convenient way to represent this is just to remove rithms do an excellent job of recovering known communi entirely the vertex t from the graph, so that no walk ties both in artificially generated random networks and reaches any other vertex from t. Thus let Mt= At D real-world examples. However, in practical situations the be the matrix M with the tth row and column removed algorithms will normally be used on networks for which (and similarly for At and Dt) he communities are not known ahead of time. This Now the probability that a walk starts at s, takes n raises a new problem: how do we know when the comm steps, and ends up at some other vertex(not t), is given nities found by the algorithm are good ones? Our algo-
7 that vertex from D and A before inverting. Denoting the resulting (n − 1) × (n − 1) matrices Dv and Av, we can then write V = (Dv − Av) −1 · s. (3) Calculation of the currents in the network thus involves inverting Dv − Av once for any convenient choice of v, and taking the differences of pairs of columns to get the voltage vector V for each possible source/sink pair. (The voltage for the one missing vertex v is always zero, by hypothesis.) The absolute magnitudes of the differences of voltages along each edge give us betweenness scores for the given source and sink. Summing over all sources and sinks, we then get our complete betweenness score. The matrix inversion takes time O(n 3 ) in the worst case, while the subsequent calculation of betweennesses takes time O(mn2 ), where as before m is the number of edges and n the number of vertices in the graph. Thus, the entire community structure algorithm, including the recalculation step, will take O (n+m)mn2 time to complete, or O(n 4 ) on a sparse graph. Although, as we will see, the algorithm is good at finding community structure, this poor performance makes it practical only for smaller graphs; a few hundreds of vertices is the most that we have been able to do. It is for this reason that we recommend using the shortest-path betweenness algorithm in most cases, which gives results about as good or better with considerably less effort. C. Random walks The random-walk betweenness described in Sec. II requires us to calculate how often on average random walks starting at vertex s will pass down a particular edge from vertex v to vertex w (or vice versa) before finding their way to a given target vertex t. To calculate this quantity we proceed as follows for each separate component of the graph. As before, let Aij be an element of the adjacency matrix such that Aij = 1 if vertices i and j are connected by an edge and Aij = 0 otherwise. Consider a random walk that on each step decides uniformly between the neighbors of the current vertex j and takes a step to one of them. The number of neighbors is just the degree of the vertex kj = P i Aij , and the probability for the transition from j to i is Aij/kj , which we can regard as an element of the matrix M = A · D−1 , where D is the diagonal matrix with Dii = ki . We are interested in walks that terminate when they reach the target t, so that t is an absorbing state. The most convenient way to represent this is just to remove entirely the vertex t from the graph, so that no walk ever reaches any other vertex from t. Thus let Mt = At ·D−1 t be the matrix M with the tth row and column removed (and similarly for At and Dt). Now the probability that a walk starts at s, takes n steps, and ends up at some other vertex (not t), is given by the is element of Mn t , which we denote [Mn t ]is. In particular, walks end up at v and w with probabilities [Mn t ]vs and [Mn t ]ws, and of those a fraction 1/kv and 1/kw respectively then pass along the edge (v, w) in one direction or the other. (Note that they may also have passed along this edge an arbitrary number of times before reaching this point.) Summing over all n, the mean number of times that a walk of any length traverses the edge from v to w is k −1 v [(I − Mt) −1 ]vs, and similarly for walks that go from w to v. To highlight the similarity with the current-flow betweenness of Sec. III B, let us denote these two numbers Vv and Vw respectively. Then we can write V = D−1 · (I − Mt) −1 · s = (Dt − At) −1 · s, (4) where the source vector s is the vector whose components are all 0 except for a single 1 in the position corresponding to the source vertex s. Now we define our random-walk betweenness for the edge (v, w) to be the absolute value of the difference of the two probabilities Vv and Vw, i.e., the net number of times the walk passes along the edge in one direction. This seems a natural definition—it makes little sense to accord an edge high betweenness simply because a walk went back and forth along it many times. It is the difference between the numbers of times the edge is traversed in either direction that matters [48]. But now we see that this method is very similar to the resistor network calculation of Sec. III B. In that calculation we also evaluated (Dt − At) −1 · s for a suitable source vector and then took differences of the resulting numbers. The only difference is that in the current-flow calculation we had a sink term in s as well as a source. Purely for the purposes of mathematical convenience, we can add such a sink in the present case at the target vertex t—this makes no difference to the solution for V since the tth row has been removed from the equations anyway. By doing this, however, we turn the equations into precisely the form of the current-flow calculation, and hence it becomes clear that the two measures are numerically identical, although their derivation is quite different. (It also immediately follows that we can remove any row or column and still get the same answer—it doesn’t have to be row and column t, although physically this choice makes the most sense.) IV. QUANTIFYING THE STRENGTH OF COMMUNITY STRUCTURE As we show in Sec. V, our community structure algorithms do an excellent job of recovering known communities both in artificially generated random networks and in real-world examples. However, in practical situations the algorithms will normally be used on networks for which the communities are not known ahead of time. This raises a new problem: how do we know when the communities found by the algorithm are good ones? Our algo-