arXiv:cond-mat/0308217v1 [cond-mat.stat-mech] 11 Aug 2003

Finding and evaluating community structure in networks

M. E. J. Newman(1,2) and M. Girvan(2,3)

(1) Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109–1120
(2) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501
(3) Department of Physics, Cornell University, Ithaca, NY 14853–2501

We propose and study a set of algorithms for discovering community structure in networks—natural divisions of network nodes into densely connected subgroups. Our algorithms all share two definitive features: first, they involve iterative removal of edges from the network to split it into communities, the edges removed being identified using one of a number of possible “betweenness” measures, and second, these measures are, crucially, recalculated after each removal. We also propose a measure for the strength of the community structure found by our algorithms, which gives us an objective metric for choosing the number of communities into which a network should be divided. We demonstrate that our algorithms are highly effective at discovering community structure in both computer-generated and real-world network data, and show how they can be used to shed light on the sometimes dauntingly complex structure of networked systems.

I. INTRODUCTION

Empirical studies and theoretical modeling of networks have been the subject of a large body of recent research in statistical physics and applied mathematics [1, 2, 3, 4]. Network ideas have been applied with great success to topics as diverse as the Internet and the world wide web [5, 6, 7], epidemiology [8, 9, 10, 11], scientific citation and collaboration [12, 13], metabolism [14, 15], and ecosystems [16, 17], to name but a few. A property that seems to be common to many networks is community structure, the division of network nodes into groups within which the network connections are dense, but between which they are sparser—see Fig.
1. The ability to find and analyze such groups can provide invaluable help in understanding and visualizing the structure of networks. In this paper we show how this can be achieved.

FIG. 1: A small network with community structure of the type considered in this paper. In this case there are three communities, denoted by the dashed circles, which have dense internal links but between which there is only a lower density of external links.

The study of community structure in networks has a long history. It is closely related to the ideas of graph partitioning in graph theory and computer science, and hierarchical clustering in sociology [18, 19]. Before presenting our own findings, it is worth reviewing some of this preceding work, to understand its achievements and where it falls short.

Graph partitioning is a problem that arises in, for example, parallel computing. Suppose we have a number n of intercommunicating computer processes, which we wish to distribute over a number g of computer processors. Processes do not necessarily need to communicate with all others, and the pattern of required communications can be represented by a graph or network in which the vertices represent processes and edges join process pairs that need to communicate. The problem is to allocate the processes to processors in such a way as roughly to balance the load on each processor, while at the same time minimizing the number of edges that run between processors, so that the amount of interprocessor communication (which is normally slow) is minimized. In general, finding an exact solution to a partitioning task of this kind is believed to be an NP-complete problem, making it prohibitively difficult to solve for large graphs, but a wide variety of heuristic algorithms have been developed that give acceptably good solutions in many cases, the best known being perhaps the Kernighan–Lin algorithm [20], which runs in time O(n^3) on sparse graphs.
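The partitioning objective just described can be made concrete with a small example. The sketch below is our own illustration (with a made-up graph), not the Kernighan–Lin algorithm itself: it simply counts the "cut" edges that run between two processor groups, the quantity that partitioning heuristics try to minimize while keeping the groups balanced.

```python
# Toy illustration of the graph-partitioning objective: given a fixed
# bisection of the vertices, count the edges that cross between the two
# halves. The graph and partitions here are hypothetical examples.

def cut_size(edges, side):
    """Number of edges whose endpoints lie on opposite sides.

    edges -- list of (u, v) vertex pairs
    side  -- dict mapping each vertex to group 0 or group 1
    """
    return sum(1 for u, v in edges if side[u] != side[v])

# Two dense groups {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # the natural split
bad  = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}   # mixes the two groups

print(cut_size(edges, good))  # -> 1
print(cut_size(edges, bad))   # -> 5
```

A heuristic such as Kernighan–Lin would repeatedly swap vertex pairs between the two sides whenever a swap reduces this cut size, subject to the balance constraint.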
A solution to the graph partitioning problem is, however, not particularly helpful for analyzing and understanding networks in general. If we merely want to find if and how a given network breaks down into communities, we probably don’t know how many such communities there are going to be, and there is no reason why they should be roughly the same size. Furthermore, the number of inter-community edges needn’t be strictly minimized either, since more such edges are admissible between large communities than between small ones.

As far as our goals in this paper are concerned, a more useful approach is that taken by social network analysis with the set of techniques known as hierarchical clustering. These techniques are aimed at discovering natural divisions of (social) networks into groups, based on various metrics of similarity or strength of connection between vertices.

FIG. 2: A hierarchical tree or dendrogram illustrating the type of output generated by the algorithms described here. The circles at the bottom of the figure represent the individual vertices of the network. As we move up the tree the vertices join together to form larger and larger communities, as indicated by the lines, until we reach the top, where all are joined together in a single community. Alternatively, the dendrogram depicts an initially connected network splitting into smaller and smaller communities as we go from top to bottom. A cross-section of the tree at any level, as indicated by the dotted line, will give the communities at that level. The vertical heights of the split-points in the tree are indicative only of the order in which the splits (or joins) took place, although it is possible to construct more elaborate dendrograms in which these heights contain other information.

These techniques fall into two broad classes, agglomerative and divisive [19], depending on whether they focus on the addition or removal of edges to or from the network. In an agglomerative method, similarities are calculated by one method or another between vertex pairs, and edges are then added to an initially empty network (n vertices with no edges) starting with the vertex pairs with highest similarity. The procedure can be halted at any point, and the resulting components in the network are taken to be the communities. Alternatively, the entire progression of the algorithm from empty graph to complete graph can be represented in the form of a tree or dendrogram such as that shown in Fig. 2. Horizontal cuts through the tree represent the communities appropriate to different halting points.

Agglomerative methods based on a wide variety of similarity measures have been applied to different networks. Some networks have natural similarity metrics built in.
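The agglomerative procedure just described is straightforward to sketch in code. The following is our own minimal illustration, with hypothetical similarity scores, not a particular method from the literature: vertex pairs are joined in decreasing order of similarity, a union-find structure tracks the growing components, and halting at a chosen similarity threshold yields the communities at that level of the dendrogram.

```python
# Minimal sketch of agglomerative clustering: starting from n isolated
# vertices, add edges between vertex pairs in decreasing order of
# similarity; the connected components at the halting point are the
# communities. The similarity scores below are hypothetical.

def agglomerate(n, similarities, threshold):
    """similarities: dict {(u, v): score}; merge while score >= threshold."""
    parent = list(range(n))

    def find(x):                      # root of x's component, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), s in sorted(similarities.items(), key=lambda kv: -kv[1]):
        if s < threshold:             # halting point: stop adding edges
            break
        parent[find(u)] = find(v)     # merge the two components

    groups = {}                       # collect the resulting components
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

sims = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.85, (2, 3): 0.1}
print(agglomerate(5, sims, threshold=0.5))   # -> [[0, 1, 2], [3, 4]]
```

Any measure of vertex similarity can drive the merging; what varies from network to network is the choice of that metric.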
For example, in the widely studied network of collaborations between film actors [21, 22], in which two actors are connected if they have appeared in the same film, one could quantify similarity by how many films actors have appeared in together [23]. Other networks have no natural metric, but suitable ones can be devised using correlation coefficients, path lengths, or matrix methods. A well-known example of an agglomerative clustering method is the Concor algorithm of Breiger et al. [24].

Agglomerative methods have their problems, however. One concern is that they fail with some frequency to find the correct communities in networks where the community structure is known, which makes it difficult to place much trust in them in other cases. Another is their tendency to find only the cores of communities and leave out the periphery. The core nodes in a community often have strong similarity, and hence are connected early in the agglomerative process, but peripheral nodes that have no strong similarity to others tend to get neglected, leading to structures like that shown in Fig. 3. In this figure, there are a number of peripheral nodes whose community membership is obvious to the eye—in most cases they have only a single link to a specific community—but agglomerative methods often fail to place such nodes correctly.

FIG. 3: Agglomerative clustering methods are typically good at discovering the strongly linked cores of communities (bold vertices and edges) but tend to leave out peripheral vertices, even when, as here, most of them clearly belong to one community or another.

In this paper, therefore, we focus on divisive methods. These methods have been relatively little studied in the previous literature, either in social network theory or elsewhere, but, as we will see, seem to offer a lot of promise. In a divisive method, we start with the network of interest and attempt to find the least similar connected pairs of vertices and then remove the edges between them.
By doing this repeatedly, we divide the network into smaller and smaller components, and again we can stop the process at any stage and take the components at that stage to be the network communities. Again, the process can be represented as a dendrogram depicting the successive splits of the network into smaller and smaller groups. The approach we take follows roughly these lines, but adopts a somewhat different philosophical viewpoint. Rather than looking for the most weakly connected vertex pairs, our approach will be to look for the edges in the network that are most “between” other vertices, meaning that the edge is, in some sense, responsible for connecting many pairs of others. Such edges need not be weak at all in the similarity sense. How this idea works out in practice will become clear in the course of the presentation. Briefly then, the outline of this paper is as follows. In Sec. II we describe the crucial concepts behind our methods for finding community structure in networks and show how these concepts can be turned into a concrete prescription for performing calculations. In Sec. III we describe in detail the implementation of our methods. In
Sec. IV we consider ways of determining when a particular division of a network into communities is a good one, allowing us to quantify the success of our community-finding algorithms. And in Sec. V we give a number of applications of our algorithms to particular networks, both real and artificial. In Sec. VI we give our conclusions. A brief report of some of the work contained in this paper has appeared previously as Ref. 25.

II. FINDING COMMUNITIES IN A NETWORK

In this paper we present a class of new algorithms for network clustering, i.e., the discovery of community structure in networks. Our discussion focuses primarily on networks with only a single type of vertex and a single type of undirected, unweighted edge, although generalizations to more complicated network types are certainly possible.

There are two central features that distinguish our algorithms from those that have preceded them. First, our algorithms are divisive rather than agglomerative. Divisive algorithms have occasionally been studied in the past, but, as discussed in the introduction, ours differ in focusing not on removing the edges between vertex pairs with lowest similarity, but on finding edges with the highest “betweenness,” where betweenness is some measure that favors edges that lie between communities and disfavors those that lie inside communities.

To make things more concrete, we give some examples of the types of betweenness measures we will be looking at. All of them are based on the same idea. If two communities are joined by only a few inter-community edges, then all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges. Given a suitable set of paths, one can count how many go along each edge in the graph, and this number we then expect to be largest for the inter-community edges, thus providing a method for identifying them. Our different measures correspond to various implementations of this idea.

1.
The simplest example of such a betweenness measure is that based on shortest (geodesic) paths: we find the shortest paths between all pairs of vertices and count how many run along each edge. To the best of our knowledge this measure was first introduced by Anthonisse in a never-published technical report in 1971 [26]. Anthonisse called it “rush,” but we prefer the term edge betweenness, since the quantity is a natural generalization to edges of the well-known (vertex) betweenness measure of Freeman [27], which was the inspiration for our approach. When we need to distinguish it from the other betweenness measures considered in this paper, we will refer to it as shortest-path betweenness. A fast algorithm for calculating the shortest-path betweenness is given in Sec. III A.

2. The shortest-path betweenness can be thought of in terms of signals traveling through a network. If signals travel from source to destination along geodesic network paths, and all vertices send signals at the same constant rate to all others, then the betweenness is a measure of the rate at which signals pass along each edge. Suppose however that signals do not travel along geodesic paths, but instead just perform a random walk about the network until they reach their destination. This gives us another measure on edges, the random-walk betweenness: we calculate the expected net number of times that a random walk between a particular pair of vertices will pass down a particular edge and sum over all vertex pairs. The random-walk betweenness can be calculated using matrix methods, as described in Sec. III C.

3. Another betweenness measure is motivated by ideas from elementary circuit theory. We consider the circuit created by placing a unit resistance on each edge of the network and unit current source and sink at a particular pair of vertices.
The resulting current flow in the network will travel from source to sink along a multitude of paths, those with least resistance carrying the greatest fraction of the current. The current-flow betweenness for an edge we define to be the absolute value of the current along the edge summed over all source/sink pairs. It can be calculated using Kirchhoff’s laws, as described in Sec. III B. In fact, as we will show, the current-flow betweenness turns out to be exactly the same as the random-walk betweenness of the previous paragraph, but we nonetheless consider it separately since it leads to a simpler derivation of the measure.

These measures are only suggestions; many others are possible and may well be appropriate for specific applications. Measures (1) and (2) are in some sense extremes in the spectrum of possibilities, one corresponding to signals that know exactly where they are going, and the other to signals that have no idea where they are going. As we will see, however, these two measures actually give rather similar results, indicating that the precise choice of betweenness measure may not, at least for the types of applications considered here, be that important.

The second way in which our methods differ from previous ones is in the inclusion of a “recalculation step” in the algorithm. If we were to perform a standard divisive clustering based on edge betweenness we would calculate the edge betweenness for all edges in the network and then remove edges in decreasing order of betweenness to produce a dendrogram like that of Fig. 2, showing the order in which the network split up.

However, once the first edge in the network is removed in such an algorithm, the betweenness values for the remaining edges will no longer reflect the network as it now is. This can give rise to unwanted behaviors. For example, if two communities are joined by two edges, but, for one reason or another, most paths between the two flow along just one of those edges, then that edge will have a high betweenness score and the other will not. An algorithm that calculated betweennesses only once and then removed edges in betweenness order would remove the first edge early in the course of its operation, but the second might not get removed until much later. Thus the obvious division of the network into two parts might not be discovered by the algorithm. In the worst case the two parts themselves might be individually broken up before the division between the two is made. In practice, problems like this crop up in real networks with some regularity and render algorithms of this type ineffective for the discovery of community structure.

The solution, luckily, is obvious. We simply recalculate our betweenness measure after the removal of each edge. This certainly adds to the computational effort of performing the calculation, but its effect on the results is so desirable that we consider the price worth paying. Thus the general form of our community structure finding algorithm is as follows:

1. Calculate betweenness scores for all edges in the network.

2. Find the edge with the highest score and remove it from the network.

3. Recalculate betweenness for all remaining edges.

4. Repeat from step 2.

In fact, it appears that the recalculation step is the most important feature of the algorithm, as far as getting satisfactory results is concerned. As mentioned above, our studies indicate that, once one hits on the idea of using betweenness measures to weight edges, the exact measure one uses appears not to influence the results highly. The recalculation step, on the other hand, is absolutely crucial to the operation of our methods. This step was missing from previous attempts at solving the clustering problem using divisive algorithms, and yet without it the results are very poor indeed, failing to find known community structure even in the simplest of cases. In Sec. V B we give an example comparing the performance of the algorithm on a particular network with and without the recalculation step.

In the following sections we discuss implementation and give examples of our algorithms for finding community structure. For the reader who merely wants to know what algorithm they should use for their own problem, let us give an immediate answer: for most problems, we recommend the algorithm with betweenness scores calculated using the shortest-path betweenness measure (1) above. This measure appears to work well and is the quickest to calculate—as described in Sec. III A, it can be calculated for all edges in time O(mn), where m is the number of edges in the graph and n is the number of vertices. This is the only version of the algorithm that we discussed in Ref. 25 [47]. The other versions we discuss, while being of some pedagogical interest, make greater computational demands, and in practice seem to give results no better than the shortest-path method.

III. IMPLEMENTATION

In theory, the descriptions of the last section completely define the methods we consider in this paper, but in practice there are a number of tricks to their implementation that are important for turning the description into a workable computer algorithm.

Essentially all of the work in the algorithm is in the calculation of the betweenness scores for the edges; the job of finding and removing the highest-scoring edge is trivial and not computationally demanding. Let us tackle our three suggested betweenness measures in turn.

A. Shortest-path betweenness

At first sight, it appears that calculating the edge betweenness measure based on geodesic paths for all edges will take O(mn^2) operations on a graph with m edges and n vertices: calculating the shortest path between a particular pair of vertices can be done using breadth-first search in time O(m) [28, 29], and there are O(n^2) vertex pairs. Recently, however, new algorithms have been proposed by Newman [30] and independently by Brandes [31] that can perform the calculation faster than this, finding all betweennesses in O(mn) time. Both Newman and Brandes gave algorithms for the standard Freeman vertex betweenness, but it is trivial to adapt their algorithms for edge betweenness. We describe the resulting method here for the algorithm of Newman.

Breadth-first search can find shortest paths from a single vertex s to all others in time O(m). In the simplest case, when there is only a single shortest path from the source vertex to any other (we will consider other cases in a moment), the resulting set of paths forms a shortest-path tree—see Fig. 4a. We can now use this tree to calculate the contribution to betweenness for each edge from this set of paths as follows. We find first the “leaves” of the tree, i.e., those nodes such that no shortest paths to other nodes pass through them, and we assign a score of 1 to the single edge that connects each to the rest of the tree, as shown in the figure. Then, starting with those edges that are farthest from the source vertex on the tree, i.e., lowest in Fig. 4a, we work upwards, assigning a score to each edge that is 1 plus the sum of the scores on the neighboring edges immediately below it. When we have gone through all edges in the tree, the resulting scores are the betweenness counts for the paths from vertex s. Repeating the process for all possible vertices s and summing the scores, we arrive at the full betweenness scores for shortest paths between all pairs. The breadth-first search and the process of working up through the tree
4 ple, if two communities are joined by two edges, but, for one reason or another, most paths between the two flow along just one of those edges, then that edge will have a high betweenness score and the other will not. An algorithm that calculated betweennesses only once and then removed edges in betweenness order would remove the first edge early in the course of its operation, but the second might not get removed until much later. Thus the obvious division of the network into two parts might not be discovered by the algorithm. In the worst case the two parts themselves might be individually broken up before the division between the two is made. In practice, problems like this crop up in real networks with some regularity and render algorithms of this type ineffective for the discovery of community structure. The solution, luckily, is obvious. We simply recalculate our betweenness measure after the removal of each edge. This certainly adds to the computational effort of performing the calculation, but its effect on the results is so desirable that we consider the price worth paying. Thus the general form of our community structure finding algorithm is as follows: 1. Calculate betweenness scores for all edges in the network. 2. Find the edge with the highest score and remove it from the network. 3. Recalculate betweenness for all remaining edges. 4. Repeat from step 2. In fact, it appears that the recalculation step is the most important feature of the algorithm, as far as getting satisfactory results is concerned. As mentioned above, our studies indicate that, once one hits on the idea of using betweenness measures to weight edges, the exact measure one uses appears not to influence the results highly. The recalculation step, on the other hand, is absolutely crucial to the operation of our methods. 
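To make steps 1-4 concrete, here is a minimal pure-Python sketch of the divisive procedure. This is our own illustration, not the authors' reference code: the graph representation (a dict of neighbor sets) and all function names are our own choices, and the betweenness routine is a compact Brandes-style pass rather than an optimized implementation.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness, one BFS per source (Brandes-style).
    adj is a dict mapping each vertex to the set of its neighbors."""
    bet = {}
    for s in adj:
        # Breadth-first search: distances, shortest-path counts, predecessors.
        dist, sigma = {s: 0}, {s: 1}
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w] = dist[v] + 1, 0
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Work back from the farthest vertices, accumulating edge scores.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                e = (v, w) if v < w else (w, v)   # undirected edge key
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    return bet

def components(adj):
    """Connected components of the graph, as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v])
        seen |= comp
        comps.append(comp)
    return comps

def first_split(adj):
    """Steps 1-4: remove highest-betweenness edges, recalculating after
    each removal, until the network first splits into communities."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while len(components(adj)) == 1:
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)              # step 2: highest score
        adj[u].discard(v)
        adj[v].discard(u)                         # steps 3-4: loop and recalc
    return components(adj)
```

On a toy network of two triangles joined by a single bridge, all inter-community shortest paths cross the bridge, so it has the highest betweenness, is removed first, and the two triangles emerge as the communities.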
This step was missing from previous attempts at solving the clustering problem using divisive algorithms, and yet without it the results are very poor indeed, failing to find known community structure even in the simplest of cases. In Sec. V B we give an example comparing the performance of the algorithm on a particular network with and without the recalculation step.

In the following sections we discuss implementation and give examples of our algorithms for finding community structure. For the reader who merely wants to know what algorithm they should use for their own problem, let us give an immediate answer: for most problems, we recommend the algorithm with betweenness scores calculated using the shortest-path betweenness measure (1) above. This measure appears to work well and is the quickest to calculate—as described in Sec. III A, it can be calculated for all edges in time O(mn), where m is the number of edges in the graph and n is the number of vertices. This is the only version of the algorithm that we discussed in Ref. 25 [47]. The other versions we discuss, while being of some pedagogical interest, make greater computational demands, and in practice seem to give results no better than the shortest-path method.

III. IMPLEMENTATION

In theory, the descriptions of the last section completely define the methods we consider in this paper, but in practice there are a number of tricks to their implementation that are important for turning the description into a workable computer algorithm. Essentially all of the work in the algorithm is in the calculation of the betweenness scores for the edges; the job of finding and removing the highest-scoring edge is trivial and not computationally demanding. Let us tackle our three suggested betweenness measures in turn.

A. Shortest-path betweenness

At first sight, it appears that calculating the edge betweenness measure based on geodesic paths for all edges will take O(mn²) operations on a graph with m edges and n vertices: calculating the shortest path between a particular pair of vertices can be done using breadth-first search in time O(m) [28, 29], and there are O(n²) vertex pairs. Recently, however, new algorithms have been proposed by Newman [30] and independently by Brandes [31] that can perform the calculation faster than this, finding all betweennesses in O(mn) time. Both Newman and Brandes gave algorithms for the standard Freeman vertex betweenness, but it is trivial to adapt their algorithms for edge betweenness. We describe the resulting method here for the algorithm of Newman.

Breadth-first search can find shortest paths from a single vertex s to all others in time O(m). In the simplest case, when there is only a single shortest path from the source vertex to any other (we will consider other cases in a moment) the resulting set of paths forms a shortest-path tree—see Fig. 4a. We can now use this tree to calculate the contribution to betweenness for each edge from this set of paths as follows. We find first the "leaves" of the tree, i.e., those nodes such that no shortest paths to other nodes pass through them, and we assign a score of 1 to the single edge that connects each to the rest of the tree, as shown in the figure. Then, starting with those edges that are farthest from the source vertex on the tree, i.e., lowest in Fig. 4a, we work upwards, assigning a score to each edge that is 1 plus the sum of the scores on the neighboring edges immediately below it. When we have gone through all edges in the tree, the resulting scores are the betweenness counts for the paths from vertex s. Repeating the process for all possible vertices s and summing the scores, we arrive at the full betweenness scores for shortest paths between all pairs.
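For the single-shortest-path case just described, the leaf-to-root scoring pass is simple enough to sketch directly. In this illustrative fragment (representation and names are our own, assuming the shortest-path tree is supplied as a child-to-parent map), a leaf edge scores 1 and every other edge scores 1 plus the sum of the scores on the edges immediately below it:

```python
def tree_edge_scores(parent):
    """Edge scores for a shortest-path tree given as {child: parent}.
    Leaf edges score 1; each edge scores 1 plus the sum of the scores
    on the edges immediately below it, i.e. the size of its subtree."""
    def depth(v):                      # distance from v up to the root
        k = 0
        while v in parent:
            v = parent[v]
            k += 1
        return k
    score = {v: 1 for v in parent}     # every vertex contributes one path end
    for v in sorted(parent, key=depth, reverse=True):  # deepest edges first
        p = parent[v]
        if p in parent:                # the root has no incoming edge
            score[p] += score[v]
    return {(parent[v], v): score[v] for v in parent}
```

As a check of the property illustrated in Fig. 4, the scores on the edges attached to the source sum to the number of reachable vertices.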
The breadth-first search and the process of working up through the tree
both take worst-case time O(m) and there are n vertices total, so the entire calculation takes time O(mn) as claimed.

FIG. 4: Calculation of shortest-path betweenness: (a) When there is only a single shortest path from a source vertex s (top) to all other reachable vertices, those paths necessarily form a tree, which makes the calculation of the contribution to betweenness from this set of paths particularly simple, as described in the text. (b) For cases in which there is more than one shortest path to some vertices, the calculation is more complex. First we must calculate the number of paths from the source to each other vertex (numbers on vertices), and then these are used to weight the path counts appropriately. In either case, we can check the results by confirming that the sum of the betweennesses of the edges connected to the source vertex is equal to the total number of reachable vertices—six in each of the cases illustrated here.

This simple case serves to illustrate the basic principle behind the algorithm. In general, however, it is not the case that there is only a single shortest path between any pair of vertices. Most networks have at least some vertex pairs between which there are several geodesic paths of equal length. Figure 4b shows a simple example of a shortest-path "tree" for a network with this property. The resulting structure is in fact no longer a tree, and in such cases an extra step is required in the algorithm to calculate the betweenness correctly.

In the traditional definition of vertex betweenness [27] multiple shortest paths between a pair of vertices are given equal weights summing to 1. For example, if there are three shortest paths, each will be given weight 1/3. We adopt the same definition for our edge betweenness (as did Anthonisse in his original work [26], although other definitions are possible [32]).
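To make this equal-weighting rule concrete, the following pure-Python sketch (our own illustration, with our own names, not the authors' code) counts the number of shortest paths from a source to every vertex with a breadth-first sweep, then distributes edge scores in proportion to those counts while working back from the farthest vertices:

```python
from collections import deque

def source_contribution(adj, s):
    """Contribution to edge betweenness from shortest paths starting at s.
    adj maps each vertex to the set of its neighbors."""
    # Breadth-first sweep: w[v] counts the shortest paths from s to v.
    dist, w = {s: 0}, {s: 1}
    order, q = [], deque([s])
    while q:
        v = q.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in dist:
                dist[u], w[u] = dist[v] + 1, 0
                q.append(u)
            if dist[u] == dist[v] + 1:
                w[u] += w[v]          # another shortest path reaches u
    # Work back from the farthest vertices: the edge (i, j), with j
    # farther from s, receives (1 + scores just below j) * w[i] / w[j].
    score, below = {}, {v: 0.0 for v in order}
    for j in reversed(order):
        if j == s:
            continue
        for i in adj[j]:
            if dist.get(i) == dist[j] - 1:      # i is one step nearer s
                c = (1.0 + below[j]) * w[i] / w[j]
                score[(i, j)] = c
                below[i] += c
    return score
```

On the square graph s-a-t, s-b-t there are two equal-length shortest paths from s to t, so w[t] = 2 and each edge into t receives a half share; the scores on the two edges at s sum to 3, the number of reachable vertices, matching the check described in the caption of Fig. 4.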
Note that the paths may run along the same edge or edges for some part of their length, resulting in edges with greater weight. To calculate correctly what fraction of the paths flow along each edge in the network, we generalize the breadth-first search part of the calculation, as follows.

Consider Fig. 4b and suppose we are performing a breadth-first search starting at vertex s. We carry out the following steps:

1. The initial vertex s is given distance ds = 0 and a weight ws = 1.

2. Every vertex i adjacent to s is given distance di = ds + 1 = 1, and weight wi = ws = 1.

3. For each vertex j adjacent to one of those vertices i we do one of three things:

(a) If j has not yet been assigned a distance, it is assigned distance dj = di + 1 and weight wj = wi.

(b) If j has already been assigned a distance and dj = di + 1, then the vertex's weight is increased by wi, that is wj ← wj + wi.

(c) If j has already been assigned a distance and dj < di + 1, we do nothing.

4. Repeat from step 3 until no vertices remain that have assigned distances but whose neighbors do not have assigned distances.

In practice, this algorithm can be implemented most efficiently using a queue or first-in/first-out buffer to store the vertices that have been assigned a distance, just as in the standard breadth-first search.

Physically, the weight on a vertex i represents the number of distinct paths from the source vertex to i. These weights are precisely what we need to calculate our edge betweennesses, because if two vertices i and j are connected, with j farther than i from the source s, then the fraction of a geodesic path from j through i to s is given by wi/wj. Thus, to calculate the contribution to edge betweenness from all shortest paths starting at s, we need only carry out the following steps:

1. Find every "leaf" vertex t, i.e., a vertex such that no paths from s to other vertices go through t.

2. For each vertex i neighboring t assign a score to the edge from t to i of wi/wt.

3. Now, starting with the edges that are farthest from the source vertex s—lower down in a diagram such as Fig. 4b—work up towards s. To the edge from vertex i to vertex j, with j being farther from s than i, assign a score that is 1 plus the sum of the scores on the neighboring edges immediately below it (i.e., those with which it shares a common vertex), all multiplied by wi/wj.

4. Repeat from step 3 until vertex s is reached.

Now repeating this process for all n source vertices s and summing the resulting scores on the edges gives us the total betweenness for all edges in time O(mn).

We now have to repeat this calculation for each edge removed from the network, of which there are m, and hence the complete community structure algorithm based on shortest-path betweenness operates in worst-case time O(m²n), or O(n³) time on a sparse graph. In our experience, this typically makes it tractable for networks of up