复旦大学：《网络科学导论 Introduction to Network Science》教学参考文献_Community detaction-fast unfolding.pdf

Fast unfolding of communities in large networks Contents 1. Introduction 2. Method 236 3. Application to large networks 4. Conclusion and discussion Acknowledgments References 1. Introduction Social, technological and information systems can often be described in terms of complex networks that have a topology of interconnected nodes combining organization and 三 and these scales demand new methods to retrieve comprehensive information from their structure. a promising approach consists in decomposing the networks into sub-units or communities, which are sets of highly interconnected nodes [ 3. The identification of these communities is of crucial importance as they may help to uncover a priori unknown functional modules such as topics in information networks or cyber-communities in social networks. Moreover, the resulting meta-network, whose nodes are the communities, may then be used to visualize the original network structure The problem of community detection requires the partition of a network into communities of densely connected nodes, with the nodes belonging to different been proposed to find reasonably good partitions in a reasonably fast way. This search for fast algorithms has attracted much interest in recent years due to the increasing availability of large network datasets and the impact of networks on everyday life. One can distinguish several types of community detection algorithms: divisive algorithms O detect inter-community links and remove them from the network [4]-[61, agglomerative oo algorithms merge similar nodes/communities recursively 7 and optimization methods are based on the maximization of an objective function [8 -[10. The quality of the partitions resulting from these methods is often measured by the so-called modularity of the partition. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities 5, 11. In the case of weighted networks(weighted networks are networks that have weights on their links. such as the number of communications between two mobile phone users), it is defined as [12 n∑4-w0]6(a.9) doi:10.1088/1742-5468/2008/10/P10008

J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks Contents 1. Introduction 2 2. Method 3 3. Application to large networks 6 4. Conclusion and discussion 10 Acknowledgments 11 References 11 1. Introduction Social, technological and information systems can often be described in terms of complex networks that have a topology of interconnected nodes combining organization and randomness [1, 2]. The typical size of large networks such as social network services, mobile phone networks or the web is now counted in millions, if not billions, of nodes and these scales demand new methods to retrieve comprehensive information from their structure. A promising approach consists in decomposing the networks into sub-units or communities, which are sets of highly interconnected nodes [3]. The identification of these communities is of crucial importance as they may help to uncover a priori unknown functional modules such as topics in information networks or cyber-communities in social networks. Moreover, the resulting meta-network, whose nodes are the communities, may then be used to visualize the original network structure. The problem of community detection requires the partition of a network into communities of densely connected nodes, with the nodes belonging to different communities being only sparsely connected. Precise formulations of this optimization problem are known to be computationally intractable. Several algorithms have therefore been proposed to find reasonably good partitions in a reasonably fast way. This search for fast algorithms has attracted much interest in recent years due to the increasing availability of large network datasets and the impact of networks on everyday life. One can distinguish several types of community detection algorithms: divisive algorithms detect inter-community links and remove them from the network [4]–[6], agglomerative algorithms merge similar nodes/communities recursively [7] and optimization methods are based on the maximization of an objective function [8]–[10]. The quality of the partitions resulting from these methods is often measured by the so-called modularity of the partition. The modularity of a partition is a scalar value between −1 and 1 that measures the density of links inside communities as compared to links between communities [5, 11]. In the case of weighted networks (weighted networks are networks that have weights on their links, such as the number of communications between two mobile phone users), it is defined as [12] Q = 1 2m i,j Aij − kikj 2m δ(ci, cj ), (1) doi:10.1088/1742-5468/2008/10/P10008 2

Fast unfolding of communities in large networks where Aii represents the weight of the edge between i and j, ki A is the the weights of the edges attached to vertex i, ci is the community to which vertex i is assigned, the 8 function d(a, u)is 1 if u= v and 0 otherwise and m=52iiAi Modularity has been used to compare the quality of the partitions obtained by different methods, but also as an objective function to optimize [13. Unfortunately, exact modularity optimization is a problem that is computationally hard [14 and approximation algorithms are necessary when dealing with large networks. The fastest approximation algorithm for optimizing modularity on large networks was proposed by Clauset et al [8. That method consists in recurrently merging communities that optimize the production of modularity. Unfortunately, this greedy algorithm may produce values of modularity that are significantly lower than what can be found by using, for instance, simulated annealing [15]. Moreover, the method proposed in [8 has a tendency to produce super-communities that contain a large fraction of the nodes, even on synthetic networks D that have no significant community structure. This artefact also has the disadvantage of slowing down the algorithm considerably and makes it inapplicable to networks of more than a million nodes. This undesired effect has been circumvented by introducing tricks in order to balance the size of the communities being merged, thereby speeding up the running time and making it possible to deal with networks that have a few million 三 nodes 16 The largest networks that have been dealt with so far in the literature are a protein- protein interaction network of 30 739 nodes [17], a network of about 400000 items on sale on the website of a large on-line retailer 8 and a Japanese social networking system of about 5.5 million users [16. These sizes still leave considerable room for improvement [18] considering that, as of today, the social networking service Facebook has about 64 million active users, the mobile network operator Vodafone has about 200 million customers and Google indexes several billion web-pages. Let us also notice that in most large networks such as those listed above there are several natural organization levels-communities divide themselves into sub-communities-and it is thus desirable to obtain community detection methods that reveal this hierarchical structure [19 2. Method We now introduce our algorithm that finds high modularity partitions of large networks in a short time and that unfolds a complete hierarchical community structure for the O network, thereby giving access to different resolutions of community detection. Contrary o to all the other community detection algorithms, the network size limits that we are facing with our algorithm are due to limited storage capacity rather than limited computation time: identifying communities in a 118 million nodes network took only 152 mint Our algorithm is divided into two phases that are repeated iteratively. Assume that we start with a weighted network of N nodes. First, we assign a different community to ach node of the network. So, in this initial partition there are as many communities as there are nodes. Then, for each node i we consider the neighbours j of i and we evaluate the gain of modularity that would take place by removing i from its community and by All methods described here have been compiled and tested on the same machine: a bi-opteron 22k with 24 GB ofmemoryThecodeisfreelyavailablefordownloadontheweb-pagehttp://findcommunities.googlepages doi:10.1088/1742-5468/2008/10/P10008

J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks where Aij represents the weight of the edge between i and j, ki = j Aij is the sum of the weights of the edges attached to vertex i, ci is the community to which vertex i is assigned, the δ function δ(u, v) is 1 if u = v and 0 otherwise and m = 1 2 ij Aij . Modularity has been used to compare the quality of the partitions obtained by different methods, but also as an objective function to optimize [13]. Unfortunately, exact modularity optimization is a problem that is computationally hard [14] and so approximation algorithms are necessary when dealing with large networks. The fastest approximation algorithm for optimizing modularity on large networks was proposed by Clauset et al [8]. That method consists in recurrently merging communities that optimize the production of modularity. Unfortunately, this greedy algorithm may produce values of modularity that are significantly lower than what can be found by using, for instance, simulated annealing [15]. Moreover, the method proposed in [8] has a tendency to produce super-communities that contain a large fraction of the nodes, even on synthetic networks that have no significant community structure. This artefact also has the disadvantage of slowing down the algorithm considerably and makes it inapplicable to networks of more than a million nodes. This undesired effect has been circumvented by introducing tricks in order to balance the size of the communities being merged, thereby speeding up the running time and making it possible to deal with networks that have a few million nodes [16]. The largest networks that have been dealt with so far in the literature are a protein– protein interaction network of 30 739 nodes [17], a network of about 400 000 items on sale on the website of a large on-line retailer [8] and a Japanese social networking system of about 5.5 million users [16]. These sizes still leave considerable room for improvement [18] considering that, as of today, the social networking service Facebook has about 64 million active users, the mobile network operator Vodafone has about 200 million customers and Google indexes several billion web-pages. Let us also notice that in most large networks such as those listed above there are several natural organization levels—communities divide themselves into sub-communities—and it is thus desirable to obtain community detection methods that reveal this hierarchical structure [19]. 2. Method We now introduce our algorithm that finds high modularity partitions of large networks in a short time and that unfolds a complete hierarchical community structure for the network, thereby giving access to different resolutions of community detection. Contrary to all the other community detection algorithms, the network size limits that we are facing with our algorithm are due to limited storage capacity rather than limited computation time: identifying communities in a 118 million nodes network took only 152 min4. Our algorithm is divided into two phases that are repeated iteratively. Assume that we start with a weighted network of N nodes. First, we assign a different community to each node of the network. So, in this initial partition there are as many communities as there are nodes. Then, for each node i we consider the neighbours j of i and we evaluate the gain of modularity that would take place by removing i from its community and by 4 All methods described here have been compiled and tested on the same machine: a bi-opteron 2.2k with 24 GB of memory. The code is freely available for download on the web-page http://findcommunities.googlepages.com. doi:10.1088/1742-5468/2008/10/P10008 3

Fast unfolding of communities in large networks placing it in the community of 3. The node i is then placed in the community for which this gain is maximum(in the case of a tie we use a breaking rule), but only if this gain is positive. If no positive gain is possible, i stays in its original community. This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved and the first phase is then complete. Let us insist on the fact that a node may be and often is, considered several times. This first phase stops when a local maxima of the modularity is attained, i. e. when no individual move can improve the modularity. One should also note that the output of the algorithm depends on the order in which the nodes are considered. Preliminary results on several test cases seem to indicate that the orderin of the nodes does not have a significant influence on the modularity that is obtained, while it may affect the computation time, but the reasons for this dependence are not clear. In particular, taking the nodes in a natural order related to the community structure itself (e.g. the order given by a previous community computation, or the postcode)does not D give clear improvement( see section 3). The problem of choosing an order is thus worth studying since it could give good heuristics to enhance the computation time Part of the algorithm's efficiency results from the fact that the gain in modularity AQ obtained by moving an isolated node i into a community C can easily be computed 三 Q 2m )-[需-()-(篇 where > in is the sum of the weights of the links inside C, tot is the sum of the weights node i, kin is the sum of the weights of the links from i to nodes in C and m is the sum N of the weights of all the links in the network. A similar expression is used in order to evaluate the change of modularity when i is removed from its community. In practice, one therefore evaluates the change of modularity by removing i from its community andO then by moving it into a neighbouring community. The second phase of the algorithm consists in building a new network whose nodes are now the communities found during the first phase. To do so, the weights of the links U between the new nodes are given by the sum of the weight of the links between nodes in the corresponding two communities 20. Links between nodes of the same community lead to self-loops for this community in the new network. Once this second phase is completed, it is then possible to reapply the first phase of the algorithm to the resulting weighted network and to iterate. Let us denote by pass'a combination of these two phases. By construction. the number of meta-communities decreases at each pass, and o as a consequence most of the computing time is used in the first pass. The passes are iterated(see figure 1) until there are no more changes and a maximum of modularity is attained. The algorithm is reminiscent of the self-similar nature of complex networks 21 and naturally incorporates a notion of hierarchy, as communities of communities are built during the process. The height of the hierarchy that is constructed is determined by the number of passes and is generally a small number, as will be shown in some examples below In order to decrease the overall running time of the method it is possible to introduce a threshold and then stop the first phase as soon as the relative gain in modularity does not exceed this threshold. The numerical results reported here have been obtained with this minor modification doi:10.1088/1742-5468/2008/10/P10008

J. Stat. Mech. (2008) P10008 Fast unfolding of communities in large networks placing it in the community of j. The node i is then placed in the community for which this gain is maximum (in the case of a tie we use a breaking rule), but only if this gain is positive. If no positive gain is possible, i stays in its original community. This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved and the first phase is then complete. Let us insist on the fact that a node may be, and often is, considered several times. This first phase stops when a local maxima of the modularity is attained, i.e. when no individual move can improve the modularity5. One should also note that the output of the algorithm depends on the order in which the nodes are considered. Preliminary results on several test cases seem to indicate that the ordering of the nodes does not have a significant influence on the modularity that is obtained, while it may affect the computation time, but the reasons for this dependence are not clear. In particular, taking the nodes in a natural order related to the community structure itself (e.g. the order given by a previous community computation, or the postcode) does not give clear improvement (see section 3). The problem of choosing an order is thus worth studying since it could give good heuristics to enhance the computation time. Part of the algorithm’s efficiency results from the fact that the gain in modularity ΔQ obtained by moving an isolated node i into a community C can easily be computed by ΔQ = in +2ki,in 2m − tot +ki 2m 2 − in 2m − tot 2m 2 − ki 2m 2 , (2) where in is the sum of the weights of the links inside C, tot is the sum of the weights of the links incident to nodes in C, ki is the sum of the weights of the links incident to node i, ki,in is the sum of the weights of the links from i to nodes in C and m is the sum of the weights of all the links in the network. A similar expression is used in order to evaluate the change of modularity when i is removed from its community. In practice, one therefore evaluates the change of modularity by removing i from its community and then by moving it into a neighbouring community. The second phase of the algorithm consists in building a new network whose nodes are now the communities found during the first phase. To do so, the weights of the links between the new nodes are given by the sum of the weight of the links between nodes in the corresponding two communities [20]. Links between nodes of the same community lead to self-loops for this community in the new network. Once this second phase is completed, it is then possible to reapply the first phase of the algorithm to the resulting weighted network and to iterate. Let us denote by ‘pass’ a combination of these two phases. By construction, the number of meta-communities decreases at each pass, and as a consequence most of the computing time is used in the first pass. The passes are iterated (see figure 1) until there are no more changes and a maximum of modularity is attained. The algorithm is reminiscent of the self-similar nature of complex networks [21] and naturally incorporates a notion of hierarchy, as communities of communities are built during the process. The height of the hierarchy that is constructed is determined by the number of passes and is generally a small number, as will be shown in some examples below. 5 In order to decrease the overall running time of the method it is possible to introduce a threshold and then stop the first phase as soon as the relative gain in modularity does not exceed this threshold. The numerical results reported here have been obtained with this minor modification. doi:10.1088/1742-5468/2008/10/P10008 4