Fast Construction of Overlay Networks Dana Angluin James Aspnes Jiang Chen Dept.Comp.Sci Dept.Comp.Sci Dept.Comp.Sci Yale University Yale University Yale University New Haven,CT 06520 New Haven,CT 06520 New Haven,CT 06520 dana.angluin@yale.edu aspnes@cs.yale.edu jiang.chen@yale.edu Yinghua Wu Yitong Yin Dept.Comp.Sci. Dept.Comp.Sci. Yale University Yale University New Haven,CT 06520 New Haven,CT 06520 y.wu@yale.edu yitong.yin@yale.edu ABSTRACT General Terms An asynchronous algorithm is described for rapidly con- Algorithms Theory structing an overlay network in a peer-to-peer system where all nodes can in principle communicate with each other di- rectly through an underlying network,but each participat- Keywords ing node initially has pointers to only a handful of other Overlay networks,asynchronous,merging,Patricia trees participants.The output of the mechanism is a linked list of all participants sorted by their identifiers,which can be used as a foundation for building various linear overlay net- 1 INTRODUCTION works such as Chord or skip graphs.Assuming the initial Consider the problem of rapidly building a peer-to-peer pointer graph is weakly-connected with maximum degree d system with a ring or line structure such as Chord [19]or and the length of a node identifier is W,the mechanism skip graphs 3.The default assumption in both systems ap- constructs a binary search tree of nodes of depth O(W)in pears to be that nodes will be inserted sequentially,giving expected O(W log n)time using expected O((d+W)nlogn) a construction time of O(nlog-n)for Chord and O(nlog n) messages of size O(W)each.Furthermore,the algorithm for skip graphs.But how quickly can we build such a sys has low contention:at any time there are only O(d)unde- tem if we do so in parallel,assuming that initially each node livered messages for any given recipient.A lower bound of only knows about a few other nodes in the system?At min- (d+log n)is given for the running time of any procedure imum,we need to be able to construct the bottom ring of in a related synchronous model that vields a sorted list from the system,which consists of all of the nodes sorted by their a degree-d weakly-connected graph of n nodes.We conjec- identifiers(randomly chosen in Chord,based on keys in skip ture that this lower bound is tight and could be attained by graphs).Constructing such a system thus depends on be- further improvements to our algorithms. ing able to sort nodes quickly;having done so,rebuilding the rest of the system takes little additional time.If build- Categories and Subject Descriptors ing a peer-to-peer system from scratch can be done quickly enough,the payoff is high:we can instantly deploy peer- C.2.4 [Computer-Communication Networks]:Distributed to-peer networks as needed as a tool in more complex dis- Systems-Distributed applications;F.2.2 Analysis of Al- tributed algorithms,and we can drop the cumbersome repair gorithms and Problem Complexity]:Nonnumerical Al- mechanisms found in many practical structured peer-to-peer gorithms and Problems systems in favor of a policy of periodic mass destruction and renewal. Supported in part by NSF grants CCR-0098078,CNS- 0305258.and CNS-0435201. In our model,we assume that any node can communi- TSupported by NSF grants CCR-0098078 and CNS-0305258. cate with any other node once it knows the other's address, and that initially,the nodes are organized into a weakly- +Supported by NSF grants CCR-0098078 and CNS-0305258. connected knowledge graph of bounded degree d,where a (directed)edge from u to v means that u knows u's ad- dress.Our algorithm proceeds by reorganizing this weakly- connected graph as a low-depth binary search tree where Permission to make digital or hard copies of all or part of this work for each node supplies both a leaf and at most one internal personal or classroom use is granted without fee provided that copies are node;the sorted list can then be read off of the leaves.Our not made or distributed for profit or commercial advantage and that copies algorithm has low contention:each node receives at most bear this notice and the full citation on the first page.To copy otherwise,to O(d)messages in a single round (in the synchronous version republish,to post on servers or to redistribute to lists,requires prior specific permission and/or a fee. of the algorithm)or has at most O(d)pending undelivered SPA4'05.July 18-20,2005,Las Vegas,Nevada,USA. messages at any time (in the asynchronous version).It also Copyright2005ACM1-58113-986-1/05/0007..$5.00 uses only short messages,of size proportional to the node
Fast Construction of Overlay Networks Dana Angluin Dept. Comp. Sci. Yale University New Haven, CT 06520 dana.angluin@yale.edu James Aspnes ∗ Dept. Comp. Sci. Yale University New Haven, CT 06520 aspnes@cs.yale.edu Jiang Chen Dept. Comp. Sci. Yale University New Haven, CT 06520 jiang.chen@yale.edu Yinghua Wu † Dept. Comp. Sci. Yale University New Haven, CT 06520 y.wu@yale.edu Yitong Yin ‡ Dept. Comp. Sci. Yale University New Haven, CT 06520 yitong.yin@yale.edu ABSTRACT An asynchronous algorithm is described for rapidly constructing an overlay network in a peer-to-peer system where all nodes can in principle communicate with each other directly through an underlying network, but each participating node initially has pointers to only a handful of other participants. The output of the mechanism is a linked list of all participants sorted by their identifiers, which can be used as a foundation for building various linear overlay networks such as Chord or skip graphs. Assuming the initial pointer graph is weakly-connected with maximum degree d and the length of a node identifier is W, the mechanism constructs a binary search tree of nodes of depth O(W) in expected O(W log n) time using expected O((d+W)n log n) messages of size O(W) each. Furthermore, the algorithm has low contention: at any time there are only O(d) undelivered messages for any given recipient. A lower bound of Ω(d + log n) is given for the running time of any procedure in a related synchronous model that yields a sorted list from a degree-d weakly-connected graph of n nodes. We conjecture that this lower bound is tight and could be attained by further improvements to our algorithms. Categories and Subject Descriptors C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems ∗ Supported in part by NSF grants CCR-0098078, CNS- 0305258, and CNS-0435201. † Supported by NSF grants CCR-0098078 and CNS-0305258. ‡ Supported by NSF grants CCR-0098078 and CNS-0305258. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA’05, July 18–20, 2005, Las Vegas, Nevada, USA. Copyright 2005 ACM 1-58113-986-1/05/0007 ...$5.00. General Terms Algorithms Theory Keywords Overlay networks, asynchronous, merging, Patricia trees 1. INTRODUCTION Consider the problem of rapidly building a peer-to-peer system with a ring or line structure such as Chord [19] or skip graphs [3]. The default assumption in both systems appears to be that nodes will be inserted sequentially, giving a construction time of O(n log2 n) for Chord and O(n log n) for skip graphs. But how quickly can we build such a system if we do so in parallel, assuming that initially each node only knows about a few other nodes in the system? At minimum, we need to be able to construct the bottom ring of the system, which consists of all of the nodes sorted by their identifiers (randomly chosen in Chord, based on keys in skip graphs). Constructing such a system thus depends on being able to sort nodes quickly; having done so, rebuilding the rest of the system takes little additional time. If building a peer-to-peer system from scratch can be done quickly enough, the payoff is high: we can instantly deploy peerto-peer networks as needed as a tool in more complex distributed algorithms, and we can drop the cumbersome repair mechanisms found in many practical structured peer-to-peer systems in favor of a policy of periodic mass destruction and renewal. In our model, we assume that any node can communicate with any other node once it knows the other’s address, and that initially, the nodes are organized into a weaklyconnected knowledge graph of bounded degree d, where a (directed) edge from u to v means that u knows v’s address. Our algorithm proceeds by reorganizing this weaklyconnected graph as a low-depth binary search tree where each node supplies both a leaf and at most one internal node; the sorted list can then be read off of the leaves. Our algorithm has low contention: each node receives at most O(d) messages in a single round (in the synchronous version of the algorithm) or has at most O(d) pending undelivered messages at any time (in the asynchronous version). It also uses only short messages, of size proportional to the node
identities,and requires storing only O(d)identities per node leaves of the tree to communicate with other supern- Our algorithm is constructed from three components: odes,with internal nodes aggregating the incoming data for delivery to the root.This increases the tree's 1.A randomized pairing algorithm that,starting from effective bandwidth proportional to the size of the tree a degree-d weakly-connected graph,pairs off a con- while keeping the contention down to the minimum stant fraction of the nodes on average in O(1)time. O(d)necessitated by the degree of the original knowl- This algorithm is described in Section 3.1.The prob- edge graph.The cost of this strategy is that incoming lem solved is similar to the problem of constructing a messages are effectively delayed by the depth w of distributed matching 11,20,except that there is no the Patricia tree,adding a factor of W slowdown to requirement that paired nodes be joined by an edge in the algorithm.Using the entire tree also means that the original knowledge graph.A complication is that the simulated supernode must wait for a merge to com- since the original knowledge graph is directed,at any plete before starting a new iteration of the pairing al- time a node may learn of the existence of new neigh- gorithm,which prevents its from taking advantage of bors,and care needs to be taken to prevent deadlocks. pipelined merges.We discuss some ideas for how such The output of the pairing algorithm is used to join in- bottlenecks might be avoided in Section 5. dividual nodes into simulated supernodes that then participate in subsequent iterations of the pairing algo- We also consider how to use the tree to build a ring (Sec rithm.These supernodes are in turn joined into larger tion 3.4)and the effects of churn (Section 3.5). supernodes,until only a single supernode (consisting In an asynchronous model,the total time for the expected of all the nodes in the system)remains,after an ex- O(log n)iterations of pairing multiplied by the O(W)merg- pected O(log n)iterations.For this construction to ing and communications costs of each iteration is O(W log n) work,we need two additional mechanisms In a synchronous model,this can be improved by using the simple pairing algorithm described above to construct a bal- 2.A distributed merging algorithm for combining bal- anced tree of depth O(log n)in O((d+log n)log n)time (the anced trees of nodes.In a synchronous model,this d factor vanishes if nodes are allowed to send more than one algorithm can be very simple:because all the trees outgoing message per time unit),and the nodes can then be constructed after k rounds will have depth O(k),it is sorted using pipelined Patricia tree merges in an additional enough to recruit a new root node to join two trees O(W +logn)time for a total of O(W+(d+logn)logn) together into a tree of depth O(k+1).In an asyn- time.All of these algorithms have contention at most O(d). chronous model,an adversarial scheduler can arrange use messages of size O(W),and store O(dW)bits of state for particularly fast nodes to merge early followed by a per node. slower succession of singleton nodes,leading to a tree These limits compare favorably to previously known algo- of depth (n)using the simple algorithm.Instead. rithms in this model for resource discovery 2,10,13,14,16 we describe an algorithm obtained by parallelizing the or leader election [4,which also construct trees over nodes sequential Patricia tree merge procedure of Okasaki that initially form a weakly-connected graph.In these al- and Gill [18]:this algorithm,described in Section 3.2 gorithms,a single participant may receive messages from a assigns a single leaf node and at most one internal very large number of other participants in a short amount of node of the Patricia tree to each physical node in the time;messages are often impractically long,conveying in the system,and merges two Patricia trees of depth W in worst case the identities of every node in the system;and the time O(W).where W is the length of a node identity resulting trees have very high degree,which not only leads to Though we do not use this fact in our main result,our high contention in any algorithm that uses them but limits merging algorithm can be pipelined:a depth-k tree of performance if nodes are also limited in how many messages up to 2merge operations can be carried out in parallel they can send per time unit.We discuss these results further in O(k+W)time with O(1)contention. in Section 1.1. Note that because Patricia trees are a form of binary Though our algorithm is reasonably efficient,we do not search tree,a consequence of using Patricia trees to believe that it is optimal.The best lower bound we know represent supernodes is that the leaves are automati- how to prove for constructing a sorted list of nodes start- cally sorted.We use this fact to generate the sorted ing from a weakly-connected graph with maximum degree d ring of physical nodes that our main result promises, in the synchronous model is (d+log n);here the d term but the ability to rapidly generate a binary search tree comes from the assumption that a node can only send to one with low contention starting from a weakly-connected recipient per round,and the logn term comes from the time knowledge graph may also be useful for other applica- to communicate from one end of a length-n line to the other tions. using pointer jumping (see Section 4 for details).Our sus- picion is that this lower bound is in close to the true upper 3.A supernode simulation that allows trees of ordi- bound,and that an algorithm that interleaved pairing and nary physical nodes to simulate a single supernode in merging operations more tightly could achieve something the pairing algorithm (Section 3.3).Though the es- very close to it with high probability. sential idea of this simulation is simple-have the root Due to space limitations,most proofs in this extended of each tree simulate the supernode-some care needs abstract are omitted or only briefly sketched. to be taken to keep the root from being swamped with information.The actual simulation algorithm uses the 11 Related Work IIn a synchronous model,time is measured in rounds.In an The problem of organizing a weakly-connected set of nodes asynchronous model,time is measured by assuming a con- into a useful data structure combines aspects of both sorting stant maximum message delay,and assigning all events the and resource discovery.We discuss the extensive prior lit- latest possible time consistent with this assumption.Details erature on resource discovery first,and then consider some are given in Section 2. other algorithms that solve problems closer to ours
identities, and requires storing only O(d) identities per node. Our algorithm is constructed from three components: 1. A randomized pairing algorithm that, starting from a degree-d weakly-connected graph, pairs off a constant fraction of the nodes on average in O(1) time.1 This algorithm is described in Section 3.1. The problem solved is similar to the problem of constructing a distributed matching [11, 20], except that there is no requirement that paired nodes be joined by an edge in the original knowledge graph. A complication is that since the original knowledge graph is directed, at any time a node may learn of the existence of new neighbors, and care needs to be taken to prevent deadlocks. The output of the pairing algorithm is used to join individual nodes into simulated supernodes that then participate in subsequent iterations of the pairing algorithm. These supernodes are in turn joined into larger supernodes, until only a single supernode (consisting of all the nodes in the system) remains, after an expected O(log n) iterations. For this construction to work, we need two additional mechanisms. 2. A distributed merging algorithm for combining balanced trees of nodes. In a synchronous model, this algorithm can be very simple: because all the trees constructed after k rounds will have depth O(k), it is enough to recruit a new root node to join two trees together into a tree of depth O(k + 1). In an asynchronous model, an adversarial scheduler can arrange for particularly fast nodes to merge early followed by a slower succession of singleton nodes, leading to a tree of depth Θ(n) using the simple algorithm. Instead, we describe an algorithm obtained by parallelizing the sequential Patricia tree merge procedure of Okasaki and Gill [18]; this algorithm, described in Section 3.2, assigns a single leaf node and at most one internal node of the Patricia tree to each physical node in the system, and merges two Patricia trees of depth W in time O(W), where W is the length of a node identity. Though we do not use this fact in our main result, our merging algorithm can be pipelined: a depth-k tree of up to 2k merge operations can be carried out in parallel in O(k + W) time with O(1) contention. Note that because Patricia trees are a form of binary search tree, a consequence of using Patricia trees to represent supernodes is that the leaves are automatically sorted. We use this fact to generate the sorted ring of physical nodes that our main result promises, but the ability to rapidly generate a binary search tree with low contention starting from a weakly-connected knowledge graph may also be useful for other applications. 3. A supernode simulation that allows trees of ordinary physical nodes to simulate a single supernode in the pairing algorithm (Section 3.3). Though the essential idea of this simulation is simple—have the root of each tree simulate the supernode—some care needs to be taken to keep the root from being swamped with information. The actual simulation algorithm uses the 1 In a synchronous model, time is measured in rounds. In an asynchronous model, time is measured by assuming a constant maximum message delay, and assigning all events the latest possible time consistent with this assumption. Details are given in Section 2. leaves of the tree to communicate with other supernodes, with internal nodes aggregating the incoming data for delivery to the root. This increases the tree’s effective bandwidth proportional to the size of the tree while keeping the contention down to the minimum O(d) necessitated by the degree of the original knowledge graph. The cost of this strategy is that incoming messages are effectively delayed by the depth W of the Patricia tree, adding a factor of W slowdown to the algorithm. Using the entire tree also means that the simulated supernode must wait for a merge to complete before starting a new iteration of the pairing algorithm, which prevents its from taking advantage of pipelined merges. We discuss some ideas for how such bottlenecks might be avoided in Section 5. We also consider how to use the tree to build a ring (Section 3.4) and the effects of churn (Section 3.5). In an asynchronous model, the total time for the expected O(log n) iterations of pairing multiplied by the O(W) merging and communications costs of each iteration is O(W log n). In a synchronous model, this can be improved by using the simple pairing algorithm described above to construct a balanced tree of depth O(log n) in O((d+log n) log n) time (the d factor vanishes if nodes are allowed to send more than one outgoing message per time unit), and the nodes can then be sorted using pipelined Patricia tree merges in an additional O(W + log n) time for a total of O(W + (d + log n) log n) time. All of these algorithms have contention at most O(d), use messages of size O(W), and store O(dW) bits of state per node. These limits compare favorably to previously known algorithms in this model for resource discovery [2,10,13,14,16] or leader election [4], which also construct trees over nodes that initially form a weakly-connected graph. In these algorithms, a single participant may receive messages from a very large number of other participants in a short amount of time; messages are often impractically long, conveying in the worst case the identities of every node in the system; and the resulting trees have very high degree, which not only leads to high contention in any algorithm that uses them but limits performance if nodes are also limited in how many messages they can send per time unit. We discuss these results further in Section 1.1. Though our algorithm is reasonably efficient, we do not believe that it is optimal. The best lower bound we know how to prove for constructing a sorted list of nodes starting from a weakly-connected graph with maximum degree d in the synchronous model is Ω(d + log n); here the d term comes from the assumption that a node can only send to one recipient per round, and the log n term comes from the time to communicate from one end of a length-n line to the other using pointer jumping (see Section 4 for details). Our suspicion is that this lower bound is in close to the true upper bound, and that an algorithm that interleaved pairing and merging operations more tightly could achieve something very close to it with high probability. Due to space limitations, most proofs in this extended abstract are omitted or only briefly sketched. 1.1 Related Work The problem of organizing a weakly-connected set of nodes into a useful data structure combines aspects of both sorting and resource discovery. We discuss the extensive prior literature on resource discovery first, and then consider some other algorithms that solve problems closer to ours
1.1.1 The Resource Discovery Problem eral links,analogous to the addition of edges in a knowledge Harchol-Balter,Leighton,and Lewin 10 introduced the graph.Assuming n nodes in an initial knowledge graph that Resource Discovery Problem to model the situation in which is connected and undirected,they give a deterministic algo all the processes in an initial weakly connected knowledge rithm for leader election in O(n)messages and time O(n). graph learn the identities of all the other processes,prelim- Upon termination,the corresponding knowledge graph con- inary to cooperating in a distributed computation 10.In tains an edge from the leader to every process in the network. terms of the knowledge graph,the goal is to construct the and a path of length at most O(log n)from each non-leader complete graph from an arbitrary weakly connected graph to the leader,solving the Ad-hoc Resource Discovery Prob- Subsequently,the problem definition was relaxed to require lem as defined by Abraham and Dolev [2].The time bound that one process become the leader and learn the identities reflects the fact that there may be long chains of merges that of all the other processes,which each learn the identity of happen sequentially. the leader [14].This implies that the final knowledge graph contains a star(with bi-directional edges)on all the vertices 1.1.3 Parallel Sorting The problem has been considered in both synchronous and Algorithms for sorting on parallel machines have been asynchronous models of computation.In the synchronous studied extensively.The closest such algorithm to our work model,in each round each node may contact one or more of is that of Goodrich and Kosaraju 9 for a parallel pointer its neighbors in the current knowledge graph and exchange machine,which also proceeds by building a binary tree over some subset of its neighbor list.Harchol-Balter.Leighton nodes and then merging components according to the tree. and Lewin give a simple randomized algorithm to solve the Their algorithm makes use of a clever parallel linked-list complete-graph version of the problem with high probability merge operation that allows consecutive merging phases to in O(log n)rounds,O(nlogn)messages,and O(n log"n) be pipelined,giving an O(log n)total time.We believe that bit complexity.Law and Siu give another randomized al- essentially the same merging algorithm could be adapted to gorithm that improves each of these bounds by a factor of our model.although improvements in other parts of our al- log n 16 gorithm would be necessary for this change to improve our Kutten,Peleg,and Vishkin give a deterministic algorithm overall running time. to solve the star-graph version of the problem in O(log n) 1.1.4 Tree structures in previous work rounds,O(nlog n)messages,and O(Eo log-n)bit com- plexity,where Eo is the set of edges of the initial knowledge A distributed trie is used as a search structure for P2P sys- graph [14].A single additional round in which the leader tems in several previous papers.Karl Aberer [1]designed sends the whole list to all processes achieves a complete a scalable access structure named P-Grid based on a multi- graph,and adds O(n)to message complexity and O(n2log n) level trie with each node representing one specific interval to bit complexity. of the key space.And at each level every peer is associ- In the asynchronous model of computation,all messages ated with exactly one tree node and maintains references to sent will eventually arrive after a finite but unbounded time, cover the other side of the search tree so that a search can be and messages from u to v are delivered to v in the order in started at any peer.The convergence of P-Grid construc- which u sent them.Kutten and Peleg give a determinis- tion strongly depends on the assumption that peers meet tic algorithm to solve the star-graph version of the prob- frequently and randomly pairwise which is not so achievable lem in O(nlog n)messages and O(Eol log2n)bit complex- in application.Although the paper provides some simula- ity 13.Abraham and Dolev also consider asynchronous tion results,it doesn't give further proof.Michael J.Freed- computation,and generalize the problem to finding a leader man and Radek Vingralek [7]presented a similar approach in each weakly connected component of the initial knowl- while taking advantage of information piggybacking during edge graph 2.They show that knowledge of n,the size lookups to achieve dynamic partitioning and lazy updates. of a node's component,affects the achievable bounds.In However,the performance of the algorithm depends on the particular,when n is unknown,they give a lower bound of lookup pattern and the paper also lacks proof.Others have (n logn)on message complexity,and a deterministic algo- proposed using B-tree variants to index small numbers of rithm with message complexity O(n logn)and bit complex- nodes (hundreds)in distributed databases12.21. ity O(Eol logn nlog2n).When n is known,they give a deterministic algorithm with O(na(n))message complexity 2. MODEL and O(Eollogn +nlog'n)bit complexity,where a(n)is We assume that in the initial state each process knows the inverse Ackermann function.They also define the Ad- the identifiers of some number of other processes.This in- hoc Resource Discovery Problem,in which each non-leader formation forms a knowledge graph,a directed graph with must only have an identified path to its leader,rather than one vertex per process and an edge from u to v whenever a direct edge.For this problem,they show a lower bound u knows about v.The knowledge graph may evolve over on message complexity of (na(n)),and an algorithm that time as processes tell each other about other processes;if achieves message complexity O(na(n))and bit complexity u knows about both v and w,it may send a message to v O(Eol logn +n log2n).This algorithm also deals efficiently containing w's identifier,and when v receives this message with dynamic additions of nodes and links to the network. the edge vw is added to the graph.We assume throughout It is worth mentioning that our trees solve the Ad-hoc Re- that a process u can only send a message to another process source Discovery Problem,which means the Abraham and v if uv is present in the graph when the message is sent.We Dolev (na(n))lower bound for messages applies. assume that the initial knowledge graph Go is weakly con- nected and has maximum degree d,where the degree d(u) 1.1.2 Leader Election of a vertex u is defined to be the sum of its indegree d(u) Cidon.Gopal and Kutten 4 introduce a detailed and and its outdegree d(u),which are the number of incoming technologically motivated model in which processes in a net- and outgoing edges for u,respectively. work may use newly learned routes to send messages at Processes communicate by passing messages along edges cost O(1),although the physical route may consist of sev- of the knowledge graph.Formally,we assume that messages
1.1.1 The Resource Discovery Problem Harchol-Balter, Leighton, and Lewin [10] introduced the Resource Discovery Problem to model the situation in which all the processes in an initial weakly connected knowledge graph learn the identities of all the other processes, preliminary to cooperating in a distributed computation [10]. In terms of the knowledge graph, the goal is to construct the complete graph from an arbitrary weakly connected graph. Subsequently, the problem definition was relaxed to require that one process become the leader and learn the identities of all the other processes, which each learn the identity of the leader [14]. This implies that the final knowledge graph contains a star (with bi-directional edges) on all the vertices. The problem has been considered in both synchronous and asynchronous models of computation. In the synchronous model, in each round each node may contact one or more of its neighbors in the current knowledge graph and exchange some subset of its neighbor list. Harchol-Balter, Leighton and Lewin give a simple randomized algorithm to solve the complete-graph version of the problem with high probability in O(log2 n) rounds, O(n log2 n) messages, and O(n 2 log3 n) bit complexity. Law and Siu give another randomized algorithm that improves each of these bounds by a factor of log n [16]. Kutten, Peleg, and Vishkin give a deterministic algorithm to solve the star-graph version of the problem in O(log n) rounds, O(n log n) messages, and O(|E0| log2 n) bit complexity, where E0 is the set of edges of the initial knowledge graph [14]. A single additional round in which the leader sends the whole list to all processes achieves a complete graph, and adds O(n) to message complexity and O(n 2 log n) to bit complexity. In the asynchronous model of computation, all messages sent will eventually arrive after a finite but unbounded time, and messages from u to v are delivered to v in the order in which u sent them. Kutten and Peleg give a deterministic algorithm to solve the star-graph version of the problem in O(n log n) messages and O(|E0| log2 n) bit complexity [13]. Abraham and Dolev also consider asynchronous computation, and generalize the problem to finding a leader in each weakly connected component of the initial knowledge graph [2]. They show that knowledge of n, the size of a node’s component, affects the achievable bounds. In particular, when n is unknown, they give a lower bound of Ω(n log n) on message complexity, and a deterministic algorithm with message complexity O(n log n) and bit complexity O(|E0| log n + n log2 n). When n is known, they give a deterministic algorithm with O(nα(n)) message complexity and O(|E0| log n + n log2 n) bit complexity, where α(n) is the inverse Ackermann function. They also define the Adhoc Resource Discovery Problem, in which each non-leader must only have an identified path to its leader, rather than a direct edge. For this problem, they show a lower bound on message complexity of Ω(nα(n)), and an algorithm that achieves message complexity O(nα(n)) and bit complexity O(|E0| log n+n log2 n). This algorithm also deals efficiently with dynamic additions of nodes and links to the network. It is worth mentioning that our trees solve the Ad-hoc Resource Discovery Problem, which means the Abraham and Dolev Ω(nα(n)) lower bound for messages applies. 1.1.2 Leader Election Cidon, Gopal and Kutten [4] introduce a detailed and technologically motivated model in which processes in a network may use newly learned routes to send messages at cost O(1), although the physical route may consist of several links, analogous to the addition of edges in a knowledge graph. Assuming n nodes in an initial knowledge graph that is connected and undirected, they give a deterministic algorithm for leader election in O(n) messages and time O(n). Upon termination, the corresponding knowledge graph contains an edge from the leader to every process in the network, and a path of length at most O(log n) from each non-leader to the leader, solving the Ad-hoc Resource Discovery Problem as defined by Abraham and Dolev [2]. The time bound reflects the fact that there may be long chains of merges that happen sequentially. 1.1.3 Parallel Sorting Algorithms for sorting on parallel machines have been studied extensively. The closest such algorithm to our work is that of Goodrich and Kosaraju [9] for a parallel pointer machine, which also proceeds by building a binary tree over nodes and then merging components according to the tree. Their algorithm makes use of a clever parallel linked-list merge operation that allows consecutive merging phases to be pipelined, giving an O(log n) total time. We believe that essentially the same merging algorithm could be adapted to our model, although improvements in other parts of our algorithm would be necessary for this change to improve our overall running time. 1.1.4 Tree structures in previous work A distributed trie is used as a search structure for P2P systems in several previous papers. Karl Aberer [1] designed a scalable access structure named P-Grid based on a multilevel trie with each node representing one specific interval of the key space. And at each level every peer is associated with exactly one tree node and maintains references to cover the other side of the search tree so that a search can be started at any peer. The convergence of P-Grid construction strongly depends on the assumption that peers meet frequently and randomly pairwise which is not so achievable in application. Although the paper provides some simulation results, it doesn’t give further proof. Michael J. Freedman and Radek Vingralek [7] presented a similar approach while taking advantage of information piggybacking during lookups to achieve dynamic partitioning and lazy updates. However, the performance of the algorithm depends on the lookup pattern and the paper also lacks proof. Others have proposed using B-tree variants to index small numbers of nodes (hundreds) in distributed databases [12, 21]. 2. MODEL We assume that in the initial state each process knows the identifiers of some number of other processes. This information forms a knowledge graph, a directed graph with one vertex per process and an edge from u to v whenever u knows about v. The knowledge graph may evolve over time as processes tell each other about other processes; if u knows about both v and w, it may send a message to v containing w’s identifier, and when v receives this message the edge vw is added to the graph. We assume throughout that a process u can only send a message to another process v if uv is present in the graph when the message is sent. We assume that the initial knowledge graph G0 is weakly connected and has maximum degree d, where the degree d(u) of a vertex u is defined to be the sum of its indegree d −(u) and its outdegree d +(u), which are the number of incoming and outgoing edges for u, respectively. Processes communicate by passing messages along edges of the knowledge graph. Formally, we assume that messages
are of the form (s,t,)or (s,t,o,U),where s is the sender, In Section 3.2,we show that Patricia trees 17,using a t is the receiver,o is a message type,and U (if present)is a parallelized version of the Patricia tree merge procedure of vector of O(1)process identifiers.The state of each process Okasaki and Gill 18,are a good choice for the balanced tree consists of a state variable g together with a set of successors data structure.Using Patricia trees,we obtain a sorted final S.Upon receiving a set M of one or more messages,a data structure in time O(W log n)(or O((d+W)log n)in the process s adds all process identifiers that appear in M to lower-bandwidth synchronous model).with O(d)contention S,and then executes a probabilistic transition function 6 and O((d+W)nlog n)messages. mapping (g,S,M)to (g',m),where g'is a new state and Finally,we briefly discuss constructing a ring (Section 3.4) m is either L (indicating no message sent)or a message and the effects of node departures and arrivals (Section 3.5). (s,t,o,u)where t is in S and u is in SU{).When and how this message is delivered depends on whether we are 3.1 Pairing in a synchronous or asynchronous model:we discuss both The pairing problem has some similarities to the problem variants below. of finding a matching,but because we are not restricted in In the synchronous model,the computation proceeds in which nodes we pair off-except for the limits imposed by rounds,and all messages sent to a process s in round i are communication along edges of the knowledge graph-our al- delivered simultaneously in round i+1.In other words gorithm can perform an initial pruning step that pairs off we assume the standard synchronous message-passing model many of the nodes deterministically.leaving only a degree- with the added restrictions that processes can only commu- 2 surviving subgraph.We then run a simple randomized nicate with "known"processes and can only send one mes- sage per round.This yields a model essentially identical to matching algorithm on this subgraph using a coin-flipping technique similar to that of Law and Siu16 to resolve con- the one used in the resource discovery literature,except that flicts. we have added a limitation on the number of identifiers that From a very high-level perspective,the algorithm proceeds can be sent in a single message.We are also interested in as follows.Start with an directed graph G with maximum minimizing contention.which we take to be the maximum degree d.Each node starts by sending a probe to all of its number of messages received by any single process in any successors.The recipient of such a probe responds by ac- single round of the computation cepting the first one and rejecting subsequent probes;in In the asynchronous model,messages arrive one at a time this way every node has at most one designated predecessor after a delay that may vary from message to message and producing a graph of designated predecessors Gi in which that is controlled by an adversary scheduler.It is assumed, every node has in-degree at most 1.This graph is further however,that no message is delayed by more than one time pruned by having each node with two or more successors unit and that messages between any two nodes are delivered pair them off,leaving a graph G2 in which every node has in FIFO order.Processing time is treated as zero. both in-degree and out-degree at most 1.Each component Defining contention for an asynchronous model can be in such a graph is either a line or a cycle,and a constant tricky,as the adversary could choose to deliver many mes- fraction of the nodes can be matched along edges by simply sages in a short period of time;we adopt a simple measure having each node that is not at the end of a line flip a coin of contention equal to the maximum number of distinct mes- to decide whether to pair with its remaining predecessor or sages with the same recipient that are in transit at any point successor,and pairing those adjacent nodes whose choices in time.-Assuming that each process sends at most one are consistent.A simple calculation shows that on average message to each neighbor in the knowledge graph before re- half of the nodes in G2(and all nodes in G-G2)are paired ceiving a reply,the contention is trivially bounded by the by this procedure,from which we can deduce that about half degree of the knowledge graph in both the synchronous and of the nodes are paired on average in the worst case where asynchronous models. G-G2 is empty. The algorithm sketched above can be implemented di- 3.ALGORITHMS rectly in a synchronous system where all nodes start simul- taneously,because after the initial probing phase there is This section contains our main results,a family of algo- no confusion about the structure of the graph,and after rithms for quickly constructing tree-structured overlay net- a phase consisting of a known number of rounds any un- works starting with a weakly-connected communication graph. matched nodes can simply restart the protocol along with We begin by describing (in Section 3.1)a randomized dis- any supernodes resulting from merges in the previous phase. tributed algorithm for pairing nodes:this produces a match- But in an asynchronous setting the situation is more com- ing on the set of nodes that includes a constant fraction of plicated.While some of the early pruning steps can still be the nodes on average,in time O(d),with O(d)contention used (in particular,we still have each node accept and re- and O(n)messages,each of size at most O(W),where W is spond to a single designated predecessor),the final matching the maximum identifier size.Paired nodes are then joined stages require more care. together into simulated combined nodes that are internally There are two main problems.The first is that no node organized as balanced trees (see Section 3.3).The partici- can detect when a phase of the pairing protocol has fin- pants in each combined node are carefully deployed so that ished,so that an unmatched node cannot detect the end of the pairing and joining algorithms in later rounds still pro- a pairing phase and restart the protocol.Instead,the best duce only O(d)contention;however,communication within an unmatched node can hope for is that the faithless suitor each subtree adds an factor to the cost of communication in who spurned it initially will return to accept its advances af- the pairing algorithm that depends on the depth of the tree. ter it finishes digesting luckier candidates.But this creates 2 An alternative assumption is that each process is only guar- the possibility of creating very long chains of nodes,each anteed to accept at least one message per time unit,with waiting for the next to finish a merge that is itself delayed other messages waiting in a delivery queue.This yields sim- by waiting for nodes further down the chain. ilar time bounds.except that the running time must be mul- This problem is compounded by the fact that a node that tiplied by the contention to account for queuing delays. has not yet received a probe cannot tell whether it has no
are of the form (s, t, σ) or (s, t, σ, U), where s is the sender, t is the receiver, σ is a message type, and U (if present) is a vector of O(1) process identifiers. The state of each process consists of a state variable q together with a set of successors S. Upon receiving a set M of one or more messages, a process s adds all process identifiers that appear in M to S, and then executes a probabilistic transition function δ mapping (q, S, M) to (q ′ , m), where q ′ is a new state and m is either ⊥ (indicating no message sent) or a message (s, t, σ, u) where t is in S and u is in S ∪ {⊥}. When and how this message is delivered depends on whether we are in a synchronous or asynchronous model; we discuss both variants below. In the synchronous model, the computation proceeds in rounds, and all messages sent to a process s in round i are delivered simultaneously in round i + 1. In other words, we assume the standard synchronous message-passing model with the added restrictions that processes can only communicate with “known” processes and can only send one message per round. This yields a model essentially identical to the one used in the resource discovery literature, except that we have added a limitation on the number of identifiers that can be sent in a single message. We are also interested in minimizing contention, which we take to be the maximum number of messages received by any single process in any single round of the computation. In the asynchronous model, messages arrive one at a time after a delay that may vary from message to message and that is controlled by an adversary scheduler. It is assumed, however, that no message is delayed by more than one time unit and that messages between any two nodes are delivered in FIFO order. Processing time is treated as zero. Defining contention for an asynchronous model can be tricky, as the adversary could choose to deliver many messages in a short period of time; we adopt a simple measure of contention equal to the maximum number of distinct messages with the same recipient that are in transit at any point in time.2 Assuming that each process sends at most one message to each neighbor in the knowledge graph before receiving a reply, the contention is trivially bounded by the degree of the knowledge graph in both the synchronous and asynchronous models. 3. ALGORITHMS This section contains our main results, a family of algorithms for quickly constructing tree-structured overlay networks starting with a weakly-connected communication graph. We begin by describing (in Section 3.1) a randomized distributed algorithm for pairing nodes; this produces a matching on the set of nodes that includes a constant fraction of the nodes on average, in time O(d), with O(d) contention and O(n) messages, each of size at most O(W), where W is the maximum identifier size. Paired nodes are then joined together into simulated combined nodes that are internally organized as balanced trees (see Section 3.3). The participants in each combined node are carefully deployed so that the pairing and joining algorithms in later rounds still produce only O(d) contention; however, communication within each subtree adds an factor to the cost of communication in the pairing algorithm that depends on the depth of the tree. 2An alternative assumption is that each process is only guaranteed to accept at least one message per time unit, with other messages waiting in a delivery queue. This yields similar time bounds, except that the running time must be multiplied by the contention to account for queuing delays. In Section 3.2, we show that Patricia trees [17], using a parallelized version of the Patricia tree merge procedure of Okasaki and Gill [18], are a good choice for the balanced tree data structure. Using Patricia trees, we obtain a sorted final data structure in time O(W log n) (or O((d+W) log n) in the lower-bandwidth synchronous model), with O(d) contention and O((d + W)n log n) messages. Finally, we briefly discuss constructing a ring (Section 3.4) and the effects of node departures and arrivals (Section 3.5). 3.1 Pairing The pairing problem has some similarities to the problem of finding a matching, but because we are not restricted in which nodes we pair off—except for the limits imposed by communication along edges of the knowledge graph—our algorithm can perform an initial pruning step that pairs off many of the nodes deterministically, leaving only a degree- 2 surviving subgraph. We then run a simple randomized matching algorithm on this subgraph using a coin-flipping technique similar to that of Law and Siu [16] to resolve con- flicts. From a very high-level perspective, the algorithm proceeds as follows. Start with an directed graph G with maximum degree d. Each node starts by sending a probe to all of its successors. The recipient of such a probe responds by accepting the first one and rejecting subsequent probes; in this way every node has at most one designated predecessor, producing a graph of designated predecessors G1 in which every node has in-degree at most 1. This graph is further pruned by having each node with two or more successors pair them off, leaving a graph G2 in which every node has both in-degree and out-degree at most 1. Each component in such a graph is either a line or a cycle, and a constant fraction of the nodes can be matched along edges by simply having each node that is not at the end of a line flip a coin to decide whether to pair with its remaining predecessor or successor, and pairing those adjacent nodes whose choices are consistent. A simple calculation shows that on average half of the nodes in G2 (and all nodes in G − G2) are paired by this procedure, from which we can deduce that about half of the nodes are paired on average in the worst case where G − G2 is empty. The algorithm sketched above can be implemented directly in a synchronous system where all nodes start simultaneously, because after the initial probing phase there is no confusion about the structure of the graph, and after a phase consisting of a known number of rounds any unmatched nodes can simply restart the protocol along with any supernodes resulting from merges in the previous phase. But in an asynchronous setting the situation is more complicated. While some of the early pruning steps can still be used (in particular, we still have each node accept and respond to a single designated predecessor), the final matching stages require more care. There are two main problems. The first is that no node can detect when a phase of the pairing protocol has finished, so that an unmatched node cannot detect the end of a pairing phase and restart the protocol. Instead, the best an unmatched node can hope for is that the faithless suitor who spurned it initially will return to accept its advances after it finishes digesting luckier candidates. But this creates the possibility of creating very long chains of nodes, each waiting for the next to finish a merge that is itself delayed by waiting for nodes further down the chain. This problem is compounded by the fact that a node that has not yet received a probe cannot tell whether it has no
predecessors or merely slow predecessors.If an unprobed For each node u,we assume that it maintains a neighbors node simply assumes that it has no predecessors and that it set N.,which induces an underlying graph G with edges should pair with any successor that accepts it,the adversary (u,v)for all pairs with vE Nu.Initially N consists of those can schedule events so that every node in a long chain only nodes whose identity u knows at the start of the protocol.It learns of its unrequited predecessor after it has committed is updated by adding any node that sends u a probe message. to its successor,creating the linear-time backlog described Neighbor sets are merged when two nodes join into a single above.On the other hand,if a node chooses to wait for supernode.A neighbor that refuses a proposal is removed: a predecessor to come,it may be left stuck in this state this prevents a slow node from being pestered by arbitrarily forever. large numbers of duplicate proposals from a faster neighbor, The solution is to retain the coin-flip by which a node since the faster neighbor will only try again after the slow chooses its orientation,but let the presence of a successor node has rejoined its neighbor set (by sending out a probe who wants to pair now override the wait for a predecessor message after completing a merge). that may never arrive.In addition,the successor-pairing The algorithm is described below.It has a main thread procedure is modified slightly:instead of pairing all succes- which is responsible for the main function of the pairing. sors,possibly leaving none,a process always saves the first and four daemon threads which are triggered by messages successor for itself and pairs off only subsequent pairs.This and responsible for state transitions.The execution of the may leave an odd successor that is not paired,but there is at daemon threads should be implemented to be atomic.which most one such node left out for each node that participates is quite reasonable because there is no waiting in the daemon in the (now very implicit)randomized matching protocol.If threads and our model ignores the running time of a process. this left-out node is waiting for its predecessor,it will even- tually be picked up after the predecessor merges with its For each node u: preferred successor. What makes all of this work is that the randomization 1.Let state-ISOLATED:let chosen be picked uni- breaks up long waiting chains:it is unlikely that a long chain formly at random from ipred,suce. of nodes will all be pointed the same way by their coin-flips. 2.For each v Eneighbors,send a message (u,v,probe) At the same time,opportunistic merging by nodes with the to v and wait for all replies; first available suitor prevents deadlocks in cycles,even if all nodes are pointed in the same direction,as some node's 3.Let v,v2,...v be the nodes that reply with 'ac- proposal gets in first. cept.'For each odd i less than k,send a mes- sage (u,vi,pair,v+1)to vi and (u,vi+1,pair,vi) 3.1.1 Details of the Algorithm to vi+1.If k is odd,let succ-vk,and send a Formally,each node can be in one of four different states, message (u,succ,no_pair)to the node succ;else depending on what messages it has received.The four dif let chosen-pred; ferent states and their attitudes towards incoming pairing 4.While (state=ISOLATED or state=PROBED)wait: proposals are described as follows: 5.If state=PROPOSING then: ISOLATED:The node has not yet received any probes Send a message (u,chosen,propose)to the and has no predecessor.So once it receives a proposal node chosen; from its successor,it can accept it immediately. If reply is (chosen,u,accept)then let state←-PAIRED and obj←-chosen: PROBED:The node has been probed,but not yet been told whether it is paired off by its predecessor, else if reply is (chosen,u,reject-propose)then nor it has a pairable successor.In this case,it waits let neighbors-neighbors-ichosen to find out what its predecessor will do with it.If else if reply is (chosen,u,paired)then it receives a proposal from its successor and its coin is do nothing but proceed; also pointed to its successor,it enters the PROPOSED 6.If state=PAIRED,then merge with obj: state.If the proposal conflicts with its coin,it refuses the proposal immediately. 7.Go to line 1. PROPOSED:The node has a waiting proposal,but Upon receiving message (v,u,probe)from node v do: has not yet been told whether it is paired off by its Let neighbors-neighborsUfv; predecessor.It defers responding to the proposal until If state=ISOLATED then: its state changes due to the notice from its predecessor. Let pred-v and state-PROBED: PROPOSING:The node has a predecessor and has Send a message (u,v,accept)to node v; been told by the predecessor that it is not paired off. else: So the node should actively send out a proposal in Send a message (u,v,reject)to node v. the direction indicated by its coin and accept immedi- ately any proposal that does not conflict with its coin. Upon receiving message (v,u,propose)from node v Proposals that conflict with its coin are refused imme- do the one of the following according to the value of state: diately. ISOLATED:Let state--PAIRED and obj+v; PAIRED:The node has been paired,either by its pre- reply with (u,v,accept); decessor or due to coins.It refuses any proposals (al- PROBED:If chosen=v,then let waiting-v and though it may later be available for new proposals once state-PROPOSED: it has completed a merge operation with its partner). if otherwise,reply with (u,v,reject-propose);
predecessors or merely slow predecessors. If an unprobed node simply assumes that it has no predecessors and that it should pair with any successor that accepts it, the adversary can schedule events so that every node in a long chain only learns of its unrequited predecessor after it has committed to its successor, creating the linear-time backlog described above. On the other hand, if a node chooses to wait for a predecessor to come, it may be left stuck in this state forever. The solution is to retain the coin-flip by which a node chooses its orientation, but let the presence of a successor who wants to pair now override the wait for a predecessor that may never arrive. In addition, the successor-pairing procedure is modified slightly: instead of pairing all successors, possibly leaving none, a process always saves the first successor for itself and pairs off only subsequent pairs. This may leave an odd successor that is not paired, but there is at most one such node left out for each node that participates in the (now very implicit) randomized matching protocol. If this left-out node is waiting for its predecessor, it will eventually be picked up after the predecessor merges with its preferred successor. What makes all of this work is that the randomization breaks up long waiting chains: it is unlikely that a long chain of nodes will all be pointed the same way by their coin-flips. At the same time, opportunistic merging by nodes with the first available suitor prevents deadlocks in cycles, even if all nodes are pointed in the same direction, as some node’s proposal gets in first. 3.1.1 Details of the Algorithm Formally, each node can be in one of four different states, depending on what messages it has received. The four different states and their attitudes towards incoming pairing proposals are described as follows: • ISOLATED: The node has not yet received any probes and has no predecessor. So once it receives a proposal from its successor, it can accept it immediately. • PROBED: The node has been probed, but not yet been told whether it is paired off by its predecessor, nor it has a pairable successor. In this case, it waits to find out what its predecessor will do with it. If it receives a proposal from its successor and its coin is also pointed to its successor, it enters the PROPOSED state. If the proposal conflicts with its coin, it refuses the proposal immediately. • PROPOSED: The node has a waiting proposal, but has not yet been told whether it is paired off by its predecessor. It defers responding to the proposal until its state changes due to the notice from its predecessor. • PROPOSING: The node has a predecessor and has been told by the predecessor that it is not paired off. So the node should actively send out a proposal in the direction indicated by its coin and accept immediately any proposal that does not conflict with its coin. Proposals that conflict with its coin are refused immediately. • PAIRED: The node has been paired, either by its predecessor or due to coins. It refuses any proposals (although it may later be available for new proposals once it has completed a merge operation with its partner). For each node u, we assume that it maintains a neighbors set Nu, which induces an underlying graph G with edges (u, v) for all pairs with v ∈ Nu. Initially Nu consists of those nodes whose identity u knows at the start of the protocol. It is updated by adding any node that sends u a probe message. Neighbor sets are merged when two nodes join into a single supernode. A neighbor that refuses a proposal is removed: this prevents a slow node from being pestered by arbitrarily large numbers of duplicate proposals from a faster neighbor, since the faster neighbor will only try again after the slow node has rejoined its neighbor set (by sending out a probe message after completing a merge). The algorithm is described below. It has a main thread which is responsible for the main function of the pairing, and four daemon threads which are triggered by messages and responsible for state transitions. The execution of the daemon threads should be implemented to be atomic, which is quite reasonable because there is no waiting in the daemon threads and our model ignores the running time of a process. For each node u: 1. Let state←ISOLATED; let chosen be picked uniformly at random from {pred, succ}. 2. For each v ∈neighbors, send a message (u, v, probe) to v and wait for all replies; 3. Let v1, v2, . . . vk be the nodes that reply with ‘accept.’ For each odd i less than k, send a message (u, vi, pair, vi+1) to vi and (u, vi+1, pair, vi) to vi+1. If k is odd, let succ← vk, and send a message (u, succ, no pair) to the node succ; else let chosen←pred; 4. While (state=ISOLATED or state=PROBED) wait; 5. If state=PROPOSING then: Send a message (u, chosen, propose) to the node chosen; If reply is (chosen, u, accept) then let state←PAIRED and obj←chosen; else if reply is (chosen, u, reject propose) then let neighbors←neighbors−{chosen} else if reply is (chosen, u, paired) then do nothing but proceed; 6. If state=PAIRED, then merge with obj; 7. Go to line 1. Upon receiving message (v, u, probe) from node v do: Let neighbors←neighbors∪{v}; If state=ISOLATED then: Let pred← v and state←PROBED; Send a message (u, v, accept) to node v; else: Send a message (u, v, reject) to node v. Upon receiving message (v, u, propose) from node v do the one of the following according to the value of state: ISOLATED: Let state←PAIRED and obj←v; reply with (u, v, accept); PROBED: If chosen=v, then let waiting← v and state←PROPOSED; if otherwise, reply with (u, v, reject propose);