Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

Ion Stoica*, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan†
MIT Laboratory for Computer Science
chord@lcs.mit.edu
http://pdos.lcs.mit.edu/chord/

* University of California, Berkeley. istoica@cs.berkeley.edu
† Authors in reverse alphabetical order.

This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8933.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCOMM'01, August 27-31, 2001, San Diego, California, USA. Copyright 2001 ACM 1-58113-411-8/01/0008 ...$5.00.

Abstract

A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

1. Introduction

Peer-to-peer systems and applications are distributed systems without any centralized control or hierarchical organization, where the software running at each node is equivalent in functionality. A review of the features of recent peer-to-peer applications yields a long list: redundant storage, permanence, selection of nearby servers, anonymity, search, authentication, and hierarchical naming. Despite this rich set of features, the core operation in most peer-to-peer systems is efficient location of data items. The contribution of this paper is a scalable protocol for lookup in a dynamic peer-to-peer system with frequent node arrivals and departures.

The Chord protocol supports just one operation: given a key, it maps the key onto a node. Depending on the application using Chord, that node might be responsible for storing a value associated with the key. Chord uses a variant of consistent hashing [11] to assign keys to Chord nodes. Consistent hashing tends to balance load, since each node receives roughly the same number of keys, and involves relatively little movement of keys when nodes join and leave the system.

Previous work on consistent hashing assumed that nodes were aware of most other nodes in the system, making it impractical to scale to large numbers of nodes. In contrast, each Chord node needs "routing" information about only a few other nodes. Because the routing table is distributed, a node resolves the hash function by communicating with a few other nodes. In the steady state, in an N-node system, each node maintains information only about O(log N) other nodes, and resolves all lookups via O(log N) messages to other nodes.
Chord maintains its routing information as nodes join and leave the system; with high probability each such event results in no more than O(log^2 N) messages.

Three features that distinguish Chord from many other peer-to-peer lookup protocols are its simplicity, provable correctness, and provable performance. Chord is simple, routing a key through a sequence of O(log N) other nodes toward the destination. A Chord node requires information about O(log N) other nodes for efficient routing, but performance degrades gracefully when that information is out of date. This is important in practice because nodes will join and leave arbitrarily, and consistency of even O(log N) state may be hard to maintain. Only one piece of information per node need be correct in order for Chord to guarantee correct (though slow) routing of queries; Chord has a simple algorithm for maintaining this information in a dynamic environment.

The rest of this paper is structured as follows. Section 2 compares Chord to related work. Section 3 presents the system model that motivates the Chord protocol. Section 4 presents the base Chord protocol and proves several of its properties, while Section 5 presents extensions to handle concurrent joins and failures. Section 6 demonstrates our claims about Chord's performance through simulation and experiments on a deployed prototype. Finally, we outline items for future work in Section 7 and summarize our contributions in Section 8.

2. Related Work

While Chord maps keys onto nodes, traditional name and location services provide a direct mapping between keys and values. A value can be an address, a document, or an arbitrary data item. Chord can easily implement this functionality by storing each key/value pair at the node to which that key maps. For this reason and to make the comparison clearer, the rest of this section assumes a Chord-based service that maps keys onto values.

DNS provides a host name to IP address mapping [15]. Chord can provide the same service with the name representing the key and the associated IP address representing the value. Chord requires no special servers, while DNS relies on a set of special root
servers. DNS names are structured to reflect administrative boundaries; Chord imposes no naming structure. DNS is specialized to the task of finding named hosts or services, while Chord can also be used to find data objects that are not tied to particular machines.

The Freenet peer-to-peer storage system [4, 5], like Chord, is decentralized and symmetric and automatically adapts when hosts leave and join. Freenet does not assign responsibility for documents to specific servers; instead, its lookups take the form of searches for cached copies. This allows Freenet to provide a degree of anonymity, but prevents it from guaranteeing retrieval of existing documents or from providing low bounds on retrieval costs. Chord does not provide anonymity, but its lookup operation runs in predictable time and always results in success or definitive failure.

The Ohaha system uses a consistent hashing-like algorithm for mapping documents to nodes, and Freenet-style query routing [18]. As a result, it shares some of the weaknesses of Freenet. Archival Intermemory uses an off-line computed tree to map logical addresses to machines that store the data [3].

The Globe system [2] has a wide-area location service to map object identifiers to the locations of moving objects. Globe arranges the Internet as a hierarchy of geographical, topological, or administrative domains, effectively constructing a static world-wide search tree, much like DNS. Information about an object is stored in a particular leaf domain, and pointer caches provide search short cuts [22]. The Globe system handles high load on the logical root by partitioning objects among multiple physical root servers using hash-like techniques. Chord performs this hash function well enough that it can achieve scalability without also involving any hierarchy, though Chord does not exploit network locality as well as Globe.

The distributed data location protocol developed by Plaxton et al. [19], a variant of which is used in OceanStore [12], is perhaps the closest algorithm to the Chord protocol. It provides stronger guarantees than Chord: like Chord it guarantees that queries make a logarithmic number of hops and that keys are well balanced, but the Plaxton protocol also ensures, subject to assumptions about network topology, that queries never travel further in network distance than the node where the key is stored. The advantage of Chord is that it is substantially less complicated and handles concurrent node joins and failures well. The Chord protocol is also similar to Pastry, the location algorithm used in PAST [8]. However, Pastry is a prefix-based routing protocol, and differs in other details from Chord.

CAN uses a d-dimensional Cartesian coordinate space (for some fixed d) to implement a distributed hash table that maps keys onto values [20]. Each node maintains O(d) state, and the lookup cost is O(d N^(1/d)). Thus, in contrast to Chord, the state maintained by a CAN node does not depend on the network size N, but the lookup cost increases faster than log N. If d = log N, CAN lookup times and storage needs match Chord's. However, CAN is not designed to vary d as N (and thus log N) varies, so this match will only occur for the "right" N corresponding to the fixed d. CAN requires an additional maintenance protocol to periodically remap the identifier space onto nodes. Chord also has the advantage that its correctness is robust in the face of partially incorrect routing information.
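As a quick numeric check of the d = log N claim (an illustrative calculation, not from the paper, assuming base-2 logarithms), note that N^(1/log2 N) = 2, so CAN's d * N^(1/d) lookup cost and O(d) per-node state both reduce to O(log N):

import math

# With d = log2(N) dimensions, N^(1/d) = 2, so CAN's lookup cost d * N^(1/d)
# becomes 2 * log2(N) and its per-node state d becomes log2(N); both are O(log N).
for N in (2 ** 10, 2 ** 16, 2 ** 20):
    d = math.log2(N)
    print(f"N = 2^{int(d)}: CAN state ~ {d:.0f}, CAN lookup ~ {d * N ** (1 / d):.0f}, log2(N) = {d:.0f}")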
Chord's routing procedure may be thought of as a one-dimensional analogue of the Grid location system [14]. Grid relies on real-world geographic location information to route its queries; Chord maps its nodes to an artificial one-dimensional space within which routing is carried out by an algorithm similar to Grid's.

Chord can be used as a lookup service to implement a variety of systems, as discussed in Section 3. In particular, it can help avoid single points of failure or control that systems like Napster possess [17], and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts [10].

3. System Model

Chord simplifies the design of peer-to-peer systems and applications based on it by addressing these difficult problems:

Load balance: Chord acts as a distributed hash function, spreading keys evenly over the nodes; this provides a degree of natural load balance.

Decentralization: Chord is fully distributed: no node is more important than any other. This improves robustness and makes Chord appropriate for loosely-organized peer-to-peer applications.

Scalability: The cost of a Chord lookup grows as the log of the number of nodes, so even very large systems are feasible. No parameter tuning is required to achieve this scaling.

Availability: Chord automatically adjusts its internal tables to reflect newly joined nodes as well as node failures, ensuring that, barring major failures in the underlying network, the node responsible for a key can always be found. This is true even if the system is in a continuous state of change.

Flexible naming: Chord places no constraints on the structure of the keys it looks up: the Chord key-space is flat. This gives applications a large amount of flexibility in how they map their own names to Chord keys.

The Chord software takes the form of a library to be linked with the client and server applications that use it. The application interacts with Chord in two main ways. First, Chord provides a lookup(key) algorithm that yields the IP address of the node responsible for the key. Second, the Chord software on each node notifies the application of changes in the set of keys that the node is responsible for. This allows the application software to, for example, move corresponding values to their new homes when a new node joins.

The application using Chord is responsible for providing any desired authentication, caching, replication, and user-friendly naming of data. Chord's flat key space eases the implementation of these features. For example, an application could authenticate data by storing it under a Chord key derived from a cryptographic hash of the data. Similarly, an application could replicate data by storing it under two distinct Chord keys derived from the data's application-level identifier.

The following are examples of applications for which Chord would provide a good foundation:

Cooperative Mirroring, as outlined in a recent proposal [6]. Imagine a set of software developers, each of whom wishes to publish a distribution. Demand for each distribution might vary dramatically, from very popular just after a new release to relatively unpopular between releases. An efficient approach for this would be for the developers to cooperatively mirror each others' distributions. Ideally, the mirroring system would balance the load across all servers, replicate and cache the data, and ensure authenticity.
Such a system should be fully decentralized in the interests of reliability, and because there is no natural central administration.

Time-Shared Storage for nodes with intermittent connectivity. If a person wishes some data to be always available, but their
machine is only occasionally available, they can offer to store others' data while they are up, in return for having their data stored elsewhere when they are down. The data's name can serve as a key to identify the (live) Chord node responsible for storing the data item at any given time. Many of the same issues arise as in the Cooperative Mirroring application, though the focus here is on availability rather than load balance.

Distributed Indexes to support Gnutella- or Napster-like keyword search. A key in this application could be derived from the desired keywords, while values could be lists of machines offering documents with those keywords.

Large-Scale Combinatorial Search, such as code breaking. In this case keys are candidate solutions to the problem (such as cryptographic keys); Chord maps these keys to the machines responsible for testing them as solutions.

Figure 1: Structure of an example Chord-based distributed storage system.

Figure 1 shows a possible three-layered software structure for a cooperative mirror system. The highest layer would provide a file-like interface to users, including user-friendly naming and authentication. This "file system" layer might implement named directories and files, mapping operations on them to lower-level block operations. The next layer, a "block storage" layer, would implement the block operations. It would take care of storage, caching, and replication of blocks. The block storage layer would use Chord to identify the node responsible for storing a block, and then talk to the block storage server on that node to read or write the block.

4. The Base Chord Protocol

The Chord protocol specifies how to find the locations of keys, how new nodes join the system, and how to recover from the failure (or planned departure) of existing nodes. This section describes a simplified version of the protocol that does not handle concurrent joins or failures. Section 5 describes enhancements to the base protocol to handle concurrent joins and failures.

4.1 Overview

At its heart, Chord provides fast distributed computation of a hash function mapping keys to nodes responsible for them. It uses consistent hashing [11, 13], which has several good properties. With high probability the hash function balances load (all nodes receive roughly the same number of keys). Also with high probability, when an Nth node joins (or leaves) the network, only an O(1/N) fraction of the keys are moved to a different location; this is clearly the minimum necessary to maintain a balanced load.

Figure 2: An identifier circle consisting of the three nodes 0, 1, and 3. In this example, key 1 is located at node 1, key 2 at node 3, and key 6 at node 0.

Chord improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node. A Chord node needs only a small amount of "routing" information about other nodes. Because this information is distributed, a node resolves the hash function by communicating with a few other nodes. In an N-node network, each node maintains information only about O(log N) other nodes, and a lookup requires O(log N) messages. Chord must update the routing information when a node joins or leaves the network; a join or leave requires O(log^2 N) messages.
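To make the successor-based assignment concrete, here is a minimal, self-contained sketch (hypothetical helper names, not code from the Chord distribution) that maps values to m-bit identifiers with SHA-1 and assigns each key to its successor node, as formalized in Section 4.2 below. Run on the Figure 2 example (m = 3, nodes 0, 1, and 3), it reproduces the stated placement of keys 1, 2, and 6.

import hashlib
from bisect import bisect_left

def chord_id(value: bytes, m: int) -> int:
    # Hash an arbitrary value (a node's IP address or a key) to an m-bit identifier.
    return int.from_bytes(hashlib.sha1(value).digest(), "big") % (2 ** m)

def successor(key_id: int, node_ids: list[int], m: int) -> int:
    # First node whose identifier is equal to or follows key_id on the circle modulo 2^m.
    ring = sorted(node_ids)
    idx = bisect_left(ring, key_id % (2 ** m))
    return ring[idx % len(ring)]          # wrap around past the highest identifier

if __name__ == "__main__":
    m, ring = 3, [0, 1, 3]                # the identifier circle of Figure 2
    for key in (1, 2, 6):
        print(f"key {key} -> node {successor(key, ring, m)}")
    # prints: key 1 -> node 1, key 2 -> node 3, key 6 -> node 0
    # chord_id(b"192.168.0.9:4567", m) illustrates deriving an identifier from an address.

A real deployment would use m = 160, the output length of SHA-1, rather than the tiny m = 3 of the example.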
4.2 Consistent Hashing

The consistent hash function assigns each node and key an m-bit identifier using a base hash function such as SHA-1 [9]. A node's identifier is chosen by hashing the node's IP address, while a key identifier is produced by hashing the key. We will use the term "key" to refer to both the original key and its image under the hash function, as the meaning will be clear from context. Similarly, the term "node" will refer to both the node and its identifier under the hash function. The identifier length m must be large enough to make the probability of two nodes or keys hashing to the same identifier negligible.

Consistent hashing assigns keys to nodes as follows. Identifiers are ordered in an identifier circle modulo 2^m. Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space. This node is called the successor node of key k, denoted by successor(k). If identifiers are represented as a circle of numbers from 0 to 2^m - 1, then successor(k) is the first node clockwise from k.

Figure 2 shows an identifier circle with m = 3. The circle has three nodes: 0, 1, and 3. The successor of identifier 1 is node 1, so key 1 would be located at node 1. Similarly, key 2 would be located at node 3, and key 6 at node 0.

Consistent hashing is designed to let nodes enter and leave the network with minimal disruption. To maintain the consistent hashing mapping when a node n joins the network, certain keys previously assigned to n's successor now become assigned to n. When node n leaves the network, all of its assigned keys are reassigned to n's successor. No other changes in assignment of keys to nodes need occur. In the example above, if a node were to join with identifier 7, it would capture the key with identifier 6 from the node with identifier 0.

The following results are proven in the papers that introduced consistent hashing [11, 13]:

THEOREM 1. For any set of N nodes and K keys, with high probability:

1. Each node is responsible for at most (1 + ε)K/N keys.
2. When an (N + 1)st node joins or leaves the network, responsibility for O(K/N) keys changes hands (and only to or from the joining or leaving node).

When consistent hashing is implemented as described above, the theorem proves a bound of ε = O(log N). The consistent hashing paper shows that ε can be reduced to an arbitrarily small constant by having each node run O(log N) "virtual nodes," each with its own identifier.

The phrase "with high probability" bears some discussion. A simple interpretation is that the nodes and keys are randomly chosen, which is plausible in a non-adversarial model of the world. The probability distribution is then over random choices of keys and nodes, and says that such a random choice is unlikely to produce an unbalanced distribution. One might worry, however, about an adversary who intentionally chooses keys to all hash to the same identifier, destroying the load balancing property. The consistent hashing paper uses "k-universal hash functions" to provide certain guarantees even in the case of nonrandom keys.

Rather than using a k-universal hash function, we chose to use the standard SHA-1 function as our base hash function. This makes our protocol deterministic, so that the claims of "high probability" no longer make sense. However, producing a set of keys that collide under SHA-1 can be seen, in some sense, as inverting, or "decrypting," the SHA-1 function. This is believed to be hard to do. Thus, instead of stating that our theorems hold with high probability, we can claim that they hold "based on standard hardness assumptions."

For simplicity (primarily of presentation), we dispense with the use of virtual nodes. In this case, the load on a node may exceed the average by (at most) an O(log N) factor with high probability (or in our case, based on standard hardness assumptions). One reason to avoid virtual nodes is that the number needed is determined by the number of nodes in the system, which may be difficult to determine. Of course, one may choose to use an a priori upper bound on the number of nodes in the system; for example, we could postulate at most one Chord server per IPv4 address. In this case running 32 virtual nodes per physical node would provide good load balance.

4.3 Scalable Key Location

A very small amount of routing information suffices to implement consistent hashing in a distributed environment. Each node need only be aware of its successor node on the circle. Queries for a given identifier can be passed around the circle via these successor pointers until they first encounter a node that succeeds the identifier; this is the node the query maps to. A portion of the Chord protocol maintains these successor pointers, thus ensuring that all lookups are resolved correctly. However, this resolution scheme is inefficient: it may require traversing all N nodes to find the appropriate mapping. To accelerate this process, Chord maintains additional routing information. This additional information is not essential for correctness, which is achieved as long as the successor information is maintained correctly.

As before, let m be the number of bits in the key/node identifiers. Each node, n, maintains a routing table with (at most) m entries, called the finger table. The i-th entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle, i.e., s = successor(n + 2^(i-1)), where 1 ≤ i ≤ m (and all arithmetic is modulo 2^m).
We call node s the i-th finger of node n, and denote it by n.finger[i].node (see Table 1). A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node. Note that the first finger of n is its immediate successor on the circle; for convenience we often refer to it as the successor rather than the first finger.

Notation            Definition
finger[k].start     (n + 2^(k-1)) mod 2^m, 1 ≤ k ≤ m
finger[k].interval  [finger[k].start, finger[k+1].start)
finger[k].node      first node ≥ n.finger[k].start
successor           the next node on the identifier circle; finger[1].node
predecessor         the previous node on the identifier circle

Table 1: Definition of variables for node n, using m-bit identifiers.

In the example shown in Figure 3(b), the finger table of node 1 points to the successor nodes of identifiers (1 + 2^0) mod 2^3 = 2, (1 + 2^1) mod 2^3 = 3, and (1 + 2^2) mod 2^3 = 5, respectively. The successor of identifier 2 is node 3, as this is the first node that follows 2, the successor of identifier 3 is (trivially) node 3, and the successor of 5 is node 0.

This scheme has two important characteristics. First, each node stores information about only a small number of other nodes, and knows more about nodes closely following it on the identifier circle than about nodes farther away. Second, a node's finger table generally does not contain enough information to determine the successor of an arbitrary key k. For example, node 3 in Figure 3 does not know the successor of 1, as 1's successor (node 1) does not appear in node 3's finger table.
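To make Table 1 concrete, the short sketch below (hypothetical helper names, assuming the m = 3 ring of Figure 3 with nodes 0, 1, and 3) computes finger[k].start, finger[k].interval, and finger[k].node for a node. For node 1 it yields starts 2, 3, and 5 mapping to nodes 3, 3, and 0; for node 3 every finger points to node 0, matching the observation that node 3 does not know node 1.

from bisect import bisect_left

M = 3
RING = [0, 1, 3]                              # node identifiers of Figure 3(b)

def successor(ident: int, ring=RING, m: int = M) -> int:
    # First node whose identifier is equal to or follows ident on the circle.
    nodes = sorted(ring)
    idx = bisect_left(nodes, ident % (2 ** m))
    return nodes[idx % len(nodes)]

def finger_table(n: int, m: int = M):
    # (start, interval, node) entries for node n, following Table 1.
    table = []
    for k in range(1, m + 1):
        start = (n + 2 ** (k - 1)) % (2 ** m)
        nxt = (n + 2 ** k) % (2 ** m)         # finger[k+1].start, wrapping modulo 2^m
        table.append((start, (start, nxt), successor(start)))
    return table

for n in (1, 3):
    print(f"node {n}:", [(s, f"[{a},{b})", node) for s, (a, b), node in finger_table(n)])
# node 1: [(2, '[2,3)', 3), (3, '[3,5)', 3), (5, '[5,1)', 0)]
# node 3: [(4, '[4,5)', 0), (5, '[5,7)', 0), (7, '[7,3)', 0)]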
What happens when a node n does not know the successor of a key k? If n can find a node whose ID is closer than its own to k, that node will know more about the identifier circle in the region of k than n does. Thus n searches its finger table for the node j whose ID most immediately precedes k, and asks j for the node it knows whose ID is closest to k. By repeating this process, n learns about nodes with IDs closer and closer to k.

The pseudocode that implements the search process is shown in Figure 4. The notation n.foo() stands for the function foo() being invoked at and executed on node n. Remote calls and variable references are preceded by the remote node identifier, while local variable references and procedure calls omit the local node. Thus n.foo() denotes a remote procedure call on node n, while n.bar, without parentheses, is an RPC to look up a variable bar on node n. find_successor works by finding the immediate predecessor node of the desired identifier; the successor of that node must be the successor of the identifier. We implement find_predecessor explicitly, because it is used later to implement the join operation (Section 4.4).

When node n executes find_predecessor, it contacts a series of nodes moving forward around the Chord circle towards id. If node n contacts a node n' such that id falls between n' and the successor of n', find_predecessor is done and returns n'. Otherwise node n asks n' for the node n' knows about that most closely precedes id. Thus the algorithm always makes progress towards the predecessor of id.

As an example, consider the Chord ring in Figure 3(b). Suppose node 3 wants to find the successor of identifier 1. Since 1 belongs to the circular interval [7, 3), it belongs to 3.finger[3].interval; node 3 therefore checks the third entry in its finger table, which is 0. Because 0 precedes 1, node 3 will ask node 0 to find the successor of 1. In turn, node 0 will infer from its finger table that 1's successor is the node 1 itself, and return node 1 to node 3.

The finger pointers at repeatedly doubling distances around the circle cause each iteration of the loop in find_predecessor to halve the distance to the target identifier. From this intuition follows a theorem:
Figure 3: (a) The finger intervals associated with node 1. (b) Finger tables and key locations for a net with nodes 0, 1, and 3, and keys 1, 2, and 6.

THEOREM 2. With high probability (or under standard hardness assumptions), the number of nodes that must be contacted to find a successor in an N-node network is O(log N).

PROOF. Suppose that node n wishes to resolve a query for the successor of k. Let p be the node that immediately precedes k. We analyze the number of query steps to reach p.

Recall that if n ≠ p, then n forwards its query to the closest predecessor of k in its finger table. Suppose that node p is in the i-th finger interval of node n. Then since this interval is not empty, node n will finger some node f in this interval. The distance (number of identifiers) between n and f is at least 2^(i-1). But f and p are both in n's i-th finger interval, which means the distance between them is at most 2^(i-1). This means f is closer to p than to n, or equivalently, that the distance from f to p is at most half the distance from n to p.

If the distance between the node handling the query and the predecessor p halves in each step, and is at most 2^m initially, then within m steps the distance will be one, meaning we have arrived at p.

In fact, as discussed above, we assume that node and key identifiers are random. In this case, the number of forwardings necessary will be O(log N) with high probability. After log N forwardings, the distance between the current query node and the key k will be reduced to at most 2^m/N. The expected number of node identifiers landing in a range of this size is 1, and it is O(log N) with high probability. Thus, even if the remaining steps advance by only one node at a time, they will cross the entire remaining interval and reach key k within another O(log N) steps.

In the section reporting our experimental results (Section 6), we will observe (and justify) that the average lookup time is (1/2) log N.

// ask node n to find id's successor
n.find_successor(id)
  n' = find_predecessor(id);
  return n'.successor;

// ask node n to find id's predecessor
n.find_predecessor(id)
  n' = n;
  while (id not in (n', n'.successor])
    n' = n'.closest_preceding_finger(id);
  return n';

// return closest finger preceding id
n.closest_preceding_finger(id)
  for i = m downto 1
    if (finger[i].node in (n, id))
      return finger[i].node;
  return n;

Figure 4: The pseudocode to find the successor node of an identifier id. Remote procedure calls and variable lookups are preceded by the remote node.
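For readers who want to trace Figure 4 concretely, here is a single-process sketch (hypothetical names; real Chord issues these calls as RPCs between machines, and builds fingers through the join protocol rather than statically): closest_preceding_finger scans the finger table from the highest entry down, find_predecessor walks forward around the circle, and find_successor returns that predecessor's successor. On the ring of Figure 3(b) it reproduces the example above, with node 3 resolving identifier 1 to node 1.

from bisect import bisect_left

M = 3                                        # identifier bits (SHA-1 would give m = 160)

class Node:
    def __init__(self, ident: int):
        self.id = ident
        self.finger = []                     # finger[i].node for i = 1..M (0-indexed list)
        self.successor = None                # alias for finger[1].node

def build_ring(ids, m=M):
    # Statically wire fingers for a fixed set of nodes (no joins); purely illustrative.
    ring = sorted(ids)
    nodes = {i: Node(i) for i in ring}
    def succ(ident):
        idx = bisect_left(ring, ident % (2 ** m))
        return nodes[ring[idx % len(ring)]]
    for n in nodes.values():
        n.finger = [succ(n.id + 2 ** (i - 1)) for i in range(1, m + 1)]
        n.successor = n.finger[0]
    return nodes

def in_interval(x, a, b, inclusive_right=False):
    # Membership in the circular interval (a, b), or (a, b] if inclusive_right is set.
    if a < b:
        return a < x < b or (inclusive_right and x == b)
    return x > a or x < b or (inclusive_right and x == b)   # interval wraps past zero

def closest_preceding_finger(n, ident):
    for f in reversed(n.finger):             # i = m downto 1
        if in_interval(f.id, n.id, ident):
            return f
    return n

def find_predecessor(n, ident):
    cur = n
    while not in_interval(ident, cur.id, cur.successor.id, inclusive_right=True):
        cur = closest_preceding_finger(cur, ident)
    return cur

def find_successor(n, ident):
    return find_predecessor(n, ident).successor

nodes = build_ring([0, 1, 3])
print(find_successor(nodes[3], 1).id)        # node 3 resolves identifier 1 -> node 1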
4.4 Node Joins

In a dynamic network, nodes can join (and leave) at any time. The main challenge in implementing these operations is preserving the ability to locate every key in the network. To achieve this goal, Chord needs to preserve two invariants:

1. Each node's successor is correctly maintained.

2. For every key k, node successor(k) is responsible for k.

In order for lookups to be fast, it is also desirable for the finger tables to be correct.

This section shows how to maintain these invariants when a single node joins. We defer the discussion of multiple nodes joining simultaneously to Section 5, which also discusses how to handle a node failure. Before describing the join operation, we summarize its performance (the proof of this theorem is in the companion technical report [21]):

THEOREM 3. With high probability, any node joining or leaving an N-node Chord network will use O(log^2 N) messages to re-establish the Chord routing invariants and finger tables.

To simplify the join and leave mechanisms, each node in Chord maintains a predecessor pointer. A node's predecessor pointer contains the Chord identifier and IP address of the immediate predecessor of that node, and can be used to walk counterclockwise around the identifier circle.

To preserve the invariants stated above, Chord must perform three tasks when a node n joins the network:

1. Initialize the predecessor and fingers of node n.

2. Update the fingers and predecessors of existing nodes to reflect the addition of n.

3. Notify the higher layer software so that it can transfer state (e.g. values) associated with keys that node n is now responsible for.

We assume that the new node learns the identity of an existing Chord node n' by some external mechanism. Node n uses n' to