Making Gnutella-like P2P Systems Scalable

Yatin Chawathe, AT&T Labs–Research, yatin@research.att.com
Sylvia Ratnasamy, Intel Research, sylvia@intel-research.net
Lee Breslau, AT&T Labs–Research, breslau@research.att.com
Nick Lanham, UC Berkeley, nickl@cs.berkeley.edu
Scott Shenker∗, ICSI, shenker@icsi.berkeley.edu

ABSTRACT

Napster pioneered the idea of peer-to-peer file sharing, and supported it with a centralized file search facility. Subsequent P2P systems like Gnutella adopted decentralized search algorithms. However, Gnutella's notoriously poor scaling led some to propose distributed hash table solutions to the wide-area file search problem. Contrary to that trend, we advocate retaining Gnutella's simplicity while proposing new mechanisms that greatly improve its scalability. Building upon prior research [1, 12, 22], we propose several modifications to Gnutella's design that dynamically adapt the overlay topology and the search algorithms in order to accommodate the natural heterogeneity present in most peer-to-peer systems. We test our design through simulations and the results show three to five orders of magnitude improvement in total system capacity. We also report on a prototype implementation and its deployment on a testbed.

Categories and Subject Descriptors

C.2 [Computer Communication Networks]: Distributed Systems

General Terms

Algorithms, Design, Performance, Experimentation

Keywords

Peer-to-peer, distributed hash tables, Gnutella

1. INTRODUCTION

The peer-to-peer file-sharing revolution started with the introduction of Napster in 1999. Napster was the first system to recognize that requests for popular content need not be sent to a central server but instead could be handled by the many hosts, or peers, that already possess the content. Such serverless peer-to-peer systems can achieve astounding aggregate download capacities without requiring any additional expenditure for bandwidth or server farms.¹ Moreover, such P2P file-sharing systems are self-scaling in that as more peers join the system to look for files, they add to the aggregate download capability as well.²

However, to make use of this self-scaling behavior, a node looking for files must find the peers that have the desired content. Napster used a centralized search facility based on file lists provided by each peer. By centralizing search (which does not require much bandwidth) while distributing download (which does), Napster achieved a highly functional hybrid design. The resulting system was widely acknowledged as "the fastest growing Internet application ever" [4]. But RIAA's lawsuit forced Napster to shut down, and its various centralized-search successors have faced similar legal challenges.

∗Supported in part by NSF grants ITR-0205519, ANI-0207399, ITR-0121555, ITR-0081698, ITR-0225660 and ANI-0196514.
¹For instance, 100,000 peers all connected at 56kbps can provide more aggregate download capacity than a single server farm connected by two OC-48 links.

SIGCOMM'03, August 25–29, 2003, Karlsruhe, Germany. Copyright 2003 ACM 1-58113-735-4/03/0008.
These centralized systems have been replaced by new decentralized systems such as Gnutella [8] that distribute both the download and search capabilities. These systems establish an overlay network of peers. Queries are not sent to a central site, but are instead distributed among the peers. Gnutella, the first of such systems, uses an unstructured overlay network in that the topology of the overlay network and the placement of files within it are largely unconstrained. It floods each query across this overlay with a limited scope. Upon receiving a query, each peer sends a list of all content matching the query to the originating node. Because the load on each node grows linearly with the total number of queries, which in turn grows with system size, this approach is clearly not scalable.

Following Gnutella's lead, several other decentralized file-sharing systems such as KaZaA [24] have become popular. KaZaA is based on the proprietary Fasttrack technology, which uses specially designated supernodes that have higher bandwidth connectivity. Pointers to each peer's data are stored on an associated supernode, and all queries are routed to supernodes. While this approach appears to offer better scaling than Gnutella, its design has been neither documented nor analyzed. Recently, there have been proposals to incorporate this approach into the Gnutella network [7]. Although some Gnutella clients now implement the supernode proposal, its scalability has neither been measured nor analyzed.

That said, we believe that the supernode approach popularized by KaZaA is a step in the right direction for building scalable file-sharing systems. In this paper, we leverage this idea of exploiting node heterogeneity, but make the selection of "supernodes" and the construction of the topology around them more dynamic and adaptive. We present a new P2P file-sharing system, called Gia.³ Like Gnutella and KaZaA, Gia is decentralized and unstructured. However, its unique design achieves an aggregate system capacity that is three to five orders of magnitude better than that of Gnutella as well as that of other attempts to improve Gnutella [12, 24]. As such, it retains the simplicity of an unstructured system while offering vastly improved scalability.

²This self-scaling property is mitigated to some extent by the free rider problem observed in such systems [2].
³Gia is short for gianduia, which is the generic name for the hazelnut spread, Nutella.
The design of Gia builds on a substantial body of previous work. As in the recent work by Lv et al. [12], Gia replaces Gnutella's flooding with random walks. Following the work of Adamic et al. [1], Gia recognizes the implications of the overlay network's topology while using random walks and therefore includes a topology adaptation algorithm. Similarly, the lack of flow control has been recognized as a weakness in the original Gnutella design [16], and Gia introduces a token-based flow control algorithm. Finally, like KaZaA, Gia recognizes that there is significant heterogeneity in peer bandwidth and incorporates heterogeneity into each aspect of our design.

While Gia does build on these previous contributions, Gia is, to our knowledge, the first open design that (a) combines all these elements, and (b) recognizes the fact that peers have capacity constraints and adapts its protocols to account for these constraints. Our simulations suggest that this results in a tremendous boost for Gia's system performance. Moreover, this performance improvement comes not just from a single design decision but from the synergy among the various design features.

We discuss Gia's design in Section 3, its performance in Section 4, and a prototype implementation and associated practical issues in Section 5. However, before embarking on the description of Gia, we first ask why not just use Distributed Hash Tables (DHTs).

2. WHY NOT DHTS?

Distributed Hash Tables are a class of recently developed systems that provide hash-table-like semantics at Internet scale [25, 18, 27]. Much (although not all) of the original rationale for DHTs was to provide a scalable replacement for unscalable Gnutella-like file-sharing systems. The past few years have seen a veritable frenzy of research activity in this field, with many design proposals and suggested applications. All of these proposals use structured overlay networks where both the data placement and overlay topology are tightly controlled. The hash-table-like lookup() operation provided by DHTs typically requires only O(log n) steps, whereas, in comparison, Gnutella requires O(n) steps to reliably locate a specific file.

Given this level of performance gain afforded by DHTs, it is natural to ask why bother with Gia when DHTs are available. To answer this question, we review three relevant aspects of P2P file sharing.

#1: P2P clients are extremely transient. Measured activity in Gnutella and Napster indicates that the median up-time for a node is 60 minutes [22].⁴ For large systems of, say, 100,000 nodes, this implies a churn rate of over 1600 nodes coming and going per minute (100,000 nodes ÷ 60 minutes ≈ 1,667 departures, and as many arrivals, every minute). Churn causes little problem for Gnutella and other systems that employ unstructured overlay networks as long as a peer doesn't become disconnected by the loss of all of its neighbors, and even in that case the peer can merely repeat the bootstrap procedure to re-join the network. In contrast, churn does cause significant overhead for DHTs. In order to preserve the efficiency and correctness of routing, most DHTs require O(log n) repair operations after each failure.
Graceless failures, where a node fails without beforehand informing its neighbors and transferring the relevant state, require more time and work in DHTs to (a) discover the failure and (b) re-replicate the lost data or pointers. If the churn rate is too high, the overhead caused by these repair operations can become substantial and could easily overwhelm nodes with low-bandwidth dial-up connections.

⁴We understand that there is some recently published work [3] that questions the exact numbers in this study, but the basic point remains that the peer population is still quite transient.

#2: Keyword searches are more prevalent, and more important, than exact-match queries. DHTs excel at supporting exact-match lookups: given the exact name of a file, they translate the name into a key and perform the corresponding lookup(key) operation. However, DHTs are less adept at supporting keyword searches: given a sequence of keywords, find files that match them. The current use of P2P file-sharing systems, which revolves around sharing music and video, requires such keyword matching. For example, to find the song "Ray of Light" by Madonna, a user typically submits a search of the form "madonna ray of light" and expects the file-sharing system to locate files that match all of the keywords in the search query. This is especially important since there is no unambiguous naming convention for file names in P2P systems, and thus often the same piece of content is stored by different nodes under several (slightly different) names.

Supporting such keyword searching on top of DHTs is a non-trivial task. For example, the typical approach [11, 19, 26] of constructing an inverted index per keyword can be expensive to maintain in the face of frequent node (and hence file) churn. This is only further complicated by the additional caching algorithms needed to avoid overloading nodes that store the index for popular keywords. It is possible that some of these problems may be addressable in DHTs, as indicated by the deployment of the Overnet file-sharing application [15], which is based on the Kademlia DHT [14]. Still, DHT-based solutions typically need to go to great lengths to incorporate query models beyond the simple exact-match search. In contrast, Gnutella and other similar systems effortlessly support keyword searches and other complex queries since all such searches are executed locally on a node-by-node basis.

#3: Most queries are for hay, not needles. DHTs have exact recall, in that knowing the name of a file allows you to find it, even if there is only a single copy of that file in the system. In contrast, Gnutella cannot reliably find single copies of files unless the flooded query reaches all nodes; we call such files needles. However, we expect that most queries in the popular P2P file-sharing systems are for relatively well-replicated files, which we call hay. By the very nature of P2P file-sharing, if a file is requested frequently, then as more and more requesters download the file to their machines, there will be many copies of it within the system. We call such systems, where most queries are for well-replicated content, mass-market file-sharing systems. Gnutella can easily find well-replicated files. Thus, if most searches are for hay, not needles, then Gnutella's lack of exact recall is not a significant disadvantage.
To verify our conjecture that most queries are indeed for hay, we gathered traces of queries and download requests using an instrumented Gnutella client. Our tracing tool crawled the Gnutella network searching for files that match the top 50 query requests seen. After gathering the file names and the number of available copies of each of these files, the tool turned around and offered the same files for download to other Gnutella clients. We then measured the number of download requests seen by the tracing tool for this offered content. Figure 1 shows the distribution of the download requests versus the number of available replicas. We notice that most of the requests correspond to files that have a large number of available replicas.⁵ For example, half of the requests were for files with more than 100 replicas, and approximately 80% of the requests were for files with more than 80 replicas.

[Figure 1: Most download requests are for well-replicated files. The plot shows the number of download requests (0 to 900) against the number of available replicas (1 to 256, log scale).]

⁵Note that since the tracing tool only captures the download requests that came directly to it, we miss all of the requests that went to the other nodes that also had copies of the same file. Thus our numbers can only be a lower bound on how popular well-replicated content is.

In summary, Gnutella-like designs are more robust in the face of transients and support general search facilities, both important properties for P2P file sharing. They are less adept than DHTs at finding needles, but this may not matter since most P2P queries are for hay. Thus, we conjecture that for mass-market file-sharing applications, improving the scalability of unstructured P2P systems, rather than turning to DHT-based systems, may be the better approach.

3. GIA DESIGN

Gnutella-like systems have one basic problem: when faced with a high aggregate query rate, nodes quickly become overloaded and the system ceases to function satisfactorily. Moreover, this problem gets worse as the size of the system increases. Our first goal in designing Gia is to create a Gnutella-like P2P system that can handle much higher aggregate query rates. Our second goal is to have Gia continue to function well with increasing system sizes. To achieve this scalability, Gia strives to avoid overloading any of the nodes by explicitly accounting for their capacity constraints. In an earlier workshop paper [13], we presented a preliminary proposal for incorporating capacity awareness into Gnutella. In our current work, we refine those ideas and present a thorough design, detailed algorithms, and a prototype implementation of the new system. We begin with an overview of the reasoning behind our system design and then provide a detailed discussion of the various components and protocols.

3.1 Design Rationale

The Gnutella protocol [6] uses a flooding-based search method to find files within its P2P network. To locate a file, a node queries each of its neighbors, which in turn propagate the query to their neighbors, and so on until the query reaches all of the clients within a certain radius from the original querier. Although this approach can locate files even if they are replicated at an extremely small number of nodes, it has obvious scaling problems. To address this issue, Lv et al. [12] proposed replacing flooding with random walks. Random walks are a well-known technique in which a query message is forwarded to a randomly chosen neighbor at each step until sufficient responses to the query are found. Although they make better utilization of the P2P network than flooding, they have two associated problems:

1. A random walk is essentially a blind search in that at each step a query is forwarded to a random node without taking into account any indication of how likely it is that the node will have responses for the query.
2. If a random walker query arrives at a node that is already overloaded with traffic, it may get queued for a long time before it is handled.

Adamic et al. [1] addressed the first problem by recommending that instead of using purely random walks, the search protocol should bias its walks toward high-degree nodes. The intuition behind this is that if we arrange for neighbors to be aware of each other's shared files, high-degree nodes will have (pointers to) a large number of files and hence will be more likely to have an answer that matches the query. However, this approach ignores the problem of overloaded nodes. In fact, by always biasing the random walk towards high-degree nodes, it can exacerbate the problem if the high-degree node does not have the capacity to handle a large number of queries.

The design of Gia, on the other hand, explicitly takes into account the capacity constraints associated with each node in the P2P network. The capacity of a node depends upon a number of factors including its processing power, disk latencies, and access bandwidth. It is well-documented that nodes in networks like Gnutella exhibit significant heterogeneity in terms of their capacity to handle queries [22]. Yet, none of the prior work on scaling Gnutella-like systems leverages this heterogeneity. In the design of Gia, we explicitly accommodate (and even exploit) heterogeneity to achieve better scaling. The four key components of our design are summarized below:

• A dynamic topology adaptation protocol that puts most nodes within short reach of high-capacity nodes. The adaptation protocol ensures that the well-connected (i.e., high-degree) nodes, which receive a large proportion of the queries, actually have the capacity to handle those queries.

• An active flow control scheme to avoid overloaded hot-spots. The flow control protocol explicitly acknowledges the existence of heterogeneity and adapts to it by assigning flow-control tokens to nodes based on available capacity.

• One-hop replication of pointers to content. All nodes maintain pointers to the content offered by their immediate neighbors. Since the topology adaptation algorithm ensures a congruence between high-capacity nodes and high-degree nodes, the one-hop replication guarantees that high-capacity nodes are capable of providing answers to a greater number of queries.

• A search protocol based on biased random walks that directs queries towards high-capacity nodes, which are typically best able to answer the queries.

3.2 Detailed Design

The framework for the Gia client and protocols is modeled after the current Gnutella protocol [6]. Clients connect to each other using a three-way handshake protocol. All messages exchanged by clients are tagged at their origin with a globally unique identifier or GUID, which is a randomly generated sequence of 16 bytes. The GUID is used to track the progress of a message through the Gia network and to route responses back to the originating client.

We extend the Gnutella protocol to take into account client capacity and network heterogeneity. For this discussion, we assume that client capacity is a quantity that represents the number of queries that the client can handle per second.
In practice, the capacity will have to be determined as a function of a client's access bandwidth, processing power, disk speed, etc. We discuss the four protocol components in detail below.
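To make these two client-level pieces concrete, here is a minimal Python sketch of GUID generation (the 16-byte random identifier described above) and one possible capacity heuristic. The paper deliberately leaves the capacity function unspecified, so the particular combination of inputs and constants below is purely an illustrative assumption.

    import os

    def generate_guid() -> bytes:
        # A GUID is a randomly generated sequence of 16 bytes.
        return os.urandom(16)

    def estimate_capacity(bandwidth_kbps: float, cpu_qps: float, disk_qps: float) -> float:
        # Hypothetical heuristic: treat the most constrained resource as the
        # bottleneck on query throughput. The conversion of bandwidth to
        # queries/second (here, 10 kbps per query) is an assumed constant,
        # not a value from the paper.
        return min(bandwidth_kbps / 10.0, cpu_qps, disk_qps)

    guid = generate_guid()
    capacity = estimate_capacity(bandwidth_kbps=560.0, cpu_qps=120.0, disk_qps=80.0)  # -> 56.0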
3.2.1 Topology Adaptation

The topology adaptation algorithm is the core component that connects the Gia client to the rest of the network. In this section, we provide an overview of the adaptation process, while leaving the details of some of the specific mechanisms for discussion later in Section 5. When a node starts up, it uses bootstrapping mechanisms similar to those in Gnutella to locate other Gia nodes. Each Gia client maintains a host cache consisting of a list of other Gia nodes (their IP address, port number, and capacity). The host cache is populated throughout the lifetime of the client using a variety of rendezvous mechanisms including contacting well-known web-based host caches [5] and exchanging host information with neighbors through PING-PONG messages [6]. Entries in the host cache are marked as dead if connections to those hosts fail. Dead entries are periodically aged out.

The goal of the topology adaptation algorithm is to ensure that high-capacity nodes are indeed the ones with high degree and that low-capacity nodes are within short reach of higher-capacity ones. To achieve this goal, each node independently computes a level of satisfaction (S). This is a quantity between 0 and 1 that represents how satisfied a node is with its current set of neighbors. A value of S = 0 means that the node is quite dissatisfied, while S = 1 suggests that the node is fully satisfied. As long as a node is not fully satisfied, the topology adaptation continues to search for appropriate neighbors to improve the satisfaction level. Thus, when a node starts up and has fewer than some pre-configured minimum number of neighbors, it is in a dissatisfied state (S = 0). As it gathers more neighbors, its satisfaction level rises, until it decides that its current set of neighbors is sufficient to satisfy its capacity, at which point the topology adaptation becomes quiescent. In Section 5.2, we describe the details of the algorithm used to compute the satisfaction level.

To add a new neighbor, a node (say X) randomly selects a small number of candidate entries from those in its host cache that are not marked dead and are not already neighbors. From these randomly chosen entries, X selects the node with maximum capacity greater than its own capacity. If no such candidate entry exists, it selects one at random. Node X then initiates a three-way handshake to the selected neighbor, say Y. During the handshake, each node makes a decision whether or not to accept the other node as a new neighbor based upon the capacities and degrees of its existing neighbors and the new node. In order to accept the new node, we may need to drop an existing neighbor. Algorithm 1 shows the steps involved in making this determination.

    Let C_i represent the capacity of node i
    if num_nbrs_X + 1 ≤ max_nbrs then {we have room}
        ACCEPT Y; return
    {we need to drop a neighbor}
    subset ← {i ∀ i ∈ nbrs_X such that C_i ≤ C_Y}
    if no such neighbors exist then
        REJECT Y; return
    candidate Z ← highest-degree neighbor from subset
    if (C_Y > max(C_i ∀ i ∈ nbrs_X))      {Y has higher capacity}
       or (num_nbrs_Z > num_nbrs_Y + H)   {Y has fewer nbrs}
    then
        DROP Z; ACCEPT Y
    else
        REJECT Y

Algorithm 1: pick_neighbor_to_drop(X, Y): When node X tries to add Y as a new neighbor, determine whether there is room for Y. If not, pick one of X's existing neighbors to drop and replace it with Y. (In the algorithm, H represents a hysteresis factor.)

The algorithm works as follows. If, upon accepting the new connection, the total number of neighbors would still be within a pre-configured bound max_nbrs, then the connection is automatically accepted. Otherwise, the node must see if it can find an appropriate existing neighbor to drop and replace with the new connection. X always favors Y and drops an existing neighbor if Y has higher capacity than all of X's current neighbors. Otherwise, it decides whether to retain Y or not as follows. From all of X's neighbors that have capacity less than or equal to that of Y, we choose the neighbor Z that has the highest degree. This neighbor has the least to lose if X drops it in favor of Y. The neighbor will be dropped only if the new node Y has fewer neighbors than Z. This ensures that we do not drop already poorly-connected neighbors (which could get disconnected) in favor of well-connected ones.⁶ The topology adaptation algorithm thus tries to ensure that the adaptation process makes forward progress toward a stable state. Results from experiments measuring the topology adaptation process are discussed later in Section 5.4.

⁶To avoid having X flip back and forth between Y and Z, we add a level of hysteresis: we drop Z and add Y only if Y has at least H fewer neighbors than Z, where H represents the level of hysteresis. In our simulations and implementation, we set the value of H to 5.
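Algorithm 1 translates directly into executable form. The following Python rendering is a sketch under the assumption that each node is a small record carrying its capacity and neighbor set; the Node class and the max_nbrs value are ours, not part of the protocol specification.

    from dataclasses import dataclass, field

    MAX_NBRS = 10  # the max_nbrs bound; the specific value is an illustrative assumption
    H = 5          # hysteresis factor, the value used in the paper's experiments

    @dataclass(eq=False)  # identity-based hashing so nodes can live in sets
    class Node:
        capacity: float
        nbrs: set = field(default_factory=set)

    def pick_neighbor_to_drop(x: Node, y: Node):
        """Decide how X handles a connection attempt from Y (Algorithm 1).

        Returns "ACCEPT" if there is room, the neighbor Z to drop if Y should
        replace an existing neighbor, or None if Y is rejected.
        """
        if len(x.nbrs) + 1 <= MAX_NBRS:
            return "ACCEPT"  # we have room
        # We need to drop a neighbor: consider those no more capable than Y.
        subset = [i for i in x.nbrs if i.capacity <= y.capacity]
        if not subset:
            return None  # REJECT Y
        z = max(subset, key=lambda i: len(i.nbrs))  # highest-degree candidate
        if (y.capacity > max(i.capacity for i in x.nbrs)  # Y beats all current nbrs
                or len(z.nbrs) > len(y.nbrs) + H):        # Y has (H+) fewer nbrs than Z
            return z  # DROP Z; ACCEPT Y
        return None   # REJECT Y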
3.2.2 Flow Control

To avoid creating hot-spots or overloading any one node, Gia uses an active flow control scheme in which a sender is allowed to direct queries to a neighbor only if that neighbor has notified the sender that it is willing to accept queries from the sender. This is in contrast to most proposed Gnutella flow-control mechanisms [16], which are reactive in nature: receivers drop packets when they start to become overloaded; senders can infer the likelihood that a neighbor will drop packets based on responses that they receive from the neighbor, but there is no explicit feedback mechanism. These techniques may be acceptable when queries are flooded across the network, because even if a node drops a query, other copies of the query will propagate through the network. However, Gia uses random walks (to address scaling problems with flooding) to forward a single copy of each query. Hence, arbitrarily dropping queries is not an appropriate solution.

To provide better flow control, each Gia client periodically assigns flow-control tokens to its neighbors. Each token represents a single query that the node is willing to accept. Thus, a node can send a query to a neighbor only if it has received a token from that neighbor, thus avoiding overloaded neighbors. In the aggregate, a node allocates tokens at the rate at which it can process queries. If it receives queries faster than it can forward them (either because it is overloaded or because it has not received enough tokens from its neighbors), then it starts to queue up the excess queries. If this queue gets too long, it tries to reduce the inflow of queries by lowering its token allocation rate.

To provide an incentive for high-capacity nodes to advertise their true capacity, Gia clients assign tokens in proportion to the neighbors' capacities, rather than distributing them evenly between all neighbors. Thus, a node that advertises high capacity to handle incoming queries is in turn assigned more tokens for its own outgoing queries. We use a token assignment algorithm based on Start-time Fair Queuing (SFQ) [9]. Each neighbor is assigned a fair-queuing weight equal to its capacity. Neighbors that are not using any of their assigned tokens are marked as inactive and the left-over capacity is automatically redistributed proportionally between the remaining neighbors. As neighbors join and leave, the SFQ algorithm reconfigures its token allocation accordingly.⁷ Token assignment notifications can be sent to neighbors either as separate control messages or by piggy-backing on other messages.

⁷Details of the SFQ algorithm for proportional allocation can be found in [9].
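As an illustration of the capacity-proportional assignment, the sketch below splits one period's token budget among active neighbors in proportion to their advertised capacities. It is a simplified stand-in for the full SFQ machinery of [9], and the function and variable names are ours.

    def allocate_tokens(token_rate: float, nbr_capacity: dict, active: set) -> dict:
        # Neighbors not using their tokens are inactive; normalizing over active
        # neighbors only implicitly redistributes their left-over share.
        live = {n: c for n, c in nbr_capacity.items() if n in active}
        total = sum(live.values())
        if total == 0:
            return {n: 0.0 for n in nbr_capacity}
        return {n: token_rate * live.get(n, 0.0) / total for n in nbr_capacity}

    # A node that can process 100 queries/sec, with neighbor B currently idle:
    tokens = allocate_tokens(100.0, {"A": 10.0, "B": 1000.0, "C": 90.0}, active={"A", "C"})
    # -> {"A": 10.0, "B": 0.0, "C": 90.0}; B regains its share when it becomes active.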
3.2.3 One-hop Replication

To improve the efficiency of the search process, each Gia node actively maintains an index of the content of each of its neighbors. These indices are exchanged when neighbors establish connections to each other, and periodically updated with any incremental changes. Thus, when a node receives a query, it can respond not only with matches from its own content, but also provide matches from the content offered by all of its neighbors. When a neighbor is lost, either because it leaves the system, or due to topology adaptation, the index information for that neighbor gets flushed. This ensures that all index information remains mostly up-to-date and consistent throughout the lifetime of the node.

3.2.4 Search Protocol

The combination of topology adaptation (whereby high-capacity nodes have more neighbors) and one-hop replication (whereby nodes keep an index of their neighbors' shared files) ensures that high-capacity nodes can typically provide useful responses for a large number of queries. Hence, the Gia search protocol uses a biased random walk: rather than forwarding incoming queries to randomly chosen neighbors, a Gia node selects the highest-capacity neighbor for which it has flow-control tokens and sends the query to that neighbor. If it has no tokens from any neighbors, it queues the query until new tokens arrive.

We use TTLs to bound the duration of the biased random walks and book-keeping techniques to avoid redundant paths. With book-keeping, each query is assigned a unique GUID by its originator node. A node remembers the neighbors to which it has already forwarded queries for a given GUID. If a query with the same GUID arrives back at the node, it is forwarded to a different neighbor. This reduces the likelihood that a query traverses the same path twice. To ensure forward progress, if a node has already sent the query to all of its neighbors, it flushes the book-keeping state and starts re-using neighbors.

Each query has a MAX_RESPONSES parameter, the maximum number of matching answers that the query should search for. In addition to the TTL, query duration is bounded by MAX_RESPONSES: every time a node finds a matching response for a query, it decrements the MAX_RESPONSES in the query. Once MAX_RESPONSES hits zero, the query is discarded. Query responses are forwarded back to the originator along the reverse-path associated with the query. If the reverse-path is lost due to topology adaptation or if queries or responses are dropped because of node failure, we rely on recovery mechanisms described later in Section 5.3 to handle the loss.

Finally, since a node can generate a response either for its own files or for the files of one of its neighbors, we append to the forwarded query the addresses of the nodes that own those files. This ensures that the query does not produce multiple redundant responses for the same instance of a file; a response is generated only if the node that owns the matching file is not already listed in the query message. A sketch of one forwarding step, combining these mechanisms, follows.
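The sketch below combines the pieces just described: answer a query from the local one-hop index, then forward it to the highest-capacity neighbor for which we hold flow-control tokens, with the book-keeping and flush-and-reuse behavior described above. The data-structure layout (GiaNode, the query dictionary) and the two network stubs are our assumptions; the protocol itself does not prescribe them.

    from dataclasses import dataclass, field

    @dataclass
    class GiaNode:
        one_hop_index: dict = field(default_factory=dict)  # keyword -> set of owner addresses
        tokens: dict = field(default_factory=dict)         # neighbor -> tokens we hold from it
        nbr_capacity: dict = field(default_factory=dict)   # neighbor -> advertised capacity
        seen: dict = field(default_factory=dict)           # GUID -> neighbors already tried
        pending: list = field(default_factory=list)        # queries queued awaiting tokens

    def send_response(guid, owner):  # stub: reverse-path delivery omitted
        pass

    def forward(query, neighbor):    # stub: network send omitted
        pass

    def handle_query(node: GiaNode, query: dict, keywords: list):
        # Match against our own and our neighbors' files (assumes >= 1 keyword),
        # skipping owners already listed in the query to avoid redundant responses.
        matches = set.intersection(*(node.one_hop_index.get(k, set()) for k in keywords))
        for owner in matches - set(query["listed_owners"]):
            if query["max_responses"] <= 0:
                break
            query["listed_owners"].append(owner)
            send_response(query["guid"], owner)
            query["max_responses"] -= 1
        query["ttl"] -= 1
        if query["max_responses"] <= 0 or query["ttl"] <= 0:
            return  # query terminates
        # Biased step: highest-capacity neighbor with tokens, not yet tried for this GUID.
        tried = node.seen.setdefault(query["guid"], set())
        candidates = [n for n, t in node.tokens.items() if t > 0 and n not in tried]
        if not candidates and tried:
            tried.clear()  # all neighbors used: flush book-keeping and re-use them
            candidates = [n for n, t in node.tokens.items() if t > 0]
        if not candidates:
            node.pending.append(query)  # no tokens from any neighbor: queue the query
            return
        nxt = max(candidates, key=lambda n: node.nbr_capacity[n])
        tried.add(nxt)
        node.tokens[nxt] -= 1
        forward(query, nxt)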
4. SIMULATIONS

In this section, we use simulations to evaluate Gia and compare its performance to two other unstructured P2P systems. Thus our simulations refer to the following four models:
• FLOOD: Search using TTL-scoped flooding over random topologies. This represents the Gnutella model.

• RWRT: Search using random walks over random topologies. This represents the recommended search technique suggested by Lv et al. [12] for avoiding the scalability problems with flooding.

• SUPER: Search using supernode mechanisms [7, 24]. In this approach, we classify nodes as supernodes and non-supernodes. Queries are flooded only between supernodes.

• GIA: Search using the Gia protocol suite including topology adaptation, active flow control, one-hop replication, and biased random walks.

We first describe our simulation model and the metrics used for evaluating the performance of our algorithms. Then we report the results from a range of simulations. Our experiments focus on the aggregate system behavior in terms of its capacity to handle queries under a variety of conditions. We show how the individual components of our system (topology adaptation, flow control, one-hop replication, and searches based on biased random walks) and the synergies between them affect the total system capacity. Due to space limitations, we do not present detailed results evaluating trade-offs within each design component.

4.1 System Model

To capture the effect of query load on the system, the Gia simulator imposes capacity constraints on each of the nodes within the system. We model each node i as possessing a capacity Ci, which represents the number of messages (such as queries and add/drop requests for topology adaptation) that it can process per unit time. If a node receives queries from its neighbors at a rate higher than its capacity Ci (as can happen in the absence of flow control), then the excess queries are modeled as being queued in connection buffers until the receiving node can read the queries from those buffers.

For most of our simulations, we assign capacities to nodes based on a distribution that is derived from the measured bandwidth distributions for Gnutella as reported by Saroiu et al. [22]. Our capacity distribution has five levels of capacity, each separated by an order of magnitude, as shown in Table 1.

    Capacity level    Percentage of nodes
    1x                20%
    10x               45%
    100x              30%
    1000x             4.9%
    10000x            0.1%

Table 1: Gnutella-like node capacity distributions.

As described in [22], this distribution reflects the reality that a fair fraction of Gnutella clients have dial-up connections to the Internet, the majority are connected via cable-modem or DSL, and a small number of participants have high-speed connections. For the SUPER experiments, nodes with capacities 1000x and 10000x are designated as supernodes.

In addition to its capacity, each node i is assigned a query generation rate qi, which is the number of queries that node i generates per unit time. For our experiments, we assume that all nodes generate queries at the same rate (bounded, of course, by their capacities). When queries need to be buffered, they are held in queues. We model all incoming and outgoing queues as having infinite length. We realize that, in practice, queues are not infinite, but we make this assumption since the effect of dropping a query and adding it to an arbitrarily long queue is essentially the same.
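For instance, a simulator might draw node capacities from the Table 1 distribution as follows; this brief Python sketch uses only the standard library, and the function name is ours.

    import random

    # Capacity levels and node percentages from Table 1.
    LEVELS = [1, 10, 100, 1000, 10000]
    WEIGHTS = [20, 45, 30, 4.9, 0.1]

    def sample_capacities(n: int, seed: int = 0) -> list:
        # Draw n node capacities from the Gnutella-derived distribution
        # (five levels, each an order of magnitude apart).
        rng = random.Random(seed)
        return rng.choices(LEVELS, weights=WEIGHTS, k=n)

    caps = sample_capacities(10000)
    # Under the SUPER model, the 1000x and 10000x nodes are designated supernodes:
    supernodes = [c for c in caps if c >= 1000]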