Congestion Avoidance and Control∗

Van Jacobson†
Lawrence Berkeley Laboratory

Michael J. Karels‡
University of California at Berkeley

November, 1988

Introduction

Computer networks have experienced an explosive growth over the past few years and with that growth have come severe congestion problems. For example, it is now common to see internet gateways drop 10% of the incoming packets because of local buffer overflows. Our investigation of some of these problems has shown that much of the cause lies in transport protocol implementations (not in the protocols themselves): The ‘obvious’ ways to implement a window-based transport protocol can result in exactly the wrong behavior in response to network congestion. We give examples of ‘wrong’ behavior and describe some simple algorithms that can be used to make right things happen. The algorithms are rooted in the idea of achieving network stability by forcing the transport connection to obey a ‘packet conservation’ principle. We show how the algorithms derive from this principle and what effect they have on traffic over congested networks.

In October of ’86, the Internet had the first of what became a series of ‘congestion collapses’. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and two IMP hops) dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. In particular, we wondered if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was “yes”.

∗Note: This is a very slightly revised version of a paper originally presented at SIGCOMM ’88 [12]. If you wish to reference this work, please cite [12].
†This work was supported in part by the U.S. Department of Energy under Contract Number DE-AC03-76SF00098.
‡This work was supported by the U.S. Department of Commerce, National Bureau of Standards, under Grant Number 60NANB8D0830.
Since that time, we have put seven new algorithms into the 4BSD TCP:

(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn’s clamped retransmit backoff
(vii) fast retransmit

Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet.

This paper is a brief description of (i) – (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [16]. (vii) is described in a soon-to-be-published RFC (ARPANET “Request for Comments”).

Algorithms (i) – (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a ‘conservation of packets’ principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them.

By ‘conservation of packets’ we mean that for a connection ‘in equilibrium’, i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call ‘conservative’: A new packet isn’t put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion.¹ Observation of the Internet suggests that it was not particularly robust. Why the discrepancy?

¹A conservative flow means that for any given time, the integral of the packet density around the sender–receiver–sender loop is a constant. Since packets have to ‘diffuse’ around this loop, the integral is sufficiently continuous to be a Lyapunov function for the system. A constant function trivially meets the conditions for Lyapunov stability so the system is stable and any superposition of such systems is stable. (See [3], chap. 11–12 or [21], chap. 9 for excellent introductions to system stability theory.)

There are only three ways for packet conservation to fail:

1. The connection doesn’t get to equilibrium, or
2. A sender injects a new packet before an old packet has exited, or
3. The equilibrium can’t be reached because of resource limits along the path.

In the following sections, we treat each of these in turn.

1 Getting to Equilibrium: Slow-start

Failure (1) has to be from a connection that is either starting or restarting after a packet loss. Another way to look at the conservation property is to say that the sender uses acks as a ‘clock’ to strobe new packets into the network. Since the receiver can generate acks no faster than data packets can get through the network, the protocol is ‘self clocking’ (fig. 1). Self clocking systems automatically adjust to bandwidth and delay variations and have a wide dynamic range (important considering that TCP spans a range from 800 Mbps Cray channels to 1200 bps packet radio links). But the same thing that makes a self-clocked system stable when it’s running makes it hard to start — to get data flowing there must be acks to clock out packets but to get acks there must be data flowing.
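To make the self-clocking concrete, here is a small worked example; the 56 Kbps bottleneck and 512-byte packets are taken from configurations described later in the paper, and the figures are illustrative only. The minimum packet spacing on the bottleneck (Pb in fig. 1) is

    Pb = packet size / bottleneck bandwidth = (512 × 8) bits / 56 Kbps ≈ 73 ms

so acks return to the sender no faster than one every 73 ms. A sender that transmits a new packet only when an ack arrives is therefore held to about one packet per 73 ms, roughly 56 Kbps, the bottleneck rate, no matter how fast its local network is.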
Figure 1: Window Flow Control ‘Self-clocking’

[Diagram: a sender and a receiver on fast local nets joined by a slower long-haul link, with the packet and ack spacings labelled Pb, Pr, Ar, As and Ab.]

This is a schematic representation of a sender and receiver on high bandwidth networks connected by a slower, long-haul net. The sender is just starting and has shipped a window’s worth of packets, back-to-back. The ack for the first of those packets is about to arrive back at the sender (the vertical line at the mouth of the lower left funnel).

The vertical dimension is bandwidth, the horizontal dimension is time. Each of the shaded boxes is a packet. Bandwidth × Time = Bits so the area of each box is the packet size. The number of bits doesn’t change as a packet goes through the network so a packet squeezed into the smaller long-haul bandwidth must spread out in time. The time Pb represents the minimum packet spacing on the slowest link in the path (the bottleneck). As the packets leave the bottleneck for the destination net, nothing changes the inter-packet interval so on the receiver’s net packet spacing Pr = Pb. If the receiver processing time is the same for all packets, the spacing between acks on the receiver’s net Ar = Pr = Pb. If the time slot Pb was big enough for a packet, it’s big enough for an ack so the ack spacing is preserved along the return path. Thus the ack spacing on the sender’s net As = Pb. So, if packets after the first burst are sent only in response to an ack, the sender’s packet spacing will exactly match the packet time on the slowest link in the path.

To start the ‘clock’, we developed a slow-start algorithm to gradually increase the amount of data in-transit.² Although we flatter ourselves that the design of this algorithm is rather subtle, the implementation is trivial — one new state variable and three lines of code in the sender:

²Slow-start is quite similar to the CUTE algorithm described in [14]. We didn’t know about CUTE at the time we were developing slow-start but we should have — CUTE preceded our work by several months. When describing our algorithm at the Feb., 1987, Internet Engineering Task Force (IETF) meeting, we called it soft-start, a reference to an electronics engineer’s technique to limit in-rush current. The name slow-start was coined by John Nagle in a message to the IETF mailing list in March, ’87. This name was clearly superior to ours and we promptly adopted it.
Figure 2: The Chronology of a Slow-start

[Diagram: the time line cut into round-trip rows 0R, 1R, 2R and 3R; grey numbered boxes are data packets, white numbered boxes the corresponding acks, with ‘One Round Trip Time’ and ‘One Packet Time’ marked on the axes.]

The horizontal direction is time. The continuous time line has been chopped into one-round-trip-time pieces stacked vertically with increasing time going down the page. The grey, numbered boxes are packets. The white numbered boxes are the corresponding acks. As each ack arrives, two packets are generated: one for the ack (the ack says a packet has left the system so a new packet is added to take its place) and one because an ack opens the congestion window by one packet. It may be clear from the figure why an add-one-packet-to-window policy opens the window exponentially in time.

If the local net is much faster than the long haul net, the ack’s two packets arrive at the bottleneck at essentially the same time. These two packets are shown stacked on top of one another (indicating that one of them would have to occupy space in the gateway’s outbound queue). Thus the short-term queue demand on the gateway is increasing exponentially and opening a window of size W packets will require W/2 packets of buffer capacity at the bottleneck.

• Add a congestion window, cwnd, to the per-connection state.
• When starting or restarting after a loss, set cwnd to one packet.
• On each ack for new data, increase cwnd by one packet.
• When sending, send the minimum of the receiver’s advertised window and cwnd.
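The change to a real sender is indeed that small. Purely as an illustration of the four rules above, a sender-side sketch in C might look like the following; the structure and field names (tcb, snd_wnd, snd_cwnd, maxseg) are hypothetical, not the actual 4.3BSD identifiers, and the window is kept in bytes and opened one maximum-sized segment (one packet) at a time.

    /*
     * Sketch of the slow-start rules above.  Names are hypothetical,
     * not the 4.3BSD identifiers.
     */
    struct tcb {
        unsigned long snd_wnd;   /* receiver's advertised window (bytes) */
        unsigned long snd_cwnd;  /* congestion window (bytes) */
        unsigned long maxseg;    /* maximum segment size (bytes) */
    };

    /* On connection start, or restart after a packet loss: one packet. */
    void slowstart_restart(struct tcb *tp)
    {
        tp->snd_cwnd = tp->maxseg;
    }

    /* On each ack for new data: open the congestion window by one packet. */
    void slowstart_ack(struct tcb *tp)
    {
        tp->snd_cwnd += tp->maxseg;
    }

    /* When sending: never offer more than the smaller of the two windows. */
    unsigned long usable_window(struct tcb *tp)
    {
        return tp->snd_cwnd < tp->snd_wnd ? tp->snd_cwnd : tp->snd_wnd;
    }

With these rules each returning ack both replaces the packet that just left the network and adds one more, which is the doubling-per-round-trip behavior shown in fig. 2.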
Actually, the slow-start window increase isn’t that slow: it takes time R log₂ W where R is the round-trip-time and W is the window size in packets (fig. 2). This means the window opens quickly enough to have a negligible effect on performance, even on links with a large bandwidth–delay product. And the algorithm guarantees that a connection will source data at a rate at most twice the maximum possible on the path.
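As a worked illustration (using the 16 KB window of 32 packets from the connection traced in fig. 3; the numbers serve only as an example), slow-start opens that window in about

    R log₂ W = R log₂ 32 = 5 round-trip times

and, from the fig. 2 analysis, the bottleneck gateway may briefly need W/2 = 16 packets of buffering to absorb the last round of the start-up burst.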
Without slow-start, by contrast, when 10 Mbps Ethernet hosts talk over the 56 Kbps Arpanet via IP gateways, the first-hop gateway sees a burst of eight packets delivered at 200 times the path bandwidth. This burst of packets often puts the connection into a persistent failure mode of continuous retransmissions (figures 3 and 4).

Figure 3: Startup behavior of TCP without Slow-start

[Scatter plot: packet sequence number (KB), 0–70, versus send time (sec), 0–10; each dot is a packet.]

Trace data of the start of a TCP conversation between two Sun 3/50s running Sun OS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7). The window size for the connection was 16 KB (32 512-byte packets) and there were 30 packets of buffer available at the bottleneck gateway. The actual path contains six store-and-forward hops so the pipe plus gateway queue has enough capacity for a full window but the gateway queue alone does not.

Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis is the sequence number in the packet header. Thus a vertical array of dots indicates back-to-back packets and two dots with the same y but different x indicate a retransmit.

‘Desirable’ behavior on this graph would be a relatively smooth line of dots extending diagonally from the lower left to the upper right. The slope of this line would equal the available bandwidth. Nothing in this trace resembles desirable behavior. The dashed line shows the 20 KBps bandwidth available for this connection. Only 35% of this bandwidth was used; the rest was wasted on retransmits. Almost everything is retransmitted at least once and data from 54 to 58 KB is sent five times.

2 Conservation at equilibrium: round-trip timing

Once data is flowing reliably, problems (2) and (3) should be addressed. Assuming that the protocol implementation is correct, (2) must represent a failure of the sender’s retransmit timer. A good round trip time estimator, the core of the retransmit timer, is the single most