Figure 4: Startup behavior of TCP with Slow-start

[Trace plot: packet sequence number (KB, 0 to 160) vs. send time (0 to 10 sec).]

Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+TCP with slow-start. No bandwidth is wasted on retransmits but two seconds is spent on the slow-start, so the effective bandwidth of this part of the trace is 16 KBps, two times better than figure 3. (This is slightly misleading: unlike the previous figure, the slope of the trace is 20 KBps and the effect of the two-second offset decreases as the trace lengthens. E.g., if this trace had run a minute, the effective bandwidth would have been 19 KBps. The effective bandwidth without slow-start stays at 7 KBps no matter how long the trace.)

A good round trip time estimator, the core of the retransmit timer, is the single most important feature of any protocol implementation that expects to survive heavy load. And it is frequently botched ([26] and [13] describe typical problems).

One mistake is not estimating the variation, σR, of the round trip time, R. From queuing theory we know that R and the variation in R increase quickly with load. If the load is ρ (the ratio of average arrival rate to average departure rate), R and σR scale like (1 − ρ)⁻¹. To make this concrete: if the network is running at 75% of capacity, as the Arpanet was in last April's collapse, one should expect the round-trip time to vary by a factor of sixteen (−2σ to +2σ).

The TCP protocol specification [2] suggests estimating mean round trip time via the low-pass filter

    R ← αR + (1 − α)M

where R is the average RTT estimate, M is a round trip time measurement from the most recently acked data packet, and α is a filter gain constant with a suggested value of 0.9. Once the R estimate is updated, the retransmit timeout interval, rto, for the next packet sent is set to βR.
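To make the filter concrete, a minimal floating-point sketch in C follows. The variable names and the toy driver are ours, and 4.3BSD's real implementation uses scaled integer arithmetic rather than doubles; this is an illustration of the estimator, not a transcription of any kernel code.

    #include <stdio.h>

    #define ALPHA 0.9               /* filter gain suggested in [2] */
    #define BETA  2.0               /* fixed RTT variation allowance */

    static double R;                /* smoothed round trip time estimate */

    /* Fold measurement M (RTT of the most recently acked packet) into
     * the estimate and return the timeout for the next packet sent. */
    static double rfc793_rto(double M)
    {
        if (R == 0.0)
            R = M;                  /* initialize from the first sample */
        else
            R = ALPHA * R + (1.0 - ALPHA) * M;
        return BETA * R;
    }

    int main(void)
    {
        double m[] = { 1.0, 1.1, 0.9, 3.0, 1.2 };   /* made-up RTT samples */
        for (int i = 0; i < (int)(sizeof m / sizeof m[0]); i++)
            printf("M = %.1f  rto = %.2f\n", m[i], rfc793_rto(m[i]));
        return 0;
    }

With α = 0.9 each new measurement shifts the estimate by only a tenth of the error, so R tracks the mean well; the trouble, taken up next, is the fixed β, which allows only a small, load-independent amount of variation around that mean.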
Figure 5: Performance of an RFC793 retransmit timer

[Trace plot: packet RTT (sec., 0 to 12) vs. packet number (0 to 110).]

Trace data showing per-packet round trip time on a well-behaved Arpanet connection. The x-axis is the packet number (packets were numbered sequentially, starting with one) and the y-axis is the elapsed time from the send of the packet to the sender's receipt of its ack. During this portion of the trace, no packets were dropped or retransmitted. The packets are indicated by a dot. A dashed line connects them to make the sequence easier to follow. The solid line shows the behavior of a retransmit timer computed according to the rules of RFC793.

The parameter β accounts for RTT variation (see [5], section 5). The suggested β = 2 can adapt to loads of at most 30%. Above this point, a connection will respond to load increases by retransmitting packets that have only been delayed in transit. This forces the network to do useless work, wasting bandwidth on duplicates of packets that will eventually be delivered, at a time when it's known to be having trouble with useful work. I.e., this is the network equivalent of pouring gasoline on a fire.

We developed a cheap method for estimating variation (see appendix A)³ and the resulting retransmit timer essentially eliminates spurious retransmissions. A pleasant side effect of estimating β rather than using a fixed value is that low load as well as high load performance improves, particularly over high delay paths such as satellite links (figures 5 and 6).

Another timer mistake is in the backoff after a retransmit: if a packet has to be retransmitted more than once, how should the retransmits be spaced? For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working: exponential backoff. But a proof of this is beyond the scope of this paper.⁴

³ We are far from the first to recognize that transport needs to estimate both mean and variation. See, for example, [6]. But we do think our estimator is simpler than most.

⁴ See [8]. Several authors have shown that backoffs 'slower' than exponential are stable given finite populations and knowledge of the global traffic. However, [17] shows that nothing slower than exponential behavior will work in the general case. To feed your intuition, consider that an IP gateway has essentially the same behavior as the 'ether' in an ALOHA net or Ethernet. Justifying exponential retransmit backoff is the same as showing that no collision backoff slower than an exponential will guarantee stability on an Ethernet. Unfortunately, with an infinite user population even exponential backoff won't guarantee stability (although it 'almost' does; see [1]). Fortunately, we don't (yet) have to deal with an infinite user population.
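For comparison with the RFC793 sketch above, here is a mean-plus-deviation timer in the spirit of appendix A. This is a hedged illustration, not a transcription of the appendix: the gains (1/8 for the mean, 1/4 for the deviation) and the factor of four on the deviation are the commonly cited values, and the appendix itself works in scaled integer arithmetic. The backoff helper reflects the exponential spacing rule argued for above; its cap is an illustrative safety bound.

    #include <stdio.h>

    static double A;                /* smoothed mean RTT */
    static double D;                /* smoothed mean deviation of the RTT */

    /* Update mean and deviation from measurement M; the timeout tracks
     * the mean plus a multiple of the measured variation, instead of a
     * fixed beta times the mean. (A real implementation would seed A
     * from the first measurement rather than starting at zero.) */
    static double meanvar_rto(double M)
    {
        double err = M - A;         /* error in the current estimate */
        A += err / 8.0;             /* mean gain 1/8 */
        if (err < 0.0)
            err = -err;             /* |err| */
        D += (err - D) / 4.0;       /* deviation gain 1/4 */
        return A + 4.0 * D;
    }

    /* Exponential backoff: each retransmission of the same packet
     * doubles the interval (the 64 s cap is illustrative). */
    static double backoff(double rto)
    {
        rto *= 2.0;
        return rto > 64.0 ? 64.0 : rto;
    }

    int main(void)
    {
        double m[] = { 1.0, 1.1, 0.9, 3.0, 1.2 };
        double rto = 0.0;
        for (int i = 0; i < (int)(sizeof m / sizeof m[0]); i++)
            rto = meanvar_rto(m[i]);
        printf("rto = %.2f s, after one backoff = %.2f s\n", rto, backoff(rto));
        return 0;
    }

Because D grows with the measured variation, the timeout automatically opens up under load instead of relying on a fixed β; this is what eliminates the spurious retransmissions visible in figure 5.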
Figure 6: Performance of a Mean+Variance retransmit timer

[Same trace plot as figure 5: packet RTT (sec., 0 to 12) vs. packet number (0 to 110).]

Same data as above but the solid line shows a retransmit timer computed according to the algorithm in appendix A.

To finesse a proof, note that a network is, to a very good approximation, a linear system. That is, it is composed of elements that behave like linear operators: integrators, delays, gain stages, etc. Linear system theory says that if a system is stable, the stability is exponential. This suggests that an unstable system (a network subject to random load shocks and prone to congestive collapse⁵) can be stabilized by adding some exponential damping (exponential timer backoff) to its primary excitation (senders, traffic sources).

3 Adapting to the path: congestion avoidance

If the timers are in good shape, it is possible to state with some confidence that a timeout indicates a lost packet and not a broken timer. At this point, something can be done about (3). Packets get lost for two reasons: they are damaged in transit, or the network is congested and somewhere on the path there was insufficient buffer capacity. On most network paths, loss due to damage is rare (≪ 1%) so it is probable that a packet loss is due to congestion in the network.⁶

⁵ The phrase congestion collapse (describing a positive feedback instability due to poor retransmit timers) is again the coinage of John Nagle, this time from [23].

⁶ Because a packet loss empties the window, the throughput of any window flow control protocol is quite sensitive to damage loss. For an RFC793 standard TCP running with window w (where w is at most the bandwidth-delay product), a loss probability of p degrades throughput by a factor of (1 + 2pw)⁻¹. E.g., a 1% damage loss rate on an Arpanet path (8 packet window) degrades TCP throughput by 14%. The congestion control scheme we propose is insensitive to damage loss until the loss rate is on the order of the window equilibration length (the number of packets it takes the window to regain its original size after a loss). If the pre-loss size is w, equilibration takes roughly w²/3 packets so, for the Arpanet, the loss sensitivity threshold is about 5%. At this high loss rate, the empty window effect described above has already degraded throughput by 44% and the additional degradation from the congestion avoidance window shrinking is the least of one's problems. We are concerned that the congestion control noise sensitivity is quadratic in w but it will take at least another generation of network evolution to reach window sizes where this will be significant. If experience shows this sensitivity to be a liability, a trivial modification to the algorithm makes it linear in w. An in-progress paper explores this subject in detail.
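As a quick check on footnote 6's arithmetic, with the Arpanet numbers given there (w = 8 packets, p = 0.01), the degradation factor is

    (1 + 2pw)⁻¹ = (1 + 2 · 0.01 · 8)⁻¹ = 1/1.16 ≈ 0.86

i.e. roughly the 14% throughput loss quoted.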
A 'congestion avoidance' strategy, such as the one proposed in [15], will have two components: the network must be able to signal the transport endpoints that congestion is occurring (or about to occur), and the endpoints must have a policy that decreases utilization if this signal is received and increases utilization if the signal isn't received.

If packet loss is (almost) always due to congestion and if a timeout is (almost) always due to a lost packet, we have a good candidate for the 'network is congested' signal. Particularly since this signal is delivered automatically by all existing networks, without special modification (as opposed to [15], which requires a new bit in the packet headers and a modification to all existing gateways to set this bit).

The other part of a congestion avoidance strategy, the endnode action, is almost identical in the DEC/ISO scheme and our TCP⁷ and follows directly from a first-order time-series model of the network:⁸ say network load is measured by average queue length over fixed intervals of some appropriate length (something near the round trip time). If Lᵢ is the load at interval i, an uncongested network can be modeled by saying Lᵢ changes slowly compared to the sampling time. I.e.,

    Lᵢ = N

(N constant). If the network is subject to congestion, this zeroth order model breaks down. The average queue length becomes the sum of two terms, the N above that accounts for the average arrival rate of new traffic and intrinsic delay, and a new term that accounts for the fraction of traffic left over from the last time interval and the effect of this left-over traffic (e.g., induced retransmits):

    Lᵢ = N + γLᵢ₋₁

(These are the first two terms in a Taylor series expansion of L(t). There is reason to believe one might eventually need a three-term, second-order model, but not until the Internet has grown substantially.)

When the network is congested, γ must be large and the queue lengths will start increasing exponentially.⁹ The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy

    On congestion: Wᵢ = dWᵢ₋₁    (d < 1)

I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists).

⁷ This is not an accident: we copied Jain's scheme after hearing his presentation at [10] and realizing that the scheme was, in a sense, universal.

⁸ See any good control theory text for the relationship between a system model and admissible controls for that system. A nice introduction appears in [21], chap. 8.

⁹ I.e., the system behaves like Lᵢ ≈ γLᵢ₋₁, a difference equation with the solution Lₙ = γⁿL₀, which goes exponentially to infinity for any γ > 1.
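A minimal C sketch of the sender policy above; the decrease factor of 0.5 and the one-packet floor are assumptions for illustration, not constants taken from this passage.

    #include <stdio.h>

    #define D_FACTOR 0.5    /* decrease factor d < 1; 0.5 is an assumed choice */

    /* On each congestion signal (a retransmit timeout), shrink the window
     * multiplicatively; repeated signals give the exponential decrease
     * described above. */
    static unsigned on_congestion(unsigned window)
    {
        unsigned w = (unsigned)(window * D_FACTOR);
        return w > 1 ? w : 1;       /* never shrink below one packet */
    }

    int main(void)
    {
        unsigned w = 32;            /* window in packets, illustrative */
        for (int i = 0; i < 6; i++) {
            printf("window = %u\n", w);
            w = on_congestion(w);   /* persistent congestion */
        }
        return 0;
    }

Note how the multiplicative rule mirrors the queue model: if the queues grow like γⁿ, a window shrinking like dⁿ can throttle the sources at least as fast as the congestion builds.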