route. If this fails, it retrieves or builds a route from its forwarding table.

Observe that we've applied a broad logic to handling headers. Types of datagrams that appear frequently (fast path, destinations that miss in the route cache, common errors, multicast datagrams) are handled in the Alpha. Those that are rare (IP with options, MTU size issues, uncommon errors) are pushed off to the network processor rather than use valuable Icache instruction space to handle them. If the balance of traffic changes (say, to more datagrams with options), the balance of code between the forwarding engine and network processor can be adapted. We have the flexibility to rebalance code because the forwarding engine's peak forwarding rate of 9.8 MPPS is faster than the switch's maximum rate of 6.48 million headers per second.

Before concluding the discussion of the forwarding engine code, we would like to briefly discuss the actual instructions used, for two reasons. First, while there has been occasional speculation about what mix of instructions is appropriate for handling IP datagrams, so far as we know, no one has ever published a distribution of instructions for a particular processor. Second, there's been occasional speculation about how well RISC processors would handle IP datagrams.

Table 1 shows the instruction distribution for the fast path instructions. Instructions are grouped according to type (using the type classifications in the 21164 manual) and listed with the count, the percentage of total instructions, and whether the instructions are done in integer units E0 and E1 or both, or the floating point units FA and FM.

Probably the most interesting observation from the table is that 27% of the instructions are bit, byte or word manipulation instructions like zap. The frequency of these instructions largely reflects the fact that they are used to extract various 8-, 16- and 32-bit fields from 64-bit registers holding the IP and link-layer headers (the ext commands) and to zero byte-wide fields (zap) in preparation for inserting updated information into those registers. Oddly enough, these manipulation instructions are some of the few instructions that can only be done in the logic unit E0, which means some care must be taken in the code to avoid instruction contention for E0. (This is another reason to avoid checking the header checksum. Most of the instructions involved in computing the header checksum are instructions to extract 16-bit fields. So checking the header checksum would have further increased the contention for E0.)
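To make the role of these extract and zero-byte operations concrete, the sketch below shows, in C, the kind of work they perform on the fast path: pulling a byte out of a 64-bit register image of the header, clearing a byte before an updated value is merged back in, updating the checksum incrementally, and hashing the destination into the route cache. The offsets, the hash function, the cache size and all names are illustrative assumptions, not the MGR's actual code.

    #include <stdint.h>

    #define ROUTE_CACHE_SLOTS 4096               /* assumed cache size */

    /* Extract the byte at offset b of a 64-bit word, as an extbl-style
       instruction does in one step. */
    static inline uint8_t ext_byte(uint64_t w, int b) {
        return (uint8_t)(w >> (8 * b));
    }

    /* Zero the byte at offset b (the zap idiom), so an updated value can
       be merged back in with a bis-style OR. */
    static inline uint64_t zap_byte(uint64_t w, int b) {
        return w & ~((uint64_t)0xff << (8 * b));
    }

    /* Incremental checksum update (RFC 1624 style) when one 16-bit word of
       the header changes, e.g. after the TTL is decremented, so the header
       need not be re-summed in full. */
    static inline uint16_t cksum_update(uint16_t cksum, uint16_t old16, uint16_t new16) {
        uint32_t sum = (uint16_t)~cksum + (uint16_t)~old16 + new16;
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    /* Fold the destination address into an index into the route cache. */
    static inline unsigned route_cache_index(uint32_t dst) {
        return (dst ^ (dst >> 16)) & (ROUTE_CACHE_SLOTS - 1);
    }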
One might suspect that testing bits in the header is a large part of the cost of forwarding, given that bit operations represent 28% of the code. In truth, only two instructions (both xors) represent bit tests on the headers. bis is used to assemble header fields, and most of the remaining instructions are used to update the header checksum and compute a hash into the route cache.

The floating point operations, while they account for 12% of the instructions, actually have no impact on performance. They are used to count SNMP MIB values and are interleaved with integer instructions so they can execute in one cycle. The presence of four fnop instructions reflects the need to pad a group of two integer instructions and one floating point instruction so the Alpha can process the four instructions in one cycle.

Finally, observe that there is a minimum of instructions to load and store data. There are four loads (ldq) to load the header and two loads to load the cached forwarding table entry. Then there are four stores (stq) to store the updated header and a call to create a write memory barrier (wmb) to ensure writes are sequenced.

3.3.1. Issues in Forwarding Engine Design

To close the presentation of the forwarding engine, we address two frequently asked questions about forwarding engine design in general and the MGR's forwarding engine in particular.

3.3.1.1. Why Not Use an ASIC?

The MGR forwarding engine uses a processor to make forwarding decisions. Many people observe that the IPv4 specification is very stable and ask whether it would be more cost effective to implement the forwarding engine in an ASIC.

The answer depends on where the router might be deployed. In a corporate LAN, it turns out that IPv4 is indeed a fairly static protocol and an ASIC-based forwarding engine is appropriate. But in an ISP backbone, the environment that the MGR was designed for, IPv4 is constantly evolving in subtle ways that require programmability.

3.3.1.2. How Effective is a Route Cache?

The MGR uses a cache of recently seen destinations. As the Internet's backbones become increasingly heavily used and carry traffic of a greater number of parallel conversations, is such a cache likely to continue to be useful?

In the MGR, a cache hit in the processor's on-chip cache is at least a factor of five less expensive than a full route lookup in off-chip memory, so a cache is valuable provided it achieves at least a modest hit rate. Even with an increasing number of conversations, it appears that packet trains [15] will continue to ensure that there is a strong chance that two datagrams arriving close together will be headed for the same destination. A modest hit rate seems assured, and thus we believe using a cache makes sense.
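As a rough sketch of how such a destination cache sits in front of the full lookup, consider the following C fragment. The entry layout, the hash, the cache size and the full_route_lookup() routine are hypothetical placeholders chosen for illustration; they are not the MGR's actual data structures.

    #include <stdint.h>

    #define CACHE_SLOTS 4096                     /* assumed cache size */

    struct route_entry {
        uint32_t dst;        /* destination address this entry matches */
        uint16_t dest_tag;   /* identifies outbound card, line and flow */
        uint16_t valid;
    };

    static struct route_entry route_cache[CACHE_SLOTS];

    /* Slow path: the full (off-chip) forwarding table lookup. */
    extern struct route_entry full_route_lookup(uint32_t dst);

    static struct route_entry route_for(uint32_t dst) {
        struct route_entry *e = &route_cache[(dst ^ (dst >> 16)) & (CACHE_SLOTS - 1)];
        if (e->valid && e->dst == dst)
            return *e;                   /* hit: served from on-chip cache */
        *e = full_route_lookup(dst);     /* miss: full lookup, then cache it */
        e->valid = 1;
        return *e;
    }

Because datagrams arriving close together frequently share a destination, most invocations take the cheap first branch.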
Nonetheless, we believe the days of caches are numbered because of the development of new lookup algorithms, in particular the binary hash scheme [34]. The binary hash scheme takes a fixed number of memory accesses, determined by the address length, not the number of keys. As a result, it is fairly easy to inexpensively pipeline route lookups using the binary hash algorithm. The pipeline could be placed alongside the inbound FIFO, such that a header arrived at the processor with a pointer to its route. In such an architecture, no cache would be needed.

3.4. Abstract Link Layer Header

As noted earlier, one innovation for keeping the forwarding engine and its code simple is the abstract link-layer header, which summarizes link-layer information for the forwarding engine and line cards. Figure 3 shows the abstract link-layer header formats for inbound (line card to forwarding engine) and outbound (forwarding engine to line card) headers. The different formats reflect the different needs of reception and transmission.

The inbound abstract header contains information that the forwarding engine code needs to confirm the validity of the IP header and the route chosen for that header. For instance, the link-layer length is checked for consistency against the length in the IP header. The link-layer identifier, source card and source port are used to determine if an ICMP redirect must be sent. (ICMP redirects are sent when a datagram goes in and out the same interface. The link-layer identifier is used in cases where multiple virtual interfaces may co-exist on one physical interface port, in which case a datagram may go in and out the same physical interface, but different virtual interfaces, without causing a redirect.)

The outbound abstract header contains directions to the line cards about how datagram transmission is to be handled. The important new fields are the Multicast Count, which indicates how many copies of the packet the inbound line card needs to make, and the Destination Tag, which tells the outbound line card what destination address to put on the packet, what line to send the packet out on, and what flow to assign the packet to. For multicast packets, the Destination Tag tells the outbound line card what set of interfaces to send the packet out. [4]

[4] For some types of interfaces, such as ATM, this may require the outbound line card to generate different link layer headers for each line. For others, such as Ethernet, all the interfaces can share the same link layer header.
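To make the two formats concrete, the sketch below gives hypothetical C layouts built only from the fields named above; the field widths, ordering and names are assumptions for illustration, not the actual formats shown in Figure 3.

    #include <stdint.h>

    /* Inbound abstract header: line card -> forwarding engine. */
    struct abstract_hdr_in {
        uint16_t link_length;   /* checked against the length in the IP header */
        uint8_t  link_id;       /* distinguishes virtual interfaces on a port */
        uint8_t  source_card;   /* with source_port, used for redirect checks */
        uint8_t  source_port;
    };

    /* Outbound abstract header: forwarding engine -> line card. */
    struct abstract_hdr_out {
        uint8_t  multicast_count; /* copies the inbound line card must make */
        uint16_t dest_tag;        /* selects destination address, line and flow */
    };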
4. The Switched Bus

Most routers today use a conventional shared bus. The MGR instead uses a 15-port switch to move data between function cards. The switch is a point-to-point switch (i.e., it effectively looks like a crossbar, connecting one source with one destination).

The major limitation of a point-to-point switch is that it does not support the one-to-many transfers required for multicasting. We took a simple solution to this problem. Multicast packets are copied multiple times, once to each outbound line card. The usual concern about making multiple copies is that it reduces effective switch throughput. For instance, if every packet, on average, is sent to two boards, the effective switch bandwidth will be reduced by half. However, even without multicast support this scheme is substantially better than a shared bus. [5]

The MGR switch is a variant of a now fairly common type of switch. It is an input-queued switch in which each input keeps a separate FIFO for, and bids separately for, each output. Keeping track of traffic for each output separately means the switch does not suffer Head-of-Line blocking [20], and it has been shown by simulation [30] and more recently proved [21] that such a switch can achieve 100% throughput. The key design choice in this style of switch is its allocation algorithm: how one arbitrates among the various bids. The MGR arbitration seeks to maximize throughput, at the expense of predictable latency. (This tradeoff is the

[5] The basic difference is that a multicast transfer on a shared bus would monopolize the bus, even if only two outbound line cards were getting the multicast. On the switch, those line cards not involved in the multicast can concurrently make transfers among themselves while the multicast transactions are going on. The fact that our switch copies multiple times makes it less effective than some other switch designs (e.g. [23]), but still much better than a bus.
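The per-output queueing described above, often called virtual output queueing, can be sketched in C as follows; the types, queue depth and names are illustrative assumptions, and the arbitration step is reduced to a naive placeholder rather than the MGR's actual allocation algorithm.

    #include <stdint.h>
    #include <stdbool.h>

    #define NPORTS 15          /* the MGR switch has 15 ports */
    #define QDEPTH 64          /* assumed per-queue depth */

    /* Each input keeps one FIFO per output, so a transfer waiting for a busy
       output never blocks transfers bound for idle outputs (no HOL blocking). */
    struct voq {
        uint32_t pkt[QDEPTH];  /* handles of queued transfers */
        int head, tail;
    };

    static struct voq queues[NPORTS][NPORTS];   /* [input][output] */

    /* An input bids for an output whenever the corresponding FIFO is non-empty. */
    static bool bids(int input, int output) {
        const struct voq *q = &queues[input][output];
        return q->head != q->tail;
    }

    /* Placeholder arbitration: pair each output with the first free bidding
       input. The real allocation algorithm, discussed in the text, is what
       determines the switch's throughput and latency properties. */
    static void arbitrate(int grant[NPORTS]) {
        bool input_busy[NPORTS] = { false };
        for (int out = 0; out < NPORTS; out++) {
            grant[out] = -1;
            for (int in = 0; in < NPORTS; in++) {
                if (!input_busy[in] && bids(in, out)) {
                    grant[out] = in;        /* one input per output per cycle */
                    input_busy[in] = true;
                    break;
                }
            }
        }
    }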