A Fifty Gigabit Per Second IP Router

C. Partridge, P. Carvey, E. Burgess, I. Castineyra, T. Clarke, L. Graham, M. Hathaway, P. Herman, A. King, S. Kohlami, T. Ma, J. Mcallen, T. Mendez, W. Milliken, R. Osterlind, R. Pettyjohn, J. Rokosz, J. Seeger, M. Sollins, S. Storch, B. Tober, G. Troxel, D. Waitzman and S. Winterble

BBN Technologies (a part of GTE Corporation)

Abstract

Aggressive research on gigabit per second networks has led to dramatic improvements in network transmission speeds. One result of these improvements has been to put pressure on router technology to keep pace. This paper describes a router, nearly completed, which is more than fast enough to keep up with the latest transmission technologies. The router has a backplane speed of 50 Gb/s and can forward tens of millions of packets per second.

1. Introduction

Transmission link bandwidths keep improving, at a seemingly inexorable rate, as the result of research in transmission technology [26]. Simultaneously, expanding network usage is creating an ever-increasing demand that can only be served by these higher bandwidth links. (In 1996 and 1997, Internet Service Providers generally reported that the number of customers was at least doubling annually, and that per-customer bandwidth usage was also growing, in some cases, by 15% per month.)

Unfortunately, transmission links alone do not make a network. To achieve an overall improvement in networking performance, other components such as host adapters, operating systems, switches, multiplexors, and routers also need to get faster. Routers have often been seen as one of the lagging technologies. The goal of the work described here is to show that routers can keep pace with the other technologies and are fully capable of driving the new generation of links (OC-48c at 2.4 Gb/s).

A multigigabit router (a router capable of moving data at several gigabits per second or faster) needs to achieve three goals. First, it needs to have enough internal bandwidth to move packets between its interfaces at multigigabit rates. Second, it needs enough packet processing power to forward several million packets per second (MPPS). A good rule of thumb, based on the Internet's average packet size of approximately 1,000 bits, is that for every gigabit per second (Gb/s) of bandwidth, a router needs a million packets per second (1 MPPS) of forwarding power.1 Third, the router needs to conform to a set of protocol standards. For IPv4, this set of standards is summarized in the Internet Router Requirements [3]. Our router achieves all three goals (but for one minor variance from the IPv4 router requirements, discussed below).

This paper presents our multigigabit router, called the MGR, which is nearly completed. This router achieves forwarding rates of up to 32 million packets per second with 50 Gb/s of full-duplex backplane capacity.2 About a quarter of the backplane capacity is lost to overhead traffic, so the packet rate and effective bandwidth are balanced. Both rate and bandwidth are roughly 2 to 10 times faster than the high-performance routers available today.

* Most of the work described in this paper is funded by the Defense Advanced Research Projects Agency (DARPA). The author list includes all major contributors to the final design of the router. Partridge and Carvey developed the router's architecture and led its implementation.

2. Overview of the Router Architecture

A router is a deceptively simple piece of equipment.
At minimum, it is a collection of network interfaces, some sort of bus or connection fabric connecting those interfaces, and some software or logic that determines how to route packets among those interfaces.

1 See [25]. Some experts argue for more or less packet processing power. Those arguing for more power note that a TCP/IP datagram containing an ACK but no data is 320 bits long. Link-layer headers typically increase this to approximately 400 bits. So if a router were to handle only minimum-sized packets, a gigabit would represent 2.5 million packets. On the other side, network operators have noted a recent shift in the average packet size to nearly 2,000 bits. If this change is not a fluke, then a gigabit would represent only 0.5 million packets.

2 Recently some companies have taken to summing switch bandwidth in and out of the switch; in that case, this router is a 100 Gb/s router.
Within that simple description, however, lies a number of complexities. (As an illustration of the complexities, consider the fact that the Internet Engineering Task Force's Requirements for IP Version 4 Routers [3] is 175 pages long and cites over one hundred related references and standards.) In this section, we present an overview of the MGR design and point out its major and minor innovations. After this section, the rest of the paper discusses the details of each module.

2.1. Design Summary

A simplified outline of the MGR design is shown in Figure 1, which illustrates the data processing path for a stream of packets entering from the line card on the left and exiting from the line card on the right.

The MGR consists of multiple line cards (each supporting one or more network interfaces) and forwarding engine cards, all plugged into a high speed switch. When a packet arrives at a line card, its header is removed and passed through the switch to a forwarding engine. (The remainder of the packet remains on the inbound line card.) The forwarding engine reads the header to determine how to forward the packet and then updates the header and sends the updated header and its forwarding instructions back to the inbound line card. The inbound line card integrates the new header with the rest of the packet, and sends the entire packet to the outbound line card for transmission.

Not shown in Figure 1 but an important piece of the MGR is a control processor, called the network processor, that provides basic management functions such as link up/down management and generation of forwarding engine routing tables for the router.

2.2. Major Innovations

There are five novel elements of this design. This section briefly presents the innovations. More detailed discussions, when needed, can be found in the detailed sections following.

First, each forwarding engine has a complete set of the routing tables. Historically, routers have kept a central master routing table and the satellite processors each keep only a modest cache of recently used routes. If a route was not in a satellite processor's cache, it would request the relevant route from the central table. At high speeds, the central table can easily become a bottleneck, because the cost of retrieving a route from the central table is many times (as much as 1,000 times) more expensive than actually processing the packet header. So the solution is to push the routing tables down into each forwarding engine. Since the forwarding engines only require a summary of the data in the route (in particular, next hop information), their copies of the routing table, called forwarding tables, can be very small (as little as 100KB for about 50K routes [6]).

Second, the design uses a switched backplane. Until very recently the standard router used a shared bus, rather than a switched backplane. However, to go fast, one really needs the parallelism of a switch. Our particular switch was custom designed to meet the needs of an IP router.

Third, the design places forwarding engines on boards distinct from line cards. Historically, forwarding processors have been placed on the line cards. We chose to separate them for several reasons. One reason was expediency; we were not sure we had enough board real-estate to fit both forwarding engine functionality and line card functions on the target card size. Another set of reasons involves flexibility.
There are well-known industry cases of router designers crippling their routers by putting too weak a processor on the line card, and effectively throttling the line card's interfaces to the processor's speed. Rather than risk this mistake, we built the fastest forwarding engine we could, and allow as many (or few) interfaces as is appropriate to share the use of the forwarding engine. This decision had the additional benefit of making support for Virtual Private Networks very simple - we can dedicate a forwarding engine to each virtual network and ensure packets never cross (and risk confusion) in the forwarding path.

Placing forwarding engines on separate cards led to a fourth innovation. Because the forwarding engines are separate from the line cards, they may receive packets from line cards that use different link layers. At the same time, correct IP forwarding requires some information from the link-layer packet header (largely used for consistency checking). However, for fast forwarding, one would prefer that the forwarding engines not have to have code to recognize, parse and load a diversity of different link-layer headers (each of which may have a different length). Our solution was to require all line cards to support the ability to translate their local link-layer headers to and from an abstract link-layer header format, which contained the information required for IP forwarding.
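To make this concrete, the following is a minimal sketch of the kind of information such an abstract header might carry. The field names and layout are illustrative assumptions only, not the MGR's actual format; the only detail taken from this paper is that the header handed to the forwarding engine includes an 8-byte generic link-layer header (see Section 3.2).

    /* Hypothetical 8-byte abstract link-layer header.  The field choices
     * here are assumptions for illustration; the MGR defines its own
     * format.  Each line card translates its native header (Ethernet,
     * SONET/PPP, etc.) to and from this form, so the forwarding engine
     * never parses link-specific headers. */
    #include <stdint.h>

    struct abstract_link_header {
        uint8_t  inbound_card;       /* line card the packet arrived on       */
        uint8_t  inbound_interface;  /* interface on that card                */
        uint16_t frame_length;       /* link-layer frame length, for checking */
        uint16_t flags;              /* e.g. link multicast, checksum status  */
        uint16_t reserved;           /* pad to 8 bytes                        */
    };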
The fifth innovation was to include quality of service processing in the router. We wanted to demonstrate that it was possible to build a cutting edge router that included line-speed quality of service. We chose to split the quality of service function. The forwarding engine simply classifies packets, by assigning a packet to a flow, based on the information in the packet header. The actual scheduling of the packet is done by the outbound line card, in a specialized processor called the QoS Processor.

3. The Forwarding Engines

The forwarding engines are responsible for deciding where to forward each packet. When a line card receives a new packet, it sends the packet header to a forwarding engine. The forwarding engine then determines how the packet should be routed.

The development of our forwarding engine design was influenced by the Bell Labs router [2], which, although it has a different architecture, had to solve similar problems.

3.1. A Brief Description of the Alpha 21164 Processor

At the heart of each forwarding engine is a 415 MHz Digital Equipment Corporation Alpha 21164 processor. Since much of the forwarding engine board is built around the Alpha, this section summarizes key features of the Alpha. The focus in this section is on features that impact how the Alpha functions in the forwarding engine. A more detailed description of the 21164 and the Alpha architecture in general can be found in [1, 31].

The Alpha 21164 is a 64-bit, 32-register, super-scalar RISC processor. There are two integer logic units called E0 and E1 and two floating point units called FA and FM. The four logic units are distinct. While most integer instructions (including loads) can be done in either E0 or E1, a few important operations, most notably byte extractions, shift operations, and stores, can only be done in E0. Floating point operations are more restricted, with all but one instruction limited to either FA or FM. In each cycle, the Alpha attempts to schedule one instruction to each logic unit. For integer register-to-register operations, results are almost always available in the next instruction cycle. Floating point results typically take several cycles. The Alpha processes instructions in groups of four instructions (hereafter called quads). All four instructions in a quad must successfully issue before any instructions in the next quad are issued. In practice this means the programmer's goal is to place either two pairs of integer instructions that can issue concurrently, or a pair of integer instructions plus a pair of floating point instructions, in each quad.

The 21164 has 3 internal caches, plus support for an external cache. The Instruction and Data caches (Icache and Dcache) are the first level caches and are 8 KB each in size. The size of the Icache is important because we want to run the processor at the maximum instruction rate and require that all code fits into the Icache. Since Alpha instructions are 32 bits long, this means the Icache can store 2,048 instructions, more than enough to do key routing functions. If there are no errors in branch prediction, there will be no bubbles (interruptions in processing) when using instructions from the Icache. Our software effectively ignores the Dcache and always assumes that the first load of a 32-byte cache line misses.

There is a 96KB on-chip secondary cache (Scache) which caches both code and data. Loads from the Scache take a minimum of 8 cycles to complete, depending on the state of the memory management hardware in the processor.
We use the Scache as a cache of recent routes. Since each route entry takes 64 bits, we have a maximum cache size of approximately 12,000 routes. Studies of locality in packet streams at routers suggest a cache this size should yield a hit rate well in excess of 95% [11, 15, 13]. Our own tests with a traffic trace from FIX West (a major interexchange point in the Internet) suggest a 12,000 entry cache will have a hit rate in excess of 95%.

The tertiary cache (Bcache) is an external memory of several megabytes managed by the processor. Loads from the Bcache can take a long time. While the Bcache uses 21 ns memory, the total time to load a 32-byte cache line is 44 ns. There is also a system bus, but it is far too slow for this application and shares a single 128-bit data path with the Bcache, so we designed the forwarding engine's memory system to bypass the system bus interface.

A complete forwarding table is kept in the Bcache. In our design, the Bcache is 16 MB, divided into two 8 MB banks. At any time, one bank is acting as the Bcache and the other bank is being updated by the network processor via the PCI bus. When the forwarding table is updated, the network processor instructs the Alpha to change memory banks and invalidate its internal caches.
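This bank-swap update can be summarized in a short sketch. The routine names below (write_bank_via_pci, signal_alpha_bank_switch, select_bcache_bank, invalidate_internal_caches) are assumed placeholders for whatever mechanism the hardware and firmware actually provide; only the sequence of steps follows the description above.

    /* Sketch of the double-banked forwarding-table update.  All names are
     * hypothetical; the real MGR interfaces differ. */
    #include <stddef.h>

    enum { BANK_A = 0, BANK_B = 1 };
    static volatile int active_bank = BANK_A;   /* bank the Alpha currently reads */

    extern void write_bank_via_pci(int bank, const void *table, size_t len);
    extern void signal_alpha_bank_switch(int bank);  /* e.g. interrupt or mailbox */
    extern void select_bcache_bank(int bank);        /* retarget Bcache accesses  */
    extern void invalidate_internal_caches(void);    /* drop stale cached routes  */

    /* Network processor side: build the new table in the idle 8 MB bank,
     * then ask the Alpha to switch over. */
    void install_forwarding_table(const void *table, size_t len)
    {
        int idle = 1 - active_bank;
        write_bank_via_pci(idle, table, len);
        signal_alpha_bank_switch(idle);
    }

    /* Alpha side: switch banks and invalidate, so subsequent lookups (and
     * the Scache-resident route cache) are rebuilt from the new table. */
    void on_bank_switch(int new_bank)
    {
        select_bcache_bank(new_bank);
        invalidate_internal_caches();
        active_bank = new_bank;
    }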
The divided Bcache highlights that we are taking an unusual approach: using a generic processor as an embedded processor. Readers may wonder why we didn't choose an embedded processor. The reason is that, even with the inconvenience of the Bcache, the Alpha is a very good match for this task. As the section on software illustrates below, forwarding an IP datagram is a small process of reading a header, processing the header, looking up a route and writing out the header plus routing information. The Alpha has four properties that make it a good fit: (1) very high clock speed, so forwarding code is executed quickly; (2) a large instruction cache, so the instructions can be done at peak rate; (3) a very large on-chip cache (the Scache), so that the routing lookup will probably hit in the on-chip route cache (avoiding accesses to slow external memory); and (4) sufficient control on read and write sequencing and buffer management to ensure that we could manage how data flowed through the chip.

3.2. Forwarding Engine Hardware Operation

Once headers reach the forwarding engine, they are placed in a request FIFO queue for processing by the Alpha. The Alpha runs a loop which simply reads from the front of the FIFO, examines the header to determine how to route the packet, and then makes one or more writes to inform the inbound and outbound line cards how to handle the packet.

Conceptually, this process is illustrated in Figure 2. A packet header has reached the front of the request FIFO. The header includes the first 24 or 56 bytes of the packet plus an 8-byte generic link-layer header and a packet identifier which identifies both the packet and the interface it is buffered on. The Alpha software is expected to read at least the first 32 bytes of the packet header. When the packet is read, the packet identifier is copied into a holding buffer. When the Alpha writes out the updated header, the packet identifier is taken from the holding buffer and combined with the data from the Alpha to determine where the updated header and packet are sent.

The Alpha software is free to read and write more than 32 bytes of the packet header (if present) and can, if it chooses, read and write the packet identifier registers as well. The software must read and write this information if it is reading and writing packet headers in anything but FIFO order. The motivation for the holding buffer is to minimize the amount of data that must go through the Alpha. By allowing software to avoid reading the packet ID, we minimize the load on the Alpha's memory interface.

When the software writes out the updated header, it indicates which outbound interface to send the packet to by writing to one of 241 addresses (240 addresses, one for each of 16 possible interfaces on each of 15 line cards, plus one address indicating the packet should be discarded). The hardware actually implements these FIFOs as a single buffer and grabs the dispatching information from a portion of the FIFO address.

In addition to the dispatching information in the address lines, the updated header contains some key routing information. In particular, it contains the outbound link-layer address and a flow identifier, which is used by the outbound line card to schedule when the packet is actually transmitted.

A side comment about the link-layer address is in order. Many networks have dynamic schemes for mapping IP addresses to link-layer addresses. A good example is the Address Resolution Protocol, used for Ethernet [28].
If a router gets a datagram to an IP address whose Ethernet address it doesn't know, it is supposed to send an ARP message and hold the datagram until it gets an ARP reply with the necessary Ethernet address. In the pipelined MGR architecture, that approach doesn't work: we have no convenient place in the forwarding engine to store datagrams awaiting an ARP reply. Rather, we follow a two-part strategy. First, at a low frequency, the router ARPs for all possible addresses on each interface, to collect link-layer addresses for the forwarding tables. Second, datagrams for which the destination link-layer address is unknown are passed to the network processor, which does the ARP and, once it gets the ARP reply, forwards the datagram and incorporates the link-layer address into future forwarding tables.

3.3. Forwarding Engine Software

The forwarding engine software is a few hundred lines of code, of which 85 instructions are executed in the common case. These instructions execute in no less than 42 cycles,3 which translates to a peak forwarding speed of 9.8 MPPS per forwarding engine. This section sketches the structure of the code and then discusses some of the properties of this code.

The fast path through the code can be roughly divided up into three stages, each of which is about 20 to 30 instructions (10 to 15 cycles) long.

3 The instructions can take somewhat longer depending on the pattern of packets received and the resulting branch predictions.
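In rough outline, the fast path described in the next few paragraphs can be pictured with the following C sketch. It is a hypothetical rendering for illustration only, not the MGR's actual code: the structures and helper routines (route_cache_lookup, full_table_lookup, send_to_line_card, punt_to_network_processor) are assumed names, and length checks, local delivery, and ICMP generation are omitted. Only the overall shape follows the text: validate the header, probe the route cache, fall back to the Bcache forwarding table on a miss, update the TTL and checksum incrementally, and write the header back with its dispatch information.

    /* Hypothetical C rendering of the three-stage fast path.  Structure
     * layouts and helper names are assumptions, not the MGR's code. */
    #include <stdint.h>
    #include <arpa/inet.h>

    struct ipv4_header {              /* fields in network byte order */
        uint8_t  version_ihl;
        uint8_t  tos;
        uint16_t total_length;
        uint16_t id;
        uint16_t frag_off;
        uint8_t  ttl;
        uint8_t  protocol;
        uint16_t checksum;
        uint32_t src;
        uint32_t dst;
    };

    struct route {                    /* assumed route-cache entry contents   */
        uint32_t dst;                 /* destination this entry was built for */
        uint16_t outbound_port;       /* line card/interface to dispatch to   */
        uint16_t flow_id;             /* classifier used by the QoS processor */
    };

    extern struct route *route_cache_lookup(uint32_t dst); /* Scache-resident cache   */
    extern struct route *full_table_lookup(uint32_t dst);  /* Bcache forwarding table */
    extern void send_to_line_card(uint16_t port, uint16_t flow_id,
                                  const struct ipv4_header *hdr);
    extern void punt_to_network_processor(const struct ipv4_header *hdr);

    void forward(struct ipv4_header *hdr)
    {
        /* Stage 1: basic checks and route-cache probe. */
        if (hdr->version_ihl != 0x45) {          /* IPv4, no options */
            punt_to_network_processor(hdr);      /* options or errors: slow path */
            return;
        }
        struct route *rt = route_cache_lookup(hdr->dst);

        /* Stage 2: confirm the cached route; check and update TTL/checksum. */
        if (rt == NULL || rt->dst != hdr->dst)
            rt = full_table_lookup(hdr->dst);    /* cache miss: search the Bcache */
        if (rt == NULL || hdr->ttl <= 1) {
            punt_to_network_processor(hdr);      /* no route, or TTL expired */
            return;
        }
        hdr->ttl--;
        uint32_t sum = hdr->checksum;
        sum += htons(0x0100);                    /* TTL is the high byte of its
                                                    16-bit word, so decrementing
                                                    TTL adds 0x0100 to the checksum */
        hdr->checksum = (uint16_t)(sum + (sum >> 16));

        /* Stage 3: write the updated header back with dispatch information. */
        send_to_line_card(rt->outbound_port, rt->flow_id, hdr);
    }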
The first stage (a) does basic error checking to confirm that the header is indeed from an IPv4 datagram; (b) confirms that the packet and header lengths are reasonable; (c) confirms that the IPv4 header has no options; (d) computes the hash offset into the route cache and loads the route; and (e) starts loading the next header. These five activities are done in parallel in intertwined instructions.

The second stage checks whether the cached route matches the destination of the datagram. If not, the code jumps to an extended lookup which examines the routing table in the Bcache. Then the code checks the IP TTL field, computes the updated TTL and IP checksum, and determines if the datagram is for the router itself. The TTL and checksum are the only header fields that normally change, and they must not be changed if the datagram is destined for the router.

In the third stage, the updated TTL and checksum are put in the IP header. The necessary routing information is extracted from the forwarding table entry and the updated IP header is written out along with link-layer information from the forwarding table. The routing information includes the flow classifier. Currently we simply associate classifiers with destination prefixes, but one nice feature of the route-lookup algorithm we use [34] is that it scales as the log of the key size, so incorporating additional information like the IP Type of Service field into the lookup key typically has only a modest effect on performance.

This code performs all the steps required by the Internet Router Requirements [3] except one: it doesn't check the IP header checksum, but simply updates it. The update algorithm is safe [4, 18, 29]: if the checksum is bad, it will remain bad after the update. The reason for not checking the header checksum is that, in the best code we have been able to write, computing it would require 17 instructions and, due to consumer-producer relationships, those instructions would have to be spread over a minimum of 14 cycles. Assuming we can successfully interleave the 17 instructions among other instructions in the path, at minimum they still increase the time to perform the forwarding code by 9 cycles, or about 21%. This is a large penalty to pay to check for a rare error that can be caught end-to-end. Indeed, for this reason, IPv6 does not include a header checksum [8].

Certain datagrams are not handled in the fast path code. These datagrams can be divided into five categories:

1. Headers whose destination misses in the route cache. This is the most common case. In this case the processor searches the forwarding table in the Bcache for the correct route, sends the datagram to the interface indicated in the routing entry, and generates a version of the route for the route cache. The routing table uses the binary hash scheme developed by Waldvogel, Varghese, Turner and Plattner [34]. (We also hope to experiment with the algorithm described in [6], developed at Lulea University.) Since the forwarding table contains prefix routes and the route cache is a cache of routes for particular destinations, the processor has to convert the forwarding table entry into an appropriate, destination-specific cache entry.

2. Headers with errors. Generally, the forwarding engine will instruct the inbound line card to discard the errored datagram. In some cases, the forwarding engine will generate an ICMP message.
Templates of some common ICMP messages such as the Time Exceeded message are kept in the Alpha's Bcache and these can be combined with the IP header to generate a valid ICMP message.

3. Headers with IP options. Most headers with options are sent to the network processor for handling, simply because option parsing is slow and expensive. However, should an IP option become widely used, the forwarding code could be modified to handle the option in a special piece of code outside the fast path.

4. Datagrams that must be fragmented. Rather than requiring line cards to support fragmentation logic, we do fragmentation on the network processor. Now that IP MTU discovery [22] is prevalent, fragmentation should be rare.

5. Multicast datagrams. Multicast datagrams require special routing, since the routing of the datagram is dependent on the source address and the inbound link as well as the multicast destination. Furthermore, the processor may have to write out multiple copies of the header to dispatch copies of the datagram to different line cards. All this work is done in separate multicasting code in the processor. Multicast routes are stored in a separate multicast forwarding table. The code checks to see if the destination is a multicast destination, and if so, looks for a multicast