mesh without wraparound connections is 2(√p - 1) for the two nodes at diagonally opposed corners, and that of a wraparound mesh is 2⌊√p/2⌋. The diameter of a hypercube-connected network is log p, since two node labels can differ in at most log p positions. The diameter of a complete binary tree is 2 log((p + 1)/2) because the two communicating nodes may be in separate subtrees of the root node, and a message might have to travel all the way to the root and then down the other subtree.

Connectivity

The connectivity of a network is a measure of the multiplicity of paths between any two processing nodes. A network with high connectivity is desirable, because it lowers contention for communication resources. One measure of connectivity is the minimum number of arcs that must be removed from the network to break it into two disconnected networks. This is called the arc connectivity of the network. The arc connectivity is one for linear arrays, as well as tree and star networks. It is two for rings and 2-D meshes without wraparound, four for 2-D wraparound meshes, and d for d-dimensional hypercubes.

Bisection Width and Bisection Bandwidth

The bisection width of a network is defined as the minimum number of communication links that must be removed to partition the network into two equal halves. The bisection width of a ring is two, since any partition cuts across only two communication links. Similarly, the bisection width of a two-dimensional p-node mesh without wraparound connections is √p, and with wraparound connections it is 2√p. The bisection width of a tree and a star is one, and that of a completely-connected network of p nodes is p^2/4. The bisection width of a hypercube can be derived from its construction. We construct a d-dimensional hypercube by connecting corresponding nodes of two (d - 1)-dimensional hypercubes. Since each of these subcubes contains 2^(d-1) or p/2 nodes, at least p/2 communication links must cross any partition of a hypercube into two subcubes (Problem 2.15).

The number of bits that can be communicated simultaneously over a link connecting two nodes is called the channel width. Channel width is equal to the number of physical wires in each communication link. The peak rate at which a single physical wire can deliver bits is called the channel rate. The peak rate at which data can be communicated between the ends of a communication link is called channel bandwidth. Channel bandwidth is the product of channel rate and channel width.

Table 2.1. A summary of the characteristics of various static network topologies connecting p nodes.

Network                   Diameter            Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected      1                   p^2/4             p - 1              p(p - 1)/2
Star                      2                   1                 1                  p - 1
Complete binary tree      2 log((p + 1)/2)    1                 1                  p - 1
Linear array              p - 1               1                 1                  p - 1
2-D mesh, no wraparound   2(√p - 1)           √p                2                  2(p - √p)
2-D wraparound mesh       2⌊√p/2⌋             2√p               4                  2p
Hypercube                 log p               p/2               log p              (p log p)/2
Wraparound k-ary d-cube   d⌊k/2⌋              2k^(d-1)          2d                 dp

The bisection bandwidth of a network is defined as the minimum volume of communication allowed between any two halves of the network. It is the product of the bisection width and the channel bandwidth. Bisection bandwidth of a network is also sometimes referred to as cross-section bandwidth.

Cost

Many criteria can be used to evaluate the cost of a network. One way of defining the cost of a network is in terms of the number of communication links or the number of wires required by the network. Linear arrays and trees use only p - 1 links to connect p nodes. A d-dimensional wraparound mesh has dp links. A hypercube-connected network has (p log p)/2 links.
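Because every entry in Table 2.1 is a simple function of p, the formulas are easy to evaluate directly. The following sketch (C is used here only for illustration; the function names and the sample channel parameters are assumptions, not part of the text) prints a few of the table's entries and combines channel rate and channel width into a bisection bandwidth.

#include <math.h>
#include <stdio.h>

/* Diameter, bisection width, arc connectivity and link count of a
 * d-dimensional hypercube with p = 2^d nodes. */
static void hypercube_metrics(int p) {
    int d = (int)round(log2(p));              /* log p, assuming p = 2^d */
    printf("hypercube:  diameter=%d  bisection=%d  arc-conn=%d  links=%d\n",
           d, p / 2, d, (p * d) / 2);
}

/* The same metrics for a 2-D wraparound mesh with p = k*k nodes. */
static void wraparound_mesh_metrics(int p) {
    int k = (int)round(sqrt(p));              /* sqrt(p), assuming p = k^2 */
    printf("2-D torus:  diameter=%d  bisection=%d  arc-conn=%d  links=%d\n",
           2 * (k / 2), 2 * k, 4, 2 * p);
}

/* Bisection bandwidth = bisection width x channel rate x channel width. */
static double bisection_bandwidth(int bisection_width,
                                  double channel_rate_bps, int channel_width) {
    return bisection_width * channel_rate_bps * channel_width;
}

int main(void) {
    hypercube_metrics(64);       /* a 6-dimensional hypercube */
    wraparound_mesh_metrics(64); /* an 8 x 8 wraparound mesh  */
    /* e.g. a bisection width of 32, 32 wires per link, 1 Gbit/s per wire */
    printf("bisection bandwidth = %.0f bit/s\n",
           bisection_bandwidth(32, 1e9, 32));
    return 0;
}

For a 64-node machine this reports a diameter of 6 and a bisection width of 32 for the hypercube, matching the log p and p/2 entries of the table.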
The bisection bandwidth of a network can also be used as a measure of its cost, as it provides a lower bound on the area in a two-dimensional packaging or the volume in a three-dimensional packaging. If the bisection width of a network is w, the lower bound on the area in a two-dimensional packaging is Θ(w^2), and the lower bound on the volume in a three-dimensional packaging is Θ(w^(3/2)). According to this criterion, hypercubes and completely connected networks are more expensive than the other networks.
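As a rough illustration of why, the short program below (again only a sketch, with the Θ constants dropped) compares the w^2 area bound of a 2-D mesh, whose bisection width is √p, with that of a hypercube, whose bisection width is p/2.

#include <math.h>
#include <stdio.h>

int main(void) {
    for (int p = 16; p <= 1024; p *= 4) {
        double w_mesh = sqrt((double)p);   /* bisection width of a 2-D mesh  */
        double w_cube = p / 2.0;           /* bisection width of a hypercube */
        printf("p=%5d  mesh area >= %8.0f   hypercube area >= %10.0f\n",
               p, w_mesh * w_mesh, w_cube * w_cube);
    }
    return 0;
}

The mesh bound grows linearly with p, while the hypercube bound grows quadratically, which is what makes hypercubes and completely connected networks expensive under this criterion.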
We summarize the characteristics of various static networks in Table 2.1, which highlights the various cost-performance tradeoffs.

2.4.5 Evaluating Dynamic Interconnection Networks

A number of evaluation metrics for dynamic networks follow from the corresponding metrics for static networks. Since a message traversing a switch must pay an overhead, it is logical to think of each switch as a node in the network, in addition to the processing nodes. The diameter of the network can now be defined as the maximum distance between any two nodes in the network. This is indicative of the maximum delay that a message will encounter in being communicated between the selected pair of nodes. In reality, we would like the metric to be the maximum distance between any two processing nodes; however, for all networks of interest, this is equivalent to the maximum distance between any (processing or switching) pair of nodes.

The connectivity of a dynamic network can be defined in terms of node or edge connectivity. The node connectivity is the minimum number of nodes that must fail (be removed from the network) to fragment the network into two parts. As before, we should consider only switching nodes (as opposed to all nodes). However, considering all nodes gives a good approximation to the multiplicity of paths in a dynamic network. The arc connectivity of the network can be similarly defined as the minimum number of edges that must fail (be removed from the network) to fragment the network into two unreachable parts.

The bisection width of a dynamic network must be defined more precisely than diameter and connectivity. In the case of bisection width, we consider any possible partitioning of the p processing nodes into two equal parts. Note that this does not restrict the partitioning of the switching nodes. For each such partition, we select an induced partitioning of the switching nodes such that the number of edges crossing this partition is minimized. The minimum number of edges for any such partition is the bisection width of the dynamic network. Another intuitive way of thinking of bisection width is in terms of the minimum number of edges that must be removed from the network so as to partition the network into two halves with identical numbers of processing nodes. We illustrate this concept further in the following example:

Example 2.14 Bisection width of dynamic networks

Consider the network illustrated in Figure 2.20. We illustrate here three bisections, A, B, and C, each of which partitions the network into two groups of two processing nodes each. Notice that these partitions need not partition the network nodes equally. In the example, each partition results in an edge cut of four. We conclude that the bisection width of this graph is four.

Figure 2.20. Bisection width of a dynamic network is computed by examining various equi-partitions of the processing nodes and selecting the minimum number of edges crossing the partition. In this case, each partition yields an edge cut of four. Therefore, the bisection width of this graph is four.
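For small networks this definition can be checked by brute force: enumerate every balanced split of the processing nodes, place the switches on whichever sides minimize the cut, and keep the smallest cut found. The sketch below applies this to a small hypothetical network (four processors, each attached to one switch of a four-switch ring); it illustrates the definition and is not the network of Figure 2.20.

#include <limits.h>
#include <stdio.h>

#define NP 4                /* processing nodes: 0 .. NP-1      */
#define NS 4                /* switching nodes:  NP .. NP+NS-1  */
#define N  (NP + NS)

/* Edge list: each processor hangs off one switch, switches form a ring. */
static const int edges[][2] = {
    {0, 4}, {1, 5}, {2, 6}, {3, 7},          /* processor-to-switch links */
    {4, 5}, {5, 6}, {6, 7}, {7, 4}           /* switch-to-switch links    */
};
#define NE ((int)(sizeof(edges) / sizeof(edges[0])))

static int popcount(unsigned x) { int c = 0; while (x) { c += x & 1u; x >>= 1; } return c; }

int main(void) {
    int best = INT_MAX;
    /* Bit i of mask says which half node i is placed in. */
    for (unsigned mask = 0; mask < (1u << N); mask++) {
        /* keep only assignments that split the processors evenly */
        if (popcount(mask & ((1u << NP) - 1)) != NP / 2) continue;
        int cut = 0;
        for (int e = 0; e < NE; e++)
            cut += ((mask >> edges[e][0]) & 1u) != ((mask >> edges[e][1]) & 1u);
        if (cut < best) best = cut;
    }
    printf("bisection width = %d\n", best);
    return 0;
}

For this toy network the minimum cut is two: if the switch ring is split, at least two ring links cross the partition; if it is not, two processors are separated from their switches.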
The cost of a dynamic network is determined by the link cost, as is the case with static networks, as well as the switch cost. In typical dynamic networks, the degree of a switch is constant. Therefore, the number of links and switches is asymptotically identical. Furthermore, in typical networks, switch cost exceeds link cost. For this reason, the cost of dynamic networks is often determined by the number of switching nodes in the network. We summarize the characteristics of various dynamic networks in Table 2.2.

Table 2.2. A summary of the characteristics of various dynamic network topologies connecting p processing nodes.

Network         Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar        1          p                 1                  p^2
Omega Network   log p      p/2               2                  p/2
Dynamic Tree    2 log p    1                 2                  p - 1

2.4.6 Cache Coherence in Multiprocessor Systems

While interconnection networks provide basic mechanisms for communicating messages (data), in the case of shared-address-space computers additional hardware is required to keep multiple copies of data consistent with each other. Specifically, if there exist two copies of the data (in different caches/memory elements), how do we ensure that different processors operate on these in a manner that follows predefined semantics?

The problem of keeping caches in multiprocessor systems coherent is significantly more complex than in uniprocessor systems. This is because in addition to multiple copies as in uniprocessor systems, there may also be multiple processors modifying these copies. Consider a simple scenario illustrated in Figure 2.21. Two processors P0 and P1 are connected over a shared bus to a globally accessible memory. Both processors load the same variable. There are now three copies of the variable. The coherence mechanism must now ensure that all operations performed on these copies are serializable (i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule). When a processor changes the value of its copy of the variable, one of two things must happen: the other copies must be invalidated, or the other copies must be updated. Failing this, other processors may potentially work with incorrect (stale) values of the variable. These two protocols are referred to as invalidate and update protocols and are illustrated in Figure 2.21(a) and (b).

Figure 2.21. Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.
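The difference between the two protocols can be captured in a few lines. The toy, single-threaded simulation below is only a sketch of the idea in Figure 2.21 (the write-through assumption and all identifiers are illustrative): it keeps one cached copy of x per processor and shows what a store by P0 does to P1's copy under each protocol.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 2

static int  memory_x = 1;            /* copy in global memory       */
static int  cache_x[NPROC];          /* per-processor cached copies */
static bool valid[NPROC];            /* is the cached copy usable?  */

static void load(int p)  { cache_x[p] = memory_x; valid[p] = true; }

static void store(int p, int v, bool use_update_protocol) {
    cache_x[p] = v; valid[p] = true;
    for (int q = 0; q < NPROC; q++) {
        if (q == p) continue;
        if (use_update_protocol) cache_x[q] = v;    /* update protocol     */
        else                     valid[q]  = false; /* invalidate protocol */
    }
    memory_x = v;   /* write-through to memory, for simplicity */
}

int main(void) {
    load(0); load(1);                 /* both processors read x          */
    store(0, 3, false);               /* P0 writes 3 under invalidation  */
    printf("invalidate: P1 copy is %s\n",
           valid[1] ? "valid" : "stale and must be refetched");

    load(1);                          /* P1 refetches x                  */
    store(0, 5, true);                /* P0 writes 5 under update        */
    printf("update:     P1 copy = %d (kept valid)\n", cache_x[1]);
    return 0;
}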
In an update protocol, whenever a data item is written, all of its copies in the system are updated. For this reason, if a processor simply reads a data item once and never uses it again, subsequent updates to this item at other processors cause excess overhead in terms of latency at the source and bandwidth on the network. On the other hand, in this situation, an invalidate protocol invalidates the data item on the first update at a remote processor and subsequent updates need not be performed on this copy.

Another important factor affecting the performance of these protocols is false sharing. False sharing refers to the situation in which different processors update different parts of the same cache-line. Thus, although the updates are not performed on shared variables, the system does not detect this. In an invalidate protocol, when a processor updates its part of the cache-line, the other copies of this line are invalidated. When other processors try to update their parts of the cache-line, the line must actually be fetched from the remote processor. It is easy to see that false sharing can cause a cache-line to be ping-ponged between various processors. In an update protocol, this situation is slightly better since all reads can be performed locally and only the writes must be propagated as updates. This saves an invalidate operation that is otherwise wasted.

The tradeoff between invalidate and update schemes is the classic tradeoff between communication overhead (updates) and idling (stalling in invalidates). Current generation cache coherent machines typically rely on invalidate protocols. The rest of our discussion of multiprocessor cache systems therefore assumes invalidate protocols.

Maintaining Coherence Using Invalidate Protocols

Multiple copies of a single data item are kept consistent by keeping track of the number of copies and the state of each of these copies. We discuss here one possible set of states associated with data items and events that trigger transitions among these states. Note that this set of states and transitions is not unique. It is possible to define other states and associated transitions as well.

Let us revisit the example in Figure 2.21. Initially the variable x resides in the global memory. The first step executed by both processors is a load operation on this variable. At this point, the state of the variable is said to be shared, since it is shared by multiple processors. When processor P0 executes a store on this variable, it marks all other copies of this variable as invalid. It must also mark its own copy as modified or dirty. This is done to ensure that all subsequent accesses to this variable at other processors will be serviced by processor P0 and not from the memory. At this point, say, processor P1 executes another load operation on x. Processor P1 attempts to fetch this variable and, since the variable was marked dirty by processor P0, processor P0 services the request. Copies of this variable at processor P1 and the global memory are updated and the variable re-enters the shared state. Thus, in this simple model, there are three states - shared, invalid, and dirty - that a cache line goes through.
The complete state diagram of a simple three-state protocol is illustrated in Figure 2.22. The solid lines depict processor actions and the dashed lines coherence actions. For example, when a processor executes a read on an invalid block, the block is fetched and a transition is made from invalid to shared. Similarly, if a processor does a write on a shared block, the coherence protocol propagates a C_write (a coherence write) on the block. This triggers a transition from shared to invalid at all the other copies of the block.

Figure 2.22. State diagram of a simple three-state coherence protocol.

Example 2.14 Maintaining coherence using a simple three-state protocol

Consider an example of two program segments being executed by processors P0 and P1 as illustrated in Figure 2.23. The system consists of local memories (or caches) at processors P0 and P1, and a global memory. The three-state protocol assumed in this example corresponds to the state diagram illustrated in Figure 2.22. Cache lines in this system can be either shared, invalid, or dirty. Each data item (variable) is assumed to be on a different cache line. Initially, the two variables x and y are tagged dirty and the only copies of these variables exist in the global memory. Figure 2.23 illustrates state transitions along with values of copies of the variables with each instruction execution.

Figure 2.23. Example of parallel program execution with the simple three-state coherence protocol discussed in Section 2.4.6.
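The transitions of Figure 2.22 can be transcribed directly into a small state machine for a single cache line. The sketch below is one possible encoding (the enum names and the choice of events are assumptions, not the book's implementation); its main routine replays the load/store sequence of Figure 2.21 and prints the state of each processor's copy.

#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } state_t;
typedef enum { READ, WRITE, C_READ, C_WRITE } event_t;  /* local and remote events */

static const char *name[] = { "invalid", "shared", "dirty" };

/* Next state of this processor's copy of the line, following Figure 2.22. */
static state_t next_state(state_t s, event_t e) {
    switch (s) {
    case INVALID:
        if (e == READ)  return SHARED;    /* fetch the line                     */
        if (e == WRITE) return DIRTY;     /* fetch and mark modified            */
        return INVALID;                   /* remote traffic leaves it invalid   */
    case SHARED:
        if (e == WRITE)   return DIRTY;   /* propagates C_write to other copies */
        if (e == C_WRITE) return INVALID; /* another processor wrote the line   */
        return SHARED;                    /* local or remote reads are free     */
    case DIRTY:
        if (e == C_READ)  return SHARED;  /* flush; both copies become shared   */
        if (e == C_WRITE) return INVALID; /* flush, then give up the line       */
        return DIRTY;                     /* local reads/writes stay dirty      */
    }
    return s;
}

int main(void) {
    /* replay the scenario of Figure 2.21 for processors P0 and P1 */
    state_t p0 = INVALID, p1 = INVALID;
    p0 = next_state(p0, READ);        /* P0: load x             */
    p1 = next_state(p1, READ);        /* P1: load x             */
    p0 = next_state(p0, WRITE);       /* P0: x = 3              */
    p1 = next_state(p1, C_WRITE);     /* coherence write at P1  */
    printf("after P0's store: P0=%s  P1=%s\n", name[p0], name[p1]);
    p1 = next_state(p1, READ);        /* P1: load x again       */
    p0 = next_state(p0, C_READ);      /* P0 flushes its copy    */
    printf("after P1's load:  P0=%s  P1=%s\n", name[p0], name[p1]);
    return 0;
}

After P0's store the line is dirty at P0 and invalid at P1; after P1's subsequent load both copies return to the shared state, exactly as described above.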