address-space multiprocessor - a trend that is fast gaining momentum in modern message-passing parallel computers. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. On such platforms, interactions between processes running on different nodes must be accomplished using messages, hence the name message passing. This exchange of messages is used to transfer data, to transfer work, and to synchronize actions among the processes. In its most general form, message-passing paradigms support the execution of a different program on each of the p nodes.

Since interactions are accomplished by sending and receiving messages, the basic operations in this programming paradigm are send and receive (the corresponding calls may differ across APIs, but the semantics are largely identical). In addition, since the send and receive operations must specify target addresses, there must be a mechanism to assign a unique identification or ID to each of the multiple processes executing a parallel program. This ID is typically made available to the program through a function such as whoami, which returns to a calling process its ID. One other function is typically needed to complete the basic set of message-passing operations: numprocs, which specifies the number of processes participating in the ensemble. With these four basic operations, it is possible to write any message-passing program. Different message-passing APIs, such as the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM), support these basic operations and a variety of higher-level functionality under different function names. Examples of parallel platforms that support the message-passing paradigm include the IBM SP, the SGI Origin 2000, and workstation clusters.
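As a concrete illustration, the sketch below expresses the four basic operations in MPI, with MPI_Comm_rank and MPI_Comm_size playing the roles of whoami and numprocs; the message value and tag are arbitrary choices made for the example.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int id, nprocs, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);      /* whoami   */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* numprocs */
        if (id == 0 && nprocs > 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* send */
        } else if (id == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* receive */
            printf("process %d of %d received %d\n", id, nprocs, value);
        }
        MPI_Finalize();
        return 0;
    }

Such a program would typically be compiled with mpicc and launched on two or more processes with mpirun.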
It is easy to emulate a message-passing architecture containing p nodes on a shared-address-space computer with an identical number of nodes. Assuming uniprocessor nodes, this can be done by partitioning the shared address space into p disjoint parts and assigning one such partition exclusively to each processor. A processor can then "send" or "receive" messages by writing to or reading from another processor's partition, while using appropriate synchronization primitives to inform its communication partner when it has finished reading or writing the data. However, emulating a shared-address-space architecture on a message-passing computer is costly, since accessing another node's memory requires sending and receiving messages.
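A minimal sketch of this emulation, not from the text, is shown below: two threads stand in for uniprocessor nodes, each partition is reduced to a one-slot mailbox, and flag-based spin loops stand in for the synchronization primitives mentioned above. A real implementation would need deeper buffering and per-pair mailboxes.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* One partition of the shared address space per "processor": a
       one-slot mailbox whose flag tells the partner when data is ready. */
    typedef struct {
        atomic_int full;   /* 0 = slot empty, 1 = message present */
        int        data;
    } mailbox;

    static mailbox partition[2];

    static void emulated_send(int dest, int value) {
        while (atomic_load(&partition[dest].full))   /* wait for an empty slot */
            ;
        partition[dest].data = value;          /* write into partner's partition */
        atomic_store(&partition[dest].full, 1);      /* signal the receiver */
    }

    static int emulated_recv(int self) {
        while (!atomic_load(&partition[self].full))  /* wait for a message */
            ;
        int value = partition[self].data;
        atomic_store(&partition[self].full, 0);      /* release the slot */
        return value;
    }

    static void *node1(void *arg) {
        (void)arg;
        printf("node 1 received %d\n", emulated_recv(1));
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, node1, NULL);
        emulated_send(1, 42);    /* node 0 "sends" to node 1 */
        pthread_join(t, NULL);
        return 0;
    }

Compile with -pthread. Note that the reverse direction of the emulation argument holds as well: each emulated_send here is a handful of loads and stores, whereas emulating a shared load or store on a message-passing machine requires a full message exchange.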
2.4 Physical Organization of Parallel Platforms

In this section, we discuss the physical architecture of parallel machines. We start with an ideal architecture, outline practical difficulties associated with realizing this model, and discuss some conventional architectures.

2.4.1 Architecture of an Ideal Parallel Computer

A natural extension of the serial model of computation (the Random Access Machine, or RAM) consists of p processors and a global memory of unbounded size that is uniformly accessible to all processors. All processors access the same address space. Processors share a common clock but may execute different instructions in each cycle. This ideal model is also referred to as a parallel random access machine (PRAM). Since PRAMs allow concurrent access to various memory locations, they can be divided into four subclasses, depending on how simultaneous memory accesses are handled.

1. Exclusive-read, exclusive-write (EREW) PRAM. In this class, access to a memory location is exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM model, affording minimum concurrency in memory access.

2. Concurrent-read, exclusive-write (CREW) PRAM. In this class, multiple read accesses to a memory location are allowed. However, multiple write accesses to a memory location are serialized.

3. Exclusive-read, concurrent-write (ERCW) PRAM. Multiple write accesses are allowed to a memory location, but multiple read accesses are serialized.

4. Concurrent-read, concurrent-write (CRCW) PRAM. This class allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model.
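These definitions lend themselves to a small access checker. The sketch below, which is not from the text, classifies one memory cycle under a given model; as a simplifying assumption, it conservatively disallows mixed reads and writes to the same location in every model.

    #include <stdio.h>

    /* addr[i] is the word processor i touches this cycle; is_write[i]
       marks writes. Returns 1 if the cycle is legal under a model that
       permits concurrent reads (allow_cr) and/or writes (allow_cw). */
    int legal_cycle(const int addr[], const int is_write[], int p,
                    int allow_cr, int allow_cw)
    {
        for (int i = 0; i < p; i++)
            for (int j = i + 1; j < p; j++) {
                if (addr[i] != addr[j]) continue;         /* different words: ok */
                if (is_write[i] != is_write[j]) return 0; /* mixed access: disallowed */
                if (is_write[i] && !allow_cw) return 0;   /* concurrent writes */
                if (!is_write[i] && !allow_cr) return 0;  /* concurrent reads */
            }
        return 1;
    }

    int main(void) {
        int addr[3]     = {5, 5, 9};
        int is_write[3] = {0, 0, 1};   /* two concurrent reads of word 5 */
        printf("EREW legal: %d\n", legal_cycle(addr, is_write, 3, 0, 0)); /* 0 */
        printf("CREW legal: %d\n", legal_cycle(addr, is_write, 3, 1, 0)); /* 1 */
        return 0;
    }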
Allowing concurrent read access does not create any semantic discrepancies in the program. However, concurrent write access to a memory location requires arbitration. Several protocols are used to resolve concurrent writes. The most frequently used protocols are as follows:

Common, in which the concurrent write is allowed if all the values that the processors are attempting to write are identical.

Arbitrary, in which an arbitrary processor is allowed to proceed with the write operation and the rest fail.

Priority, in which all processors are organized into a predefined prioritized list, and the processor with the highest priority succeeds and the rest fail.

Sum, in which the sum of all the quantities is written (the sum-based write conflict resolution model can be extended to any associative operator defined on the quantities being written).
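A sketch of how a CRCW simulator might apply these protocols to one contested location is shown below; the enum and function names are invented for the example, and processor 0 is assumed to sit highest on the priority list.

    #include <stdio.h>

    enum protocol { COMMON, ARBITRARY, PRIORITY, SUM };

    /* Returns the value stored at the contested location, or the old
       value if a COMMON write fails because the attempted values disagree. */
    int resolve_write(enum protocol proto, const int attempts[], int p, int old)
    {
        int i, sum = 0;
        switch (proto) {
        case COMMON:                  /* succeed only if all values agree */
            for (i = 1; i < p; i++)
                if (attempts[i] != attempts[0]) return old;
            return attempts[0];
        case ARBITRARY:               /* any one writer may win; pick the last */
            return attempts[p - 1];
        case PRIORITY:                /* lowest index = highest priority */
            return attempts[0];
        case SUM:                     /* store the sum of all attempted values */
            for (i = 0; i < p; i++) sum += attempts[i];
            return sum;
        }
        return old;
    }

    int main(void) {
        int attempts[4] = {3, 5, 7, 9};
        printf("SUM stores %d\n",      resolve_write(SUM, attempts, 4, 0));
        printf("PRIORITY stores %d\n", resolve_write(PRIORITY, attempts, 4, 0));
        return 0;
    }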
Architectural Complexity of the Ideal Model

Consider the implementation of an EREW PRAM as a shared-memory computer with p processors and a global memory of m words. The processors are connected to the memory through a set of switches. These switches determine the memory word being accessed by each processor. In an EREW PRAM, each of the p processors in the ensemble can access any of the memory words, provided that a word is not accessed by more than one processor simultaneously. To ensure such connectivity, the total number of switches must be Θ(mp). (See the Appendix for an explanation of the notation.) For a reasonable memory size, constructing a switching network of this complexity is very expensive. Thus, PRAM models of computation are impossible to realize in practice.

2.4.2 Interconnection Networks for Parallel Computers

Interconnection networks provide mechanisms for data transfer between processing nodes or between processors and memory modules. A black-box view of an interconnection network consists of n inputs and m outputs. The outputs may or may not be distinct from the inputs. Typical interconnection networks are built using links and switches. A link corresponds to physical media such as a set of wires or fibers capable of carrying information. A variety of factors influence link characteristics. For links based on conducting media, the capacitive coupling between wires limits the speed of signal propagation. This capacitive coupling and the attenuation of signal strength are functions of the length of the link.

Interconnection networks can be classified as static or dynamic. Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks. Dynamic networks, on the other hand, are built using switches and communication links. Communication links are connected to one another dynamically by the switches to establish paths among processing nodes and memory banks. Dynamic networks are also referred to as indirect networks. Figure 2.6(a) illustrates a simple static network of four processing elements or nodes. Each processing node is connected via a network interface to two other nodes in a mesh configuration. Figure 2.6(b) illustrates a dynamic network of four nodes connected via a network of switches to other nodes.

Figure 2.6. Classification of interconnection networks: (a) a static network; and (b) a dynamic network. [figure omitted: processing nodes with network interfaces/switches; the dynamic (indirect) network adds separate switching elements]
A single switch in an interconnection network consists of a set of input ports and a set of output ports. Switches provide a range of functionality. The minimal functionality provided by a switch is a mapping from the input to the output ports. The total number of ports on a switch is also called the degree of the switch. Switches may also provide support for internal buffering (when the requested output port is busy), routing (to alleviate congestion on the network), and multicast (same output on multiple ports). The mapping from input to output ports can be provided using a variety of mechanisms based on physical crossbars, multi-ported memories, multiplexor-demultiplexors, and multiplexed buses. The cost of a switch is influenced by the cost of the mapping hardware, the peripheral hardware, and packaging costs. The mapping hardware typically grows as the square of the degree of the switch, the peripheral hardware linearly as the degree, and the packaging costs linearly as the number of pins.

The connectivity between the nodes and the network is provided by a network interface. The network interface has input and output ports that pipe data into and out of the network. It typically has the responsibility of packetizing data, computing routing information, buffering incoming and outgoing data to match the speeds of the network and the processing elements, and error checking. The position of the interface between the processing element and the network is also important. While conventional network interfaces hang off the I/O bus, interfaces in tightly coupled parallel machines hang off the memory bus. Since I/O buses are typically slower than memory buses, the latter can support higher bandwidth.
2.4.3 Network Topologies

A wide variety of network topologies have been used in interconnection networks. These topologies try to trade off cost and scalability with performance. While pure topologies have attractive mathematical properties, in practice interconnection networks tend to be combinations or modifications of the pure topologies discussed in this section.

Bus-Based Networks

A bus-based network is perhaps the simplest network, consisting of a shared medium that is common to all the nodes. A bus has the desirable property that the cost of the network scales linearly with the number of nodes, p. This cost is typically associated with bus interfaces. Furthermore, the distance between any two nodes in the network is constant (O(1)). Buses are also ideal for broadcasting information among nodes. Since the transmission medium is shared, there is little overhead associated with broadcast compared to point-to-point message transfer. However, the bounded bandwidth of a bus places limitations on the overall performance of the network as the number of nodes increases. Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.

The demands on bus bandwidth can be reduced by making use of the property that in typical programs, a majority of the data accessed is local to the node. For such programs, it is possible to provide a cache for each node. Private data is cached at the node and only remote data is accessed through the bus.

Example 2.12 Reducing shared-bus bandwidth using caches

Figure 2.7(a) illustrates p processors sharing a bus to the memory.
Assuming that each processor accesses k data items, and that each data access takes time t_cycle, the execution time is lower bounded by t_cycle × k × p seconds. Now consider the hardware organization of Figure 2.7(b). Let us assume that 50% of the memory accesses (0.5k) are made to local data. This local data resides in the private memory of the processor. We assume that the access time to the private memory is identical to that of the global memory, i.e., t_cycle. In this case, the total execution time is lower bounded by 0.5 × t_cycle × k + 0.5 × t_cycle × k × p. Here, the first term results from accesses to local data and the second term from accesses to shared data. It is easy to see that as p becomes large, the organization of Figure 2.7(b) results in a lower bound that approaches 0.5 × t_cycle × k × p. This is a 50% improvement in the lower bound on execution time compared to the organization of Figure 2.7(a).

Figure 2.7. Bus-based interconnects (a) with no local caches; (b) with local memory/caches. [figure omitted: processors on a shared address/data bus to shared memory, without and with per-processor caches/local memory]
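The two bounds are easy to check numerically. The sketch below uses arbitrary illustrative values of t_cycle and k and prints the ratio of the two lower bounds, which approaches 0.5 as p grows.

    #include <stdio.h>

    int main(void) {
        const double t_cycle = 1e-6;   /* assumed access time, seconds */
        const double k = 1000.0;       /* assumed accesses per processor */
        for (int p = 2; p <= 512; p *= 4) {
            double no_cache = t_cycle * k * p;                       /* Figure 2.7(a) */
            double cached   = 0.5 * t_cycle * k * (1.0 + p);         /* Figure 2.7(b) */
            printf("p = %3d: no cache %.4f s, cached %.4f s, ratio %.3f\n",
                   p, no_cache, cached, cached / no_cache);
        }
        return 0;
    }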
In practice, shared and private data is handled in a more sophisticated manner. This is briefly addressed with cache coherence issues in Section 2.4.6.

Crossbar Networks

A simple way to connect p processors to b memory banks is to use a crossbar network. A crossbar network employs a grid of switches or switching nodes, as shown in Figure 2.8. The crossbar network is a non-blocking network in the sense that the connection of a processing node to a memory bank does not block the connection of any other processing node to other memory banks.

Figure 2.8. A completely non-blocking crossbar network connecting p processors to b memory banks. [figure omitted: a grid of switching elements linking processors 0 through p - 1 to memory banks 0 through b - 1]
The total number of switching nodes required to implement such a network is Θ(pb). It is reasonable to assume that the number of memory banks b is at least p; otherwise, at any given time, there will be some processing nodes that will be unable to access any memory banks. Therefore, as the value of p is increased, the complexity (component count) of the switching network grows as Θ(p²). (See the Appendix for an explanation of the notation.) As the number of processing nodes becomes large, this switch complexity is difficult to realize at high data rates. Consequently, crossbar networks are not very scalable in terms of cost.

Multistage Networks

The crossbar interconnection network is scalable in terms of performance but unscalable in terms of cost. Conversely, the shared-bus network is scalable in terms of cost but unscalable in terms of performance. An intermediate class of networks called multistage interconnection networks lies between these two extremes. It is more scalable than the bus in terms of performance and more scalable than the crossbar in terms of cost.

The general schematic of a multistage network consisting of p processing nodes and b memory banks is shown in Figure 2.9. A commonly used multistage connection network is the omega network. This network consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks). Each stage of the omega network consists of an interconnection pattern that connects p inputs and p outputs; a link exists between input i and output j if the following is true:

Equation 2.1

    j = \begin{cases} 2i, & 0 \le i \le p/2 - 1 \\ 2i + 1 - p, & p/2 \le i \le p - 1 \end{cases}
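Equation 2.1 is the perfect shuffle: it left-rotates the log p address bits of the input. A sketch of the mapping, assuming p is a power of two, is shown below.

    #include <stdio.h>

    /* The interconnection pattern of Equation 2.1: input i connects to
       output 2i for i < p/2, and to 2i + 1 - p otherwise. */
    int shuffle(int i, int p) {
        return (i < p / 2) ? 2 * i : 2 * i + 1 - p;
    }

    int main(void) {
        const int p = 8;
        for (int i = 0; i < p; i++)
            printf("input %d -> output %d\n", i, shuffle(i, p));
        return 0;
    }

For p = 8, input 1 (binary 001) connects to output 2 (010) and input 4 (100) to output 1 (001), consistent with a one-bit left rotation.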
Figure 2.9. The schematic of a typical multistage interconnection network.