Make a table showing the numbers of receives and additions carried out by core 0 when the two sums are used with 2, 4, 8, ..., 1024 cores. (A short program for checking such a table appears after these exercises.)

1.7 The first part of the global sum example—when each core adds its assigned computed values—is usually considered to be an example of data-parallelism, while the second part of the first global sum—when the cores send their partial sums to the master core, which adds them—could be considered to be an example of task-parallelism. What about the second part of the second global sum—when the cores use a tree structure to add their partial sums? Is this an example of data- or task-parallelism? Why?

1.8 Suppose the faculty are going to have a party for the students in the department.

a. Identify tasks that can be assigned to the faculty members that will allow them to use task-parallelism when they prepare for the party. Work out a schedule that shows when the various tasks can be performed.
b. We might hope that one of the tasks in the preceding part is cleaning the house where the party will be held. How can we use data-parallelism to partition the work of cleaning the house among the faculty?
c. Use a combination of task- and data-parallelism to prepare for the party. (If there’s too much work for the faculty, you can use TAs to pick up the slack.)

1.9 Write an essay describing a research problem in your major that would benefit from the use of parallel computing. Provide a rough outline of how parallelism would be used. Would you use task- or data-parallelism?
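The counts in the table exercise can be checked with a short program. The sketch below is not from the book; it assumes the counts implied by the chapter's discussion: in the first global sum, core 0 receives a partial sum from each of the other p - 1 cores and performs one addition per receive, while in the tree-structured sum, core 0 performs one receive and one addition per level of the tree, that is, log2(p) of each.

#include <stdio.h>

/* Tabulate, for core 0, the number of receives and additions in the
   two global sums.  Assumed counts: first sum, p-1 receives and p-1
   additions; tree-structured sum, log2(p) receives and log2(p)
   additions. */
int main(void) {
    printf("%6s %12s %12s %12s %12s\n",
           "cores", "recvs (1st)", "adds (1st)", "recvs (tree)", "adds (tree)");
    for (int p = 2; p <= 1024; p *= 2) {
        int first = p - 1;                      /* receives = additions   */
        int tree = 0;                           /* log2(p), computed by   */
        for (int q = p; q > 1; q /= 2) tree++;  /* repeated halving, so   */
                                                /* math.h isn't needed    */
        printf("%6d %12d %12d %12d %12d\n", p, first, first, tree, tree);
    }
    return 0;
}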
CHAPTER 2
Parallel Hardware and Parallel Software

It’s perfectly feasible for specialists in disciplines other than computer science and computer engineering to write parallel programs. However, in order to write efficient parallel programs, we do need some knowledge of the underlying hardware and system software. It’s also very useful to have some knowledge of different types of parallel software, so in this chapter we’ll take a brief look at a few topics in hardware and software. We’ll also take a brief look at evaluating program performance and a method for developing parallel programs. We’ll close with a discussion of what kind of environment we might expect to be working in, and a few rules and assumptions we’ll make in the rest of the book.

This is a long, broad chapter, so it may be a good idea to skim through some of the sections on a first reading so that you have a good idea of what’s in the chapter. Then, when a concept or term in a later chapter isn’t quite clear, it may be helpful to refer back to this chapter. In particular, you may want to skim over most of the material in “Modifications to the von Neumann Model,” except “The Basics of Caching.” Also, in the “Parallel Hardware” section, you can safely skim the material on “SIMD Systems” and “Interconnection Networks.”

2.1 SOME BACKGROUND

Parallel hardware and software have grown out of conventional serial hardware and software: hardware and software that runs (more or less) a single job at a time. So in order to better understand the current state of parallel systems, let’s begin with a brief look at a few aspects of serial systems.

2.1.1 The von Neumann architecture

The “classical” von Neumann architecture consists of main memory, a central processing unit (CPU) or processor or core, and an interconnection between the memory and the CPU. Main memory consists of a collection of locations, each of which is capable of storing both instructions and data. Every location consists of an address, which is used to access the location, and the contents of the location—the instructions or data stored in the location.
The central processing unit is divided into a control unit and an arithmetic and logic unit (ALU). The control unit is responsible for deciding which instructions in a program should be executed, and the ALU is responsible for executing the actual instructions. Data in the CPU and information about the state of an executing program are stored in special, very fast storage called registers. The control unit has a special register called the program counter. It stores the address of the next instruction to be executed.

Instructions and data are transferred between the CPU and memory via the interconnect. This has traditionally been a bus, which consists of a collection of parallel wires and some hardware controlling access to the wires. A von Neumann machine executes a single instruction at a time, and each instruction operates on only a few pieces of data. See Figure 2.1.

[Figure 2.1: The von Neumann architecture. The CPU (control unit, ALU, and registers) is joined by an interconnect to main memory (addresses and their contents).]

When data or instructions are transferred from memory to the CPU, we sometimes say the data or instructions are fetched or read from memory. When data are transferred from the CPU to memory, we sometimes say the data are written to memory or stored.
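To make "executes a single instruction at a time" concrete, the following toy simulation of the fetch-execute cycle may help. It is only a sketch: the three-opcode instruction set, the encoding (an opcode followed by an operand address), and the little program in memory are all invented for illustration. The von Neumann ingredients are nonetheless visible: a single memory array holds both instructions and data, and the program counter holds the address of the next instruction to be fetched.

#include <stdio.h>

/* A toy von Neumann machine.  One array serves as main memory and
   holds both the program and its data.  The CPU repeatedly fetches
   the instruction addressed by the program counter, advances the
   counter, and executes. */
enum { LOAD, ADD, HALT };    /* LOAD = 0, ADD = 1, HALT = 2 */

int main(void) {
    int mem[12] = {
        LOAD, 10,            /* reg  = mem[10]             */
        ADD,  11,            /* reg += mem[11]             */
        HALT, 0,             /* stop                       */
        0, 0, 0, 0,          /* unused                     */
        3, 4                 /* data at addresses 10, 11   */
    };
    int pc  = 0;             /* program counter: address of next instruction */
    int reg = 0;             /* a single register                            */

    for (;;) {
        int op  = mem[pc];       /* fetch the opcode ...            */
        int arg = mem[pc + 1];   /* ... and its operand address     */
        pc += 2;                 /* advance to the next instruction */
        if      (op == LOAD) reg = mem[arg];
        else if (op == ADD)  reg += mem[arg];
        else                 break;   /* HALT */
    }
    printf("result = %d\n", reg);     /* prints "result = 7" */
    return 0;
}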
The separation of memory and CPU is often called the von Neumann bottleneck, since the interconnect determines the rate at which instructions and data can be accessed. The potentially vast quantity of data and instructions needed to run a program is effectively isolated from the CPU. In 2010, CPUs were capable of executing instructions more than one hundred times faster than they could fetch items from main memory.

In order to better understand this problem, imagine that a large company has a single factory (the CPU) in one town and a single warehouse (main memory) in another. Further imagine that there is a single two-lane road joining the warehouse and the factory. All the raw materials used in manufacturing the products are stored in the warehouse. Also, all the finished products are stored in the warehouse before being shipped to customers. If the rate at which products can be manufactured is much larger than the rate at which raw materials and finished products can be transported, then it’s likely that there will be a huge traffic jam on the road, and the employees and machinery in the factory will either be idle for extended periods or will have to reduce the rate at which they produce finished products.

In order to address the von Neumann bottleneck, and, more generally, improve CPU performance, computer engineers and computer scientists have experimented with many modifications to the basic von Neumann architecture. Before discussing some of these modifications, let’s first take a moment to discuss some aspects of the software used in both von Neumann systems and more modern systems.

2.1.2 Processes, multitasking, and threads

Recall that the operating system, or OS, is a major piece of software whose purpose is to manage hardware and software resources on a computer. It determines which programs can run and when they can run. It also controls the allocation of memory to running programs and access to peripheral devices such as hard disks and network interface cards.

When a user runs a program, the operating system creates a process—an instance of a computer program that is being executed. A process consists of several entities:

• The executable machine language program.
• A block of memory, which will include the executable code, a call stack that keeps track of active functions, a heap, and some other memory locations.
• Descriptors of resources that the operating system has allocated to the process—for example, file descriptors.
• Security information—for example, information specifying which hardware and software resources the process can access.
• Information about the state of the process, such as whether the process is ready to run or is waiting on some resource, the content of the registers, and information about the process’ memory.
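As a rough illustration of how an OS might record these entities, here is a toy C struct; such a record is often called a process control block. Every field name and size in this sketch is invented, and real operating systems track far more state than this.

/* A toy process record.  One field per entity in the list above;
   all names and sizes are invented for illustration. */
#define MAX_OPEN_FILES 16
#define NUM_REGS 32

typedef struct {
    void *text;                        /* the executable machine language program */
    void *stack;                       /* call stack for active functions         */
    void *heap;                        /* dynamically allocated memory            */
    int   open_files[MAX_OPEN_FILES];  /* descriptors of allocated resources      */
    int   owner_id;                    /* security info: who may access what      */
    enum { READY, RUNNING, BLOCKED }
          state;                       /* ready to run, running, or waiting       */
    long  saved_regs[NUM_REGS];        /* register contents, saved whenever the   */
                                       /* process isn't running                   */
} toy_process_t;

When the OS switches from one running process to another, it saves the registers of the process being suspended into its record and restores those of the process being resumed.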
Most modern operating systems are multitasking. This means that the operating system provides support for the apparent simultaneous execution of multiple programs. This is possible even on a system with a single core, since each process runs for a small interval of time (typically a few milliseconds), often called a time slice. After one running program has executed for a time slice, the operating system can run a different program. A multitasking OS may change the running process many times a minute, even though changing the running process can take a long time.

In a multitasking OS, if a process needs to wait for a resource—for example, it needs to read data from external storage—it will block. This means that it will stop executing and the operating system can run another process. However, many programs can continue to do useful work even though the part of the program that is currently executing must wait on a resource. For example, an airline reservation system that is blocked waiting for a seat map for one user could provide a list of available flights to another user. Threading provides a mechanism for programmers to divide their programs into more or less independent tasks, with the property that when one thread is blocked, another thread can be run. Furthermore, in most systems it’s possible to switch between threads much faster than it’s possible to switch between processes. This is because threads are “lighter weight” than processes. Threads are contained within processes, so they can use the same executable, and they usually share the same memory and the same I/O devices. In fact, two threads belonging to one process can share most of the process’ resources. The two most important exceptions are that they’ll need a record of their own program counters and they’ll need their own call stacks so that they can execute independently of each other.

If a process is the “master” thread of execution and threads are started and stopped by the process, then we can envision the process and its subsidiary threads as lines: when a thread is started, it forks off the process; when a thread terminates, it joins the process. See Figure 2.2.

[Figure 2.2: A process and two threads]
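As a concrete sketch of forking and joining, the following program uses the POSIX threads (Pthreads) API: pthread_create starts, or forks, a thread running a given function, and pthread_join makes the caller wait until that thread terminates. This is a minimal illustration, not an excerpt from the book; on most Unix-like systems it compiles with cc -pthread.

#include <stdio.h>
#include <pthread.h>

/* Each forked thread runs this function; the argument identifies it. */
void *work(void *arg) {
    long rank = (long) arg;
    printf("thread %ld: doing its share of the work\n", rank);
    return NULL;
}

int main(void) {
    pthread_t threads[2];

    /* Fork: two threads split off from the main (master) thread. */
    for (long t = 0; t < 2; t++)
        pthread_create(&threads[t], NULL, work, (void *) t);

    /* Join: the main thread waits here until both threads finish. */
    for (long t = 0; t < 2; t++)
        pthread_join(threads[t], NULL);

    printf("main: both threads have joined\n");
    return 0;
}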
2.2 MODIFICATIONS TO THE VON NEUMANN MODEL

As we noted earlier, since the first electronic digital computers were developed back in the 1940s, computer scientists and computer engineers have made many improvements to the basic von Neumann architecture. Many are targeted at reducing the problem of the von Neumann bottleneck, but many are also targeted at simply making CPUs faster. In this section we’ll look at three of these improvements: caching, virtual memory, and low-level parallelism.