Make a table showing the numbers of receives and additions carried out by core 0 when the two sums are used with 2, 4, 8, ..., 1024 cores. (A short program for checking such a table appears after these exercises.)

1.7 The first part of the global sum example—when each core adds its assigned computed values—is usually considered to be an example of data-parallelism, while the second part of the first global sum—when the cores send their partial sums to the master core, which adds them—could be considered to be an example of task-parallelism. What about the second part of the second global sum—when the cores use a tree structure to add their partial sums? Is this an example of data- or task-parallelism? Why?

1.8 Suppose the faculty are going to have a party for the students in the department.

a. Identify tasks that can be assigned to the faculty members that will allow them to use task-parallelism when they prepare for the party. Work out a schedule that shows when the various tasks can be performed.
b. We might hope that one of the tasks in the preceding part is cleaning the house where the party will be held. How can we use data-parallelism to partition the work of cleaning the house among the faculty?
c. Use a combination of task- and data-parallelism to prepare for the party. (If there’s too much work for the faculty, you can use TAs to pick up the slack.)

1.9 Write an essay describing a research problem in your major that would benefit from the use of parallel computing. Provide a rough outline of how parallelism would be used. Would you use task- or data-parallelism?
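The counts in the table exercise can be checked with a short program. The sketch below is not from the book; it assumes the counts implied by the chapter's discussion: in the first global sum, core 0 receives a partial sum from each of the other p - 1 cores and performs one addition per receive, while in the tree-structured sum, core 0 performs one receive and one addition per level of the tree, that is, log2(p) of each.

#include <stdio.h>

/* Tabulate, for core 0, the number of receives and additions in the
   two global sums.  Assumed counts: first sum, p-1 receives and p-1
   additions; tree-structured sum, log2(p) receives and log2(p)
   additions. */
int main(void) {
    printf("%6s %12s %12s %12s %12s\n",
           "cores", "recvs (1st)", "adds (1st)", "recvs (tree)", "adds (tree)");
    for (int p = 2; p <= 1024; p *= 2) {
        int first = p - 1;                      /* receives = additions   */
        int tree = 0;                           /* log2(p), computed by   */
        for (int q = p; q > 1; q /= 2) tree++;  /* repeated halving, so   */
                                                /* math.h isn't needed    */
        printf("%6d %12d %12d %12d %12d\n", p, first, first, tree, tree);
    }
    return 0;
}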
CHAPTER 2
Parallel Hardware and Parallel Software

It’s perfectly feasible for specialists in disciplines other than computer science and computer engineering to write parallel programs. However, in order to write efficient parallel programs, we do need some knowledge of the underlying hardware and system software. It’s also very useful to have some knowledge of different types of parallel software, so in this chapter we’ll take a brief look at a few topics in hardware and software. We’ll also take a brief look at evaluating program performance and a method for developing parallel programs. We’ll close with a discussion of what kind of environment we might expect to be working in, and a few rules and assumptions we’ll make in the rest of the book.

This is a long, broad chapter, so it may be a good idea to skim through some of the sections on a first reading so that you have a good idea of what’s in the chapter. Then, when a concept or term in a later chapter isn’t quite clear, it may be helpful to refer back to this chapter. In particular, you may want to skim over most of the material in “Modifications to the von Neumann Model,” except “The Basics of Caching.” Also, in the “Parallel Hardware” section, you can safely skim the material on “SIMD Systems” and “Interconnection Networks.”

2.1 SOME BACKGROUND

Parallel hardware and software have grown out of conventional serial hardware and software: hardware and software that runs (more or less) a single job at a time. So in order to better understand the current state of parallel systems, let’s begin with a brief look at a few aspects of serial systems.

2.1.1 The von Neumann architecture

The “classical” von Neumann architecture consists of main memory, a central processing unit (CPU) or processor or core, and an interconnection between the memory and the CPU. Main memory consists of a collection of locations, each of which is capable of storing both instructions and data. Every location consists of an address, which is used to access the location, and the contents of the location—the instructions or data stored in the location.
The central processing unit is divided into a control unit and an arithmetic and logic unit (ALU). The control unit is responsible for deciding which instructions in a program should be executed, and the ALU is responsible for executing the actual instructions. Data in the CPU and information about the state of an executing program are stored in special, very fast storage called registers. The control unit has a special register called the program counter. It stores the address of the next instruction to be executed.

Instructions and data are transferred between the CPU and memory via the interconnect. This has traditionally been a bus, which consists of a collection of parallel wires and some hardware controlling access to the wires. A von Neumann machine executes a single instruction at a time, and each instruction operates on only a few pieces of data. See Figure 2.1.

[Figure 2.1: The von Neumann architecture. The CPU (control unit, ALU, and registers) is joined by an interconnect to main memory (addresses and their contents).]

When data or instructions are transferred from memory to the CPU, we sometimes say the data or instructions are fetched or read from memory. When data are transferred from the CPU to memory, we sometimes say the data are written to memory or stored.
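To make "executes a single instruction at a time" concrete, the following toy simulation of the fetch-execute cycle may help. It is only a sketch: the three-opcode instruction set, the encoding (an opcode followed by an operand address), and the little program in memory are all invented for illustration. The von Neumann ingredients are nonetheless visible: a single memory array holds both instructions and data, and the program counter holds the address of the next instruction to be fetched.

#include <stdio.h>

/* A toy von Neumann machine.  One array serves as main memory and
   holds both the program and its data.  The CPU repeatedly fetches
   the instruction addressed by the program counter, advances the
   counter, and executes. */
enum { LOAD, ADD, HALT };    /* LOAD = 0, ADD = 1, HALT = 2 */

int main(void) {
    int mem[12] = {
        LOAD, 10,            /* reg  = mem[10]             */
        ADD,  11,            /* reg += mem[11]             */
        HALT, 0,             /* stop                       */
        0, 0, 0, 0,          /* unused                     */
        3, 4                 /* data at addresses 10, 11   */
    };
    int pc  = 0;             /* program counter: address of next instruction */
    int reg = 0;             /* a single register                            */

    for (;;) {
        int op  = mem[pc];       /* fetch the opcode ...            */
        int arg = mem[pc + 1];   /* ... and its operand address     */
        pc += 2;                 /* advance to the next instruction */
        if      (op == LOAD) reg = mem[arg];
        else if (op == ADD)  reg += mem[arg];
        else                 break;   /* HALT */
    }
    printf("result = %d\n", reg);     /* prints "result = 7" */
    return 0;
}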
The separation of memory and CPU is often called the von Neumann bottleneck, since the interconnect determines the rate at which instructions and data can be accessed. The potentially vast quantity of data and instructions needed to run a program is effectively isolated from the CPU. In 2010, CPUs were capable of executing instructions more than one hundred times faster than they could fetch items from main memory.

In order to better understand this problem, imagine that a large company has a single factory (the CPU) in one town and a single warehouse (main memory) in another. Further imagine that there is a single two-lane road joining the warehouse and the factory. All the raw materials used in manufacturing the products are stored in the warehouse. Also, all the finished products are stored in the warehouse before being shipped to customers. If the rate at which products can be manufactured is much larger than the rate at which raw materials and finished products can be transported, then it’s likely that there will be a huge traffic jam on the road, and the employees and machinery in the factory will either be idle for extended periods or will have to reduce the rate at which they produce finished products.

In order to address the von Neumann bottleneck, and, more generally, improve CPU performance, computer engineers and computer scientists have experimented with many modifications to the basic von Neumann architecture. Before discussing some of these modifications, let’s first take a moment to discuss some aspects of the software used in both von Neumann systems and more modern systems.

2.1.2 Processes, multitasking, and threads

Recall that the operating system, or OS, is a major piece of software whose purpose is to manage hardware and software resources on a computer. It determines which programs can run and when they can run. It also controls the allocation of memory to running programs and access to peripheral devices such as hard disks and network interface cards.

When a user runs a program, the operating system creates a process—an instance of a computer program that is being executed. A process consists of several entities:

• The executable machine language program.
• A block of memory, which will include the executable code, a call stack that keeps track of active functions, a heap, and some other memory locations.
• Descriptors of resources that the operating system has allocated to the process—for example, file descriptors.
• Security information—for example, information specifying which hardware and software resources the process can access.
• Information about the state of the process, such as whether the process is ready to run or is waiting on some resource, the content of the registers, and information about the process’ memory.
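As a rough illustration of how an OS might record these entities, here is a toy C struct; such a record is often called a process control block. Every field name and size in this sketch is invented, and real operating systems track far more state than this.

/* A toy process record.  One field per entity in the list above;
   all names and sizes are invented for illustration. */
#define MAX_OPEN_FILES 16
#define NUM_REGS 32

typedef struct {
    void *text;                        /* the executable machine language program */
    void *stack;                       /* call stack for active functions         */
    void *heap;                        /* dynamically allocated memory            */
    int   open_files[MAX_OPEN_FILES];  /* descriptors of allocated resources      */
    int   owner_id;                    /* security info: who may access what      */
    enum { READY, RUNNING, BLOCKED }
          state;                       /* ready to run, running, or waiting       */
    long  saved_regs[NUM_REGS];        /* register contents, saved whenever the   */
                                       /* process isn't running                   */
} toy_process_t;

When the OS switches from one running process to another, it saves the registers of the process being suspended into its record and restores those of the process being resumed.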
Most modern operating systems are multitasking. This means that the operating system provides support for the apparent simultaneous execution of multiple programs. This is possible even on a system with a single core, since each process runs for a small interval of time (typically a few milliseconds), often called a time slice. After one running program has executed for a time slice, the operating system can run a different program. A multitasking OS may change the running process many times a minute, even though changing the running process can take a long time.

In a multitasking OS, if a process needs to wait for a resource—for example, it needs to read data from external storage—it will block. This means that it will stop executing and the operating system can run another process. However, many programs can continue to do useful work even though the part of the program that is currently executing must wait on a resource. For example, an airline reservation system that is blocked waiting for a seat map for one user could provide a list of available flights to another user. Threading provides a mechanism for programmers to divide their programs into more or less independent tasks, with the property that when one thread is blocked, another thread can be run. Furthermore, in most systems it’s possible to switch between threads much faster than it’s possible to switch between processes. This is because threads are “lighter weight” than processes. Threads are contained within processes, so they can use the same executable, and they usually share the same memory and the same I/O devices. In fact, two threads belonging to one process can share most of the process’ resources. The two most important exceptions are that they’ll need a record of their own program counters and they’ll need their own call stacks so that they can execute independently of each other.

If a process is the “master” thread of execution and threads are started and stopped by the process, then we can envision the process and its subsidiary threads as lines: when a thread is started, it forks off the process; when a thread terminates, it joins the process. See Figure 2.2.

[Figure 2.2: A process and two threads]
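As a concrete sketch of forking and joining, the following program uses the POSIX threads (Pthreads) API: pthread_create starts, or forks, a thread running a given function, and pthread_join makes the caller wait until that thread terminates. This is a minimal illustration, not an excerpt from the book; on most Unix-like systems it compiles with cc -pthread.

#include <stdio.h>
#include <pthread.h>

/* Each forked thread runs this function; the argument identifies it. */
void *work(void *arg) {
    long rank = (long) arg;
    printf("thread %ld: doing its share of the work\n", rank);
    return NULL;
}

int main(void) {
    pthread_t threads[2];

    /* Fork: two threads split off from the main (master) thread. */
    for (long t = 0; t < 2; t++)
        pthread_create(&threads[t], NULL, work, (void *) t);

    /* Join: the main thread waits here until both threads finish. */
    for (long t = 0; t < 2; t++)
        pthread_join(threads[t], NULL);

    printf("main: both threads have joined\n");
    return 0;
}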
2.2 MODIFICATIONS TO THE VON NEUMANN MODEL

As we noted earlier, since the first electronic digital computers were developed back in the 1940s, computer scientists and computer engineers have made many improvements to the basic von Neumann architecture. Many are targeted at reducing the problem of the von Neumann bottleneck, but many are also targeted at simply making CPUs faster. In this section we’ll look at three of these improvements: caching, virtual memory, and low-level parallelism.