parts that are idle are kept in a block of secondary storage called swap space. Like CPU caches, virtual memory operates on blocks of data and instructions. These blocks are commonly called pages, and since secondary storage access can be hundreds of thousands of times slower than main memory access, pages are relatively large; most systems have a fixed page size that currently ranges from 4 to 16 kilobytes.

We may run into trouble if we try to assign physical memory addresses to pages when we compile a program. If we do this, then each page of the program can only be assigned to one block of memory, and with a multitasking operating system, we're likely to have many programs wanting to use the same block of memory. In order to avoid this problem, when a program is compiled, its pages are assigned virtual page numbers. Then, when the program is run, a table is created that maps the virtual page numbers to physical addresses. When the program is run and it refers to a virtual address, this page table is used to translate the virtual address into a physical address. If the creation of the page table is managed by the operating system, it can ensure that the memory used by one program doesn't overlap the memory used by another.

A drawback to the use of a page table is that it can double the time needed to access a location in main memory. Suppose, for example, that we want to execute an instruction in main memory. Then our executing program will have the virtual address of this instruction, but before we can find the instruction in memory, we'll need to translate the virtual address into a physical address. In order to do this, we'll need to find the page in memory that contains the instruction. Now the virtual page number is stored as a part of the virtual address. As an example, suppose our addresses are 32 bits and our pages are 4 kilobytes = 4096 bytes. Then each byte in the page can be identified with 12 bits, since 2^12 = 4096. Thus, we can use the low-order 12 bits of the virtual address to locate a byte within a page, and the remaining bits of the virtual address can be used to locate an individual page. See Table 2.2. Observe that the virtual page number can be computed from the virtual address without going to memory. However, once we've found the virtual page number, we'll need to access the page table to translate it into a physical page. If the required part of the page table isn't in cache, we'll need to load it from memory. After it's loaded, we can translate our virtual address to a physical address and get the required instruction.

Table 2.2 Virtual Address Divided into Virtual Page Number and Byte Offset

                       Virtual Address
   Virtual Page Number              Byte Offset
   31 30 ··· 13 12                  11 10 ··· 1 0
    1  0 ···  1  1                   0  0 ··· 1 1
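To make the address arithmetic concrete, here is a minimal C sketch (not from the text) that splits a 32-bit virtual address into its virtual page number and byte offset for 4-kilobyte pages, and then forms a physical address from a page-table entry. The page_table array and the frame number stored in it are made up for illustration.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u      /* 4 kilobytes            */
#define OFFSET_BITS 12u        /* since 2^12 = 4096      */

int main(void) {
   /* Hypothetical page table: maps virtual page numbers to
      physical page (frame) numbers. */
   uint32_t page_table[16] = {0};
   page_table[5] = 42;         /* pretend virtual page 5 lives in frame 42 */

   uint32_t virtual_addr = (5u << OFFSET_BITS) | 0x123u;   /* page 5, offset 0x123 */

   uint32_t page_number = virtual_addr >> OFFSET_BITS;     /* high-order 20 bits */
   uint32_t byte_offset = virtual_addr & (PAGE_SIZE - 1);  /* low-order 12 bits  */

   /* Translation: look up the physical frame, then reattach the offset. */
   uint32_t physical_addr = (page_table[page_number] << OFFSET_BITS) | byte_offset;

   printf("virtual 0x%08x -> page %u, offset 0x%x -> physical 0x%08x\n",
          virtual_addr, page_number, byte_offset, physical_addr);
   return 0;
}

The extra read of page_table[page_number] is the second memory access that the paragraph above warns about.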
This is clearly a problem. Although multiple programs can use main memory at more or less the same time, using a page table has the potential to significantly increase each program's overall run-time. In order to address this issue, processors have a special address translation cache called a translation-lookaside buffer, or TLB. It caches a small number of entries (typically 16–512) from the page table in very fast memory. Using the principle of spatial and temporal locality, we would expect that most of our memory references will be to pages whose physical address is stored in the TLB, and the number of memory references that require accesses to the page table in main memory will be substantially reduced.

The terminology for the TLB is the same as the terminology for caches. When we look for an address and the virtual page number is in the TLB, it's called a TLB hit. If it's not in the TLB, it's called a miss. The terminology for the page table, however, has an important difference from the terminology for caches. If we attempt to access a page that's not in memory, that is, the page table doesn't have a valid physical address for the page and the page is only stored on disk, then the attempted access is called a page fault.

The relative slowness of disk accesses has a couple of additional consequences for virtual memory. First, with CPU caches we could handle write-misses with either a write-through or write-back scheme. With virtual memory, however, disk accesses are so expensive that they should be avoided whenever possible, so virtual memory always uses a write-back scheme. This can be handled by keeping a bit on each page in memory that indicates whether the page has been updated. If it has been updated, when it is evicted from main memory, it will be written to disk. Second, since disk accesses are so slow, management of the page table and the handling of disk accesses can be done by the operating system. Thus, even though we as programmers don't directly control virtual memory, unlike CPU caches, which are handled by system hardware, virtual memory is usually controlled by a combination of system hardware and operating system software.
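The two-step lookup can be pictured with a small C sketch. This is a simplification, not how any real processor or operating system implements translation: a tiny, fully associative "TLB" array is searched first, and the in-memory page table is consulted only on a TLB miss. The sizes, the tlb and page_table arrays, and the translate function are all hypothetical.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 12u
#define TLB_ENTRIES 4          /* real TLBs hold roughly 16-512 entries */
#define NUM_PAGES   16

/* One cached translation: virtual page number -> physical frame number. */
typedef struct { uint32_t vpn, frame; int valid; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES];   /* hypothetical in-memory page table */

/* Translate a virtual address, checking the TLB before the page table. */
uint32_t translate(uint32_t vaddr) {
   uint32_t vpn    = vaddr >> OFFSET_BITS;
   uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

   for (int i = 0; i < TLB_ENTRIES; i++)      /* TLB hit: no page-table access */
      if (tlb[i].valid && tlb[i].vpn == vpn)
         return (tlb[i].frame << OFFSET_BITS) | offset;

   /* TLB miss: go to the page table (an extra memory access), then cache
      the translation.  A realistic replacement policy is omitted. */
   uint32_t frame = page_table[vpn];
   tlb[0] = (tlb_entry_t){ .vpn = vpn, .frame = frame, .valid = 1 };
   return (frame << OFFSET_BITS) | offset;
}

int main(void) {
   page_table[3] = 7;                         /* virtual page 3 -> frame 7 */
   printf("0x%08x\n", translate((3u << OFFSET_BITS) | 0x10u));  /* miss, then cached */
   printf("0x%08x\n", translate((3u << OFFSET_BITS) | 0x20u));  /* hit               */
   return 0;
}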
2.2.5 Instruction-level parallelism

Instruction-level parallelism, or ILP, attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions. There are two main approaches to ILP: pipelining, in which functional units are arranged in stages, and multiple issue, in which multiple instructions can be simultaneously initiated. Both approaches are used in virtually all modern CPUs.

Pipelining

The principle of pipelining is similar to a factory assembly line: while one team is bolting a car's engine to the chassis, another team can connect the transmission to the engine and the driveshaft of a car that's already been processed by the first team, and a third team can bolt the body to the chassis in a car that's been processed by the first two teams. As an example involving computation, suppose we want to add the floating point numbers 9.87 × 10^4 and 6.54 × 10^3. Then we can use the following steps:

Time   Operation            Operand 1      Operand 2       Result
0      Fetch operands       9.87 × 10^4    6.54 × 10^3
1      Compare exponents    9.87 × 10^4    6.54 × 10^3
2      Shift one operand    9.87 × 10^4    0.654 × 10^4
3      Add                  9.87 × 10^4    0.654 × 10^4    10.524 × 10^4
4      Normalize result     9.87 × 10^4    0.654 × 10^4    1.0524 × 10^5
5      Round result         9.87 × 10^4    0.654 × 10^4    1.05 × 10^5
6      Store result         9.87 × 10^4    0.654 × 10^4    1.05 × 10^5

Here we're using base 10 and a three-digit mantissa, or significand, with one digit to the left of the decimal point. Thus, in the example, normalizing shifts the decimal point one unit to the left, and rounding rounds to three digits.

Now if each of the operations takes one nanosecond (10^-9 seconds), the addition operation will take seven nanoseconds. So if we execute the code

float x[1000], y[1000], z[1000];
. . .
for (i = 0; i < 1000; i++)
   z[i] = x[i] + y[i];

the for loop will take something like 7000 nanoseconds.

As an alternative, suppose we divide our floating point adder into seven separate pieces of hardware or functional units. The first unit will fetch two operands, the second will compare exponents, and so on. Also suppose that the output of one functional unit is the input to the next. So, for example, the output of the functional unit that adds the two values is the input to the unit that normalizes the result. Then a single floating point addition will still take seven nanoseconds. However, when we execute the for loop, we can fetch x[1] and y[1] while we're comparing the exponents of x[0] and y[0]. More generally, it's possible for us to simultaneously execute seven different stages in seven different additions. See Table 2.3. From the table we see that after time 5, the pipelined loop produces a result every nanosecond, instead of every seven nanoseconds, so the total time to execute the for loop has been reduced from 7000 nanoseconds to 1006 nanoseconds, an improvement of almost a factor of seven.

In general, a pipeline with k stages won't get a k-fold improvement in performance. For example, if the times required by the various functional units are different, then the stages will effectively run at the speed of the slowest functional unit. Furthermore, delays such as waiting for an operand to become available can cause the pipeline to stall. See Exercise 2.1 for more details on the performance of pipelines.
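The 1006-nanosecond figure follows from a simple count: the first addition must pass through all seven stages, and after that a new result emerges once per stage time. Here is a rough C check of that arithmetic; the function and variable names are ours, not the text's.

#include <stdio.h>

/* Approximate time, in nanoseconds, to perform n additions with a
   k-stage pipeline when each stage takes stage_ns nanoseconds:
   the first result takes k stages, and each later result takes one more. */
long pipelined_ns(long n, long k, long stage_ns) {
   return (k + n - 1) * stage_ns;
}

int main(void) {
   long n = 1000, k = 7, stage_ns = 1;
   printf("unpipelined: %ld ns\n", n * k * stage_ns);             /* 7000 ns */
   printf("pipelined:   %ld ns\n", pipelined_ns(n, k, stage_ns)); /* 1006 ns */
   return 0;
}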
Table 2.3 Pipelined Addition. Numbers in the Table Are Subscripts of Operands/Results

Time    Fetch   Compare   Shift   Add   Normalize   Round   Store
0         0
1         1       0
2         2       1         0
3         3       2         1      0
4         4       3         2      1       0
5         5       4         3      2       1          0
6         6       5         4      3       2          1        0
···      ···     ···       ···    ···     ···        ···      ···
999      999     998       997    996     995        994      993
1000             999       998    997     996        995      994
1001                       999    998     997        996      995
1002                              999     998        997      996
1003                                      999        998      997
1004                                                 999      998
1005                                                          999

Multiple issue

Pipelines improve performance by taking individual pieces of hardware or functional units and connecting them in sequence. Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program. For example, if we have two complete floating point adders, we can approximately halve the time it takes to execute the loop

for (i = 0; i < 1000; i++)
   z[i] = x[i] + y[i];

While the first adder is computing z[0], the second can compute z[1]; while the first is computing z[2], the second can compute z[3]; and so on.

If the functional units are scheduled at compile time, the multiple issue system is said to use static multiple issue. If they're scheduled at run-time, the system is said to use dynamic multiple issue. A processor that supports dynamic multiple issue is sometimes said to be superscalar.
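One way to see what a multiple issue processor (or the compiler working on its behalf) is looking for is to unroll the loop by hand so that each iteration contains two independent additions; with two floating point adders, the pair can in principle be issued together. This is only an illustrative sketch, not a claim about what any particular compiler or processor actually does.

#include <stdio.h>

int main(void) {
   float x[1000], y[1000], z[1000];
   int i;

   for (i = 0; i < 1000; i++) { x[i] = i; y[i] = 2 * i; }   /* some data */

   /* Unrolled by a factor of two: the two statements in the body touch
      disjoint array elements, so a processor with two floating point
      adders could execute them at the same time. */
   for (i = 0; i < 1000; i += 2) {
      z[i]   = x[i]   + y[i];
      z[i+1] = x[i+1] + y[i+1];
   }

   printf("z[999] = %f\n", z[999]);
   return 0;
}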
Of course, in order to make use of multiple issue, the system must find instructions that can be executed simultaneously. One of the most important techniques is speculation. In speculation, the compiler or the processor makes a guess about an instruction, and then executes the instruction on the basis of the guess. As a simple example, in the following code, the system might predict that the outcome of z = x + y will give z a positive value, and, as a consequence, it will assign w = x.

z = x + y;
if (z > 0)
   w = x;
else
   w = y;

As another example, in the code

z = x + y;
w = *a_p;   /* a_p is a pointer */

the system might predict that a_p does not refer to z, and hence it can simultaneously execute the two assignments.

As both examples make clear, speculative execution must allow for the possibility that the predicted behavior is incorrect. In the first example, we will need to go back and execute the assignment w = y if the assignment z = x + y results in a value that's not positive. In the second example, if a_p does point to z, we'll need to re-execute the assignment w = *a_p.

If the compiler does the speculation, it will usually insert code that tests whether the speculation was correct, and, if not, takes corrective action. If the hardware does the speculation, the processor usually stores the result(s) of the speculative execution in a buffer. When it's known that the speculation was correct, the contents of the buffer are transferred to registers or memory. If the speculation was incorrect, the contents of the buffer are discarded and the instruction is re-executed.

While dynamic multiple issue systems can execute instructions out of order, in current generation systems the instructions are still loaded in order and the results of the instructions are also committed in order. That is, the results of instructions are written to registers and memory in the program-specified order.

Optimizing compilers, on the other hand, can reorder instructions. This, as we'll see later, can have important consequences for shared-memory programming.

2.2.6 Hardware multithreading

ILP can be very difficult to exploit: a program with a long sequence of dependent statements offers few opportunities. For example, in a direct calculation of the Fibonacci numbers

f[0] = f[1] = 1;
for (i = 2; i <= n; i++)
   f[i] = f[i-1] + f[i-2];

there's essentially no opportunity for simultaneous execution of instructions.

Thread-level parallelism, or TLP, attempts to provide parallelism through the simultaneous execution of different threads, so it provides a coarser-grained parallelism than ILP; that is, the program units that are being simultaneously executed (threads) are larger or coarser than the finer-grained units (individual instructions).
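As a preview of what coarser-grained parallelism looks like to a programmer, here is a minimal sketch, using POSIX threads, that splits the earlier array addition between two threads, so that each thread is a far larger unit of work than a single instruction. The add_slice function and the fixed two-way split are our own illustration, not code from this chapter.

/* Compile with: cc tlp_example.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NUM_THREADS 2

float x[N], y[N], z[N];

/* Each thread adds its share of the arrays: thread 0 handles
   elements 0..N/2-1, thread 1 handles N/2..N-1. */
void *add_slice(void *arg) {
   long rank  = (long) arg;
   long start = rank * (N / NUM_THREADS);
   long stop  = start + N / NUM_THREADS;
   for (long i = start; i < stop; i++)
      z[i] = x[i] + y[i];
   return NULL;
}

int main(void) {
   pthread_t threads[NUM_THREADS];

   for (long i = 0; i < N; i++) { x[i] = i; y[i] = 2 * i; }

   for (long rank = 0; rank < NUM_THREADS; rank++)
      pthread_create(&threads[rank], NULL, add_slice, (void *) rank);
   for (long rank = 0; rank < NUM_THREADS; rank++)
      pthread_join(threads[rank], NULL);

   printf("z[%d] = %f\n", N - 1, z[N - 1]);   /* 999 + 1998 = 2997 */
   return 0;
}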