FIGURE 1.2 (a) A shared-memory system and (b) a distributed-memory system

with a single directive, while Pthreads requires that we do something similar to our example. On the other hand, Pthreads provides some coordination constructs that are unavailable in OpenMP. OpenMP allows us to parallelize many programs with relative ease, while Pthreads provides us with some constructs that make other programs easier to parallelize.

1.6 CONCURRENT, PARALLEL, DISTRIBUTED

If you look at some other books on parallel computing or you search the Web for information on parallel computing, you're likely to also run across the terms concurrent computing and distributed computing. Although there isn't complete agreement on the distinction between the terms parallel, distributed, and concurrent, many authors make the following distinctions:

- In concurrent computing, a program is one in which multiple tasks can be in progress at any instant [4].
- In parallel computing, a program is one in which multiple tasks cooperate closely to solve a problem.
- In distributed computing, a program may need to cooperate with other programs to solve a problem.

So parallel and distributed programs are concurrent, but a program such as a multitasking operating system is also concurrent, even when it is run on a machine with only one core, since multiple tasks can be in progress at any instant. There isn't a clear-cut distinction between parallel and distributed programs, but a parallel program usually runs multiple tasks simultaneously on cores that are physically close to each other and that either share the same memory or are connected by a very high-speed
network. On the other hand, distributed programs tend to be more "loosely coupled." The tasks may be executed by multiple computers that are separated by large distances, and the tasks themselves are often executed by programs that were created independently. As examples, our two concurrent addition programs would be considered parallel by most authors, while a Web search program would be considered distributed.

But beware: there isn't general agreement on these terms. For example, many authors consider shared-memory programs to be "parallel" and distributed-memory programs to be "distributed." As our title suggests, we'll be interested in parallel programs, that is, programs in which closely coupled tasks cooperate to solve a problem.

1.7 THE REST OF THE BOOK

How can we use this book to help us write parallel programs?

First, when you're interested in high performance, whether you're writing serial or parallel programs, you need to know a little bit about the systems you're working with, both hardware and software. In Chapter 2, we'll give an overview of parallel hardware and software. In order to understand this discussion, it will be necessary to review some information on serial hardware and software. Much of the material in Chapter 2 won't be needed when we're getting started, so you might want to skim some of it and refer back to it occasionally when you're reading later chapters.

The heart of the book is contained in Chapters 3 through 6. Chapters 3, 4, and 5 provide a very elementary introduction to programming parallel systems in C using MPI, Pthreads, and OpenMP, respectively. The only prerequisite for reading these chapters is a knowledge of C programming. We've tried to make these chapters independent of each other, and you should be able to read them in any order. However, in order to make them independent, we did find it necessary to repeat some material. So if you've read one of the three chapters and you go on to read another, be prepared to skim over some of the material in the new chapter.

Chapter 6 puts together all we've learned in the preceding chapters and develops two fairly large programs in both a shared- and a distributed-memory setting. However, it should be possible to read much of this chapter even if you've only read one of Chapters 3, 4, or 5. The last chapter, Chapter 7, provides a few suggestions for further study on parallel programming.

1.8 A WORD OF WARNING

Before proceeding, a word of warning. It may be tempting to write parallel programs "by the seat of your pants," without taking the trouble to carefully design and incrementally develop your program. This will almost certainly be a mistake. Every
parallel program contains at least one serial program. Since we almost always need to coordinate the actions of multiple cores, writing parallel programs is almost always more complex than writing a serial program that solves the same problem. In fact, it is often far more complex. All the rules about careful design and development are usually far more important for the writing of parallel programs than they are for serial programs.

1.9 TYPOGRAPHICAL CONVENTIONS

We'll make use of the following typefaces in the text:

- Program text, displayed or within running text, will use the following typefaces:

     /* This is a short program */
     #include <stdio.h>

     int main(int argc, char* argv[]) {
        printf("hello, world\n");
        return 0;
     }

- Definitions are given in the body of the text, and the term being defined is printed in boldface type: A parallel program can make use of multiple cores.

- When we need to refer to the environment in which a program is being developed, we'll assume that we're using a UNIX shell, and we'll use a $ to indicate the shell prompt:

     $ gcc -g -Wall -o hello hello.c

- We'll specify the syntax of function calls with fixed argument lists by including a sample argument list. For example, the integer absolute value function, abs, in stdlib might have its syntax specified with

     int abs(int x);   /* Returns absolute value of int x */

  For more complicated syntax, we'll enclose required content in angle brackets <> and optional content in square brackets []. For example, the C if statement might have its syntax specified as follows:

     if ( <expression> )
        <statement1>
     [else
        <statement2>]

  This says that the if statement must include an expression enclosed in parentheses, and the right parenthesis must be followed by a statement. This statement can be followed by an optional else clause. If the else clause is present, it must include a second statement.
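As a concrete illustration of this last convention, an if statement written to match the template above might look like the one in the short program below. This is only an example of the notation, not text from the book: the variable x and the printed messages are made up.

     #include <stdio.h>

     int main(void) {
        int x = -5;   /* made-up value, just for illustration */

        /* The parenthesized expression is required and must be followed by a
           statement; the else clause, with its own statement, is optional. */
        if (x < 0)
           printf("x is negative\n");
        else
           printf("x is nonnegative\n");

        return 0;
     }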
1.10 SUMMARY

For many years we've enjoyed the fruits of ever faster processors. However, because of physical limitations, the rate of performance improvement in conventional processors is decreasing. In order to increase the power of processors, chipmakers have turned to multicore integrated circuits, that is, integrated circuits with multiple conventional processors on a single chip.

Ordinary serial programs, which are programs written for a conventional single-core processor, usually cannot exploit the presence of multiple cores, and it's unlikely that translation programs will be able to shoulder all the work of parallelizing serial programs, that is, converting them into parallel programs that can make use of multiple cores. As software developers, we need to learn to write parallel programs.

When we write parallel programs, we usually need to coordinate the work of the cores. This can involve communication among the cores, load balancing, and synchronization of the cores.

In this book we'll be learning to program parallel systems so that we can maximize their performance. We'll be using the C language with either MPI, Pthreads, or OpenMP. MPI is used for programming distributed-memory systems, and Pthreads and OpenMP are used for programming shared-memory systems. In distributed-memory systems, the cores have their own private memories, while in shared-memory systems, it's possible, in principle, for each core to access each memory location.

Concurrent programs can have multiple tasks in progress at any instant. Parallel and distributed programs usually have tasks that execute simultaneously. There isn't a hard and fast distinction between parallel and distributed programs, although in parallel programs the tasks are usually more tightly coupled.

Parallel programs are usually very complex, so it's even more important to use good program development techniques with parallel programs.

1.11 EXERCISES

1.1 Devise formulas for the functions that calculate my_first_i and my_last_i in the global sum example. Remember that each core should be assigned roughly the same number of computations in the loop. Hint: First consider the case when n is evenly divisible by p.

1.2 We've implicitly assumed that each call to Compute_next_value requires roughly the same amount of work as the other calls. How would you change your answer to the preceding question if call i = k requires k + 1 times as much work as the call with i = 0? So if the first call (i = 0) requires 2 milliseconds, the second call (i = 1) requires 4, the third (i = 2) requires 6, and so on.

1.3 Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, ...).
Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds, while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 - core_difference = 0, so 0 and 1 are paired in the first iteration.

1.4 As an alternative to the approach outlined in the preceding problem, we can use C's bitwise operators to implement the tree-structured global sum. In order to see how this works, it helps to write down the binary (base 2) representation of each of the core ranks and note the pairings during each stage:

     Cores            Stage 1          Stage 2          Stage 3
     0₁₀ = 000₂       1₁₀ = 001₂       2₁₀ = 010₂       4₁₀ = 100₂
     1₁₀ = 001₂       0₁₀ = 000₂       ×                ×
     2₁₀ = 010₂       3₁₀ = 011₂       0₁₀ = 000₂       ×
     3₁₀ = 011₂       2₁₀ = 010₂       ×                ×
     4₁₀ = 100₂       5₁₀ = 101₂       6₁₀ = 110₂       0₁₀ = 000₂
     5₁₀ = 101₂       4₁₀ = 100₂       ×                ×
     6₁₀ = 110₂       7₁₀ = 111₂       4₁₀ = 100₂       ×
     7₁₀ = 111₂       6₁₀ = 110₂       ×                ×

From the table we see that during the first stage each core is paired with the core whose rank differs in the rightmost or first bit. During the second stage, cores that continue are paired with the core whose rank differs in the second bit, and during the third stage, cores are paired with the core whose rank differs in the third bit. Thus, if we have a binary value bitmask that is 001₂ for the first stage, 010₂ for the second, and 100₂ for the third, we can get the rank of the core we're paired with by "inverting" the bit in our rank that is nonzero in bitmask. This can be done using the bitwise exclusive or operator ^ (a small sketch of this pairing computation appears after the exercises). Implement this algorithm in pseudo-code using the bitwise exclusive or and the left-shift operator.

1.5 What happens if your pseudo-code in Exercise 1.3 or Exercise 1.4 is run when the number of cores is not a power of two (e.g., 3, 5, 6, 7)? Can you modify the pseudo-code so that it will work correctly regardless of the number of cores?

1.6 Derive formulas for the number of receives and additions that core 0 carries out using
    a. the original pseudo-code for a global sum, and
    b. the tree-structured global sum.
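To make the pairing rule in Exercise 1.4 concrete, here is a minimal C sketch of just the partner computation. It is not a solution to the exercise: it only computes each core's partner at each stage with the exclusive or and the left shift, and it ignores the question of which cores remain active from one stage to the next. The function name compute_partner, the core count p = 8, and the printf output are made up for the illustration.

     #include <stdio.h>

     /* Partner of core `rank` at a given stage of the tree-structured sum,
        assuming the number of cores is a power of two.  The bitmask is
        001, 010, 100, ... in binary, i.e., 1 left-shifted by the stage. */
     int compute_partner(int rank, int stage) {
        int bitmask = 1 << stage;   /* 1, 2, 4, ... */
        return rank ^ bitmask;      /* flip the bit that is nonzero in bitmask */
     }

     int main(void) {
        int p = 8;                  /* made-up number of cores */
        for (int stage = 0; (1 << stage) < p; stage++)
           for (int rank = 0; rank < p; rank++)
              printf("stage %d: core %d is paired with core %d\n",
                     stage + 1, rank, compute_partner(rank, stage));
        return 0;
     }

Running this for p = 8 reproduces the pairings in the table above; for example, core 0 is paired with cores 1, 2, and 4 in stages 1, 2, and 3, respectively.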