An efficient parallel implementation of a serial program may not be obtained by finding efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by stepping back and devising an entirely new algorithm.

As an example, suppose that we need to compute n values and add them together. We know that this can be done with the following serial code:

   sum = 0;
   for (i = 0; i < n; i++) {
      x = Compute_next_value(. . .);
      sum += x;
   }

Now suppose we also have p cores and p is much smaller than n. Then each core can form a partial sum of approximately n/p values:

   my_sum = 0;
   my_first_i = . . . ;
   my_last_i = . . . ;
   for (my_i = my_first_i; my_i < my_last_i; my_i++) {
      my_x = Compute_next_value(. . .);
      my_sum += my_x;
   }

Here the prefix my_ indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores.

After each core completes execution of this code, its variable my_sum will store the sum of the values computed by its calls to Compute_next_value. For example, if there are eight cores, n = 24, and the 24 calls to Compute_next_value return the values

   1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9,

then the values stored in my_sum might be

   Core      0    1    2    3    4    5    6    7
   my_sum    8   19    7   15    7   13   12   14

Here we’re assuming the cores are identified by nonnegative integers in the range 0, 1, . . . , p − 1, where p is the number of cores.

When the cores are done computing their values of my_sum, they can form a global sum by sending their results to a designated “master” core, which can add their results:

   if (I’m the master core) {
      sum = my_sum;
      for each core other than myself {
         receive value from core;
         sum += value;
      }
   } else {
      send my_sum to the master;
   }
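The assignments to my_first_i and my_last_i are left unspecified above. As a concrete illustration only, one common way to split the n indices into roughly equal blocks is shown in the sketch below; the helper block_range and its handling of a remainder when p does not divide n are assumptions of this sketch, not code from the text.

   #include <stdio.h>

   /* Hypothetical helper: compute the half-open range [*first, *last) of
      indices assigned to the core with rank my_rank when n values are
      divided among p cores as evenly as possible; the first n % p cores
      each get one extra value. */
   void block_range(int my_rank, int p, int n, int* first, int* last) {
      int quotient  = n / p;
      int remainder = n % p;
      if (my_rank < remainder) {
         *first = my_rank * (quotient + 1);
         *last  = *first + quotient + 1;
      } else {
         *first = my_rank * quotient + remainder;
         *last  = *first + quotient;
      }
   }

   int main(void) {
      int n = 24, p = 8;   /* the values from the example above */
      for (int rank = 0; rank < p; rank++) {
         int my_first_i, my_last_i;
         block_range(rank, p, n, &my_first_i, &my_last_i);
         printf("core %d: indices %d through %d\n",
                rank, my_first_i, my_last_i - 1);
      }
      return 0;
   }

With n = 24 and p = 8 each core gets three consecutive indices, which is the assignment that produces the table of partial sums shown above.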
In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.

But you can probably see a better way to do this—especially if the number of cores is large. Instead of making the master core do all the work of computing the final sum, we can pair the cores so that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result of core 5, and so on. Then we can repeat the process with only the even-ranked cores: 0 adds in the result of 2, 4 adds in the result of 6, and so on. Now cores divisible by 4 repeat the process, and so on. See Figure 1.1. The circles contain the current value of each core’s sum, and the lines with arrows indicate that one core is sending its sum to another core. The plus signs indicate that a core is receiving a sum from another core and adding the received sum into its own sum.

FIGURE 1.1
Multiple cores forming a global sum. [The figure shows the eight partial sums 8, 19, 7, 15, 7, 13, 12, 14 combined pairwise into 27, 22, 20, 26, then into 49 and 46, and finally into 95 on core 0.]

For both “global” sums, the master core (core 0) does more work than any other core, and the length of time it takes the program to complete the final sum should be the length of time it takes for the master to complete. However, with eight cores, the master will carry out seven receives and adds using the first method, while with the second method it will only carry out three. So the second method results in an improvement of more than a factor of two. The difference becomes much more dramatic with large numbers of cores. With 1000 cores, the first method will require 999 receives and adds, while the second will only require 10, an improvement of almost a factor of 100!
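The pairing pattern in Figure 1.1 can also be expressed compactly in code. The following is a minimal serial sketch of the pattern only, not parallel code and not the book’s implementation: the array stands in for each core’s private my_sum, adding my_sum[rank + divisor] into my_sum[rank] stands in for a receive-and-add, and p is assumed to be a power of two.

   #include <stdio.h>

   int main(void) {
      /* One entry per core: the partial sums from the example. */
      double my_sum[] = {8, 19, 7, 15, 7, 13, 12, 14};
      int p = 8;   /* number of cores, assumed to be a power of two */

      /* Each pass halves the number of active cores: a core whose rank
         is a multiple of 2*divisor adds in the sum held by the core
         divisor ranks above it. */
      for (int divisor = 1; divisor < p; divisor *= 2)
         for (int rank = 0; rank + divisor < p; rank += 2 * divisor)
            my_sum[rank] += my_sum[rank + divisor];

      printf("global sum = %g\n", my_sum[0]);   /* prints 95 */
      return 0;
   }

In a real parallel program each core would run only its own branch of this pattern, receiving and adding at most once per pass, which is why core 0 performs just three receives and adds with eight cores.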
The first global sum is a fairly obvious generalization of the serial global sum: divide the work of adding among the cores, and after each core has computed its part of the sum, the master core simply repeats the basic serial addition—if there are p cores, then it needs to add p values. The second global sum, on the other hand, bears little relation to the original serial addition.

The point here is that it’s unlikely that a translation program would “discover” the second global sum. Rather, there would more likely be a predefined efficient global sum that the translation program would have access to. It could “recognize” the original serial loop and replace it with a precoded, efficient, parallel global sum.

We might expect that software could be written so that a large number of common serial constructs could be recognized and efficiently parallelized, that is, modified so that they can use multiple cores. However, as we apply this principle to ever more complex serial programs, it becomes more and more difficult to recognize the construct, and it becomes less and less likely that we’ll have a precoded, efficient parallelization.

Thus, we cannot simply continue to write serial programs; we must write parallel programs, programs that exploit the power of multiple processors.

1.4 HOW DO WE WRITE PARALLEL PROGRAMS?

There are a number of possible answers to this question, but most of them depend on the basic idea of partitioning the work to be done among the cores. There are two widely used approaches: task-parallelism and data-parallelism. In task-parallelism, we partition the various tasks carried out in solving the problem among the cores. In data-parallelism, we partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data.

As an example, suppose that Prof P has to teach a section of “Survey of English Literature.” Also suppose that Prof P has one hundred students in her section, so she’s been assigned four teaching assistants (TAs): Mr A, Ms B, Mr C, and Ms D. At last the semester is over, and Prof P makes up a final exam that consists of five questions. In order to grade the exam, she and her TAs might consider the following two options: each of them can grade all one hundred responses to one of the questions; say P grades question 1, A grades question 2, and so on. Alternatively, they can divide the one hundred exams into five piles of twenty exams each, and each of them can grade all the papers in one of the piles; P grades the papers in the first pile, A grades the papers in the second pile, and so on.

In both approaches the “cores” are the professor and her TAs. The first approach might be considered an example of task-parallelism. There are five tasks to be carried out: grading the first question, grading the second question, and so on. Presumably, the graders will be looking for different information in question 1, which is about
Shakespeare, from the information in question 2, which is about Milton, and so on. So the professor and her TAs will be “executing different instructions.”

On the other hand, the second approach might be considered an example of data-parallelism. The “data” are the students’ papers, which are divided among the cores, and each core applies more or less the same grading instructions to each paper.

The first part of the global sum example in Section 1.3 would probably be considered an example of data-parallelism. The data are the values computed by Compute_next_value, and each core carries out roughly the same operations on its assigned elements: it computes the required values by calling Compute_next_value and adds them together. The second part of the first global sum example might be considered an example of task-parallelism. There are two tasks: receiving and adding the cores’ partial sums, which is carried out by the master core, and giving the partial sum to the master core, which is carried out by the other cores.

When the cores can work independently, writing a parallel program is much the same as writing a serial program. Things get a good deal more complex when the cores need to coordinate their work. In the second global sum example, although the tree structure in the diagram is very easy to understand, writing the actual code is relatively complex. See Exercises 1.3 and 1.4. Unfortunately, it’s much more common for the cores to need coordination.

In both global sum examples, the coordination involves communication: one or more cores send their current partial sums to another core. The global sum examples should also involve coordination through load balancing: even though we didn’t give explicit formulas, it’s clear that we want the cores all to be assigned roughly the same number of values to compute. If, for example, one core has to compute most of the values, then the other cores will finish much sooner than the heavily loaded core, and their computational power will be wasted.

A third type of coordination is synchronization. As an example, suppose that instead of computing the values to be added, the values are read from stdin. Say x is an array that is read in by the master core:

   if (I’m the master core)
      for (my_i = 0; my_i < n; my_i++)
         scanf("%lf", &x[my_i]);

In most systems the cores are not automatically synchronized. Rather, each core works at its own pace. In this case, the problem is that we don’t want the other cores to race ahead and start computing their partial sums before the master is done initializing x and making it available to the other cores. That is, the cores need to wait before starting execution of the code:

   for (my_i = my_first_i; my_i < my_last_i; my_i++)
      my_sum += x[my_i];

We need to add in a point of synchronization between the initialization of x and the computation of the partial sums:

   Synchronize_cores();

The idea here is that each core will wait in the function Synchronize_cores until all the cores have entered the function—in particular, until the master core has entered this function.
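Synchronize_cores is deliberately left abstract here. On a shared-memory system, one common way to get this behavior is a barrier. The sketch below is an illustration only, not the book’s code: it uses a Pthreads barrier (available on most POSIX systems) so that no thread proceeds past pthread_barrier_wait until all of the threads have reached it.

   #include <stdio.h>
   #include <pthread.h>

   #define NUM_CORES 4

   pthread_barrier_t barrier;   /* shared by all threads */

   void* Thread_work(void* rank_p) {
      long my_rank = (long) rank_p;

      if (my_rank == 0)
         printf("core 0: initializing the shared array\n");

      /* Every thread blocks here until all NUM_CORES threads have
         arrived, so no thread starts its partial sum before core 0
         has finished its initialization. */
      pthread_barrier_wait(&barrier);

      printf("core %ld: computing its partial sum\n", my_rank);
      return NULL;
   }

   int main(void) {
      pthread_t threads[NUM_CORES];

      pthread_barrier_init(&barrier, NULL, NUM_CORES);
      for (long rank = 0; rank < NUM_CORES; rank++)
         pthread_create(&threads[rank], NULL, Thread_work, (void*) rank);
      for (long rank = 0; rank < NUM_CORES; rank++)
         pthread_join(threads[rank], NULL);
      pthread_barrier_destroy(&barrier);
      return 0;
   }

On a distributed-memory system the same role is played by a collective operation such as MPI_Barrier in MPI.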
Currently, the most powerful parallel programs are written using explicit parallel constructs, that is, they are written using extensions to languages such as C and C++. These programs include explicit instructions for parallelism: core 0 executes task 0, core 1 executes task 1, . . . , all cores synchronize, . . . , and so on, so such programs are often extremely complex. Furthermore, the complexity of modern cores often makes it necessary to use considerable care in writing the code that will be executed by a single core.

There are other options for writing parallel programs—for example, higher level languages—but they tend to sacrifice performance in order to make program development somewhat easier.

1.5 WHAT WE’LL BE DOING

We’ll be focusing on learning to write programs that are explicitly parallel. Our purpose is to learn the basics of programming parallel computers using the C language and three different extensions to C: the Message-Passing Interface or MPI, POSIX threads or Pthreads, and OpenMP. MPI and Pthreads are libraries of type definitions, functions, and macros that can be used in C programs. OpenMP consists of a library and some modifications to the C compiler.

You may well wonder why we’re learning three different extensions to C instead of just one. The answer has to do with both the extensions and parallel systems. There are two main types of parallel systems that we’ll be focusing on: shared-memory systems and distributed-memory systems. In a shared-memory system, the cores can share access to the computer’s memory; in principle, each core can read and write each memory location. In a shared-memory system, we can coordinate the cores by having them examine and update shared-memory locations. In a distributed-memory system, on the other hand, each core has its own, private memory, and the cores must communicate explicitly by doing something like sending messages across a network. Figure 1.2 shows a schematic of the two types of systems. Pthreads and OpenMP were designed for programming shared-memory systems. They provide mechanisms for accessing shared-memory locations. MPI, on the other hand, was designed for programming distributed-memory systems. It provides mechanisms for sending messages.

But why two extensions for shared-memory? OpenMP is a relatively high-level extension to C. For example, it can “parallelize” our addition loop

   sum = 0;
   for (i = 0; i < n; i++) {
      x = Compute_next_value(. . .);
      sum += x;
   }
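As a standalone illustration of what such a parallelization might look like: the parallel for directive and the reduction clause below are standard OpenMP, but this particular approach and the stand-in Compute_next_value are assumptions of this sketch, not necessarily how the text proceeds.

   #include <stdio.h>

   /* Stand-in for the unspecified Compute_next_value, so the sketch is
      self-contained and runnable: it simply returns i as a double. */
   double Compute_next_value(int i) {
      return (double) i;
   }

   int main(void) {
      int n = 1000;
      double sum = 0.0;

      /* Each thread is given a block of iterations and its own private
         partial sum; the reduction clause adds the partial sums together
         when the loop finishes. */
   #pragma omp parallel for reduction(+: sum)
      for (int i = 0; i < n; i++) {
         double x = Compute_next_value(i);
         sum += x;
      }

      printf("sum = %.0f\n", sum);   /* 0 + 1 + ... + 999 = 499500 */
      return 0;
   }

Compiled with an OpenMP-aware compiler (for example, gcc -fopenmp), the loop iterations are divided among threads; compiled without OpenMP support, the pragma is ignored and the code simply runs serially.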