An efficient parallel implementation of a serial program may not be obtained by finding efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by stepping back and devising an entirely new algorithm.

As an example, suppose that we need to compute n values and add them together. We know that this can be done with the following serial code:

   sum = 0;
   for (i = 0; i < n; i++) {
      x = Compute_next_value(. . .);
      sum += x;
   }

Now suppose we also have p cores and p is much smaller than n. Then each core can form a partial sum of approximately n/p values:

   my_sum = 0;
   my_first_i = . . . ;
   my_last_i = . . . ;
   for (my_i = my_first_i; my_i < my_last_i; my_i++) {
      my_x = Compute_next_value(. . .);
      my_sum += my_x;
   }

Here the prefix my_ indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores.

After each core completes execution of this code, its variable my_sum will store the sum of the values computed by its calls to Compute_next_value. For example, if there are eight cores, n = 24, and the 24 calls to Compute_next_value return the values

   1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9,

then the values stored in my_sum might be

   Core      0    1    2    3    4    5    6    7
   my_sum    8   19    7   15    7   13   12   14

Here we’re assuming the cores are identified by nonnegative integers in the range 0, 1, . . . , p − 1, where p is the number of cores.

When the cores are done computing their values of my_sum, they can form a global sum by sending their results to a designated “master” core, which can add their results:

   if (I’m the master core) {
      sum = my_sum;
      for each core other than myself {
         receive value from core;
         sum += value;
      }
   } else {
      send my_sum to the master;
   }
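The assignments to my_first_i and my_last_i are left unspecified above. As a concrete illustration only, one common way to split the n indices into roughly equal blocks is shown in the sketch below; the helper block_range and its handling of a remainder when p does not divide n are assumptions of this sketch, not code from the text.

   #include <stdio.h>

   /* Hypothetical helper: compute the half-open range [*first, *last) of
      indices assigned to the core with rank my_rank when n values are
      divided among p cores as evenly as possible; the first n % p cores
      each get one extra value. */
   void block_range(int my_rank, int p, int n, int* first, int* last) {
      int quotient  = n / p;
      int remainder = n % p;
      if (my_rank < remainder) {
         *first = my_rank * (quotient + 1);
         *last  = *first + quotient + 1;
      } else {
         *first = my_rank * quotient + remainder;
         *last  = *first + quotient;
      }
   }

   int main(void) {
      int n = 24, p = 8;   /* the values from the example above */
      for (int rank = 0; rank < p; rank++) {
         int my_first_i, my_last_i;
         block_range(rank, p, n, &my_first_i, &my_last_i);
         printf("core %d: indices %d through %d\n",
                rank, my_first_i, my_last_i - 1);
      }
      return 0;
   }

With n = 24 and p = 8 each core gets three consecutive indices, which is the assignment that produces the table of partial sums shown above.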
In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.

But you can probably see a better way to do this—especially if the number of cores is large. Instead of making the master core do all the work of computing the final sum, we can pair the cores so that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result of core 5, and so on. Then we can repeat the process with only the even-ranked cores: 0 adds in the result of 2, 4 adds in the result of 6, and so on. Now cores divisible by 4 repeat the process, and so on. See Figure 1.1. The circles contain the current value of each core’s sum, and the lines with arrows indicate that one core is sending its sum to another core. The plus signs indicate that a core is receiving a sum from another core and adding the received sum into its own sum.

FIGURE 1.1
Multiple cores forming a global sum. [The figure shows the eight partial sums 8, 19, 7, 15, 7, 13, 12, 14 combined pairwise into 27, 22, 20, 26, then into 49 and 46, and finally into 95 on core 0.]

For both “global” sums, the master core (core 0) does more work than any other core, and the length of time it takes the program to complete the final sum should be the length of time it takes for the master to complete. However, with eight cores, the master will carry out seven receives and adds using the first method, while with the second method it will only carry out three. So the second method results in an improvement of more than a factor of two. The difference becomes much more dramatic with large numbers of cores. With 1000 cores, the first method will require 999 receives and adds, while the second will only require 10, an improvement of almost a factor of 100!
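The pairing pattern in Figure 1.1 can also be expressed compactly in code. The following is a minimal serial sketch of the pattern only, not parallel code and not the book’s implementation: the array stands in for each core’s private my_sum, adding my_sum[rank + divisor] into my_sum[rank] stands in for a receive-and-add, and p is assumed to be a power of two.

   #include <stdio.h>

   int main(void) {
      /* One entry per core: the partial sums from the example. */
      double my_sum[] = {8, 19, 7, 15, 7, 13, 12, 14};
      int p = 8;   /* number of cores, assumed to be a power of two */

      /* Each pass halves the number of active cores: a core whose rank
         is a multiple of 2*divisor adds in the sum held by the core
         divisor ranks above it. */
      for (int divisor = 1; divisor < p; divisor *= 2)
         for (int rank = 0; rank + divisor < p; rank += 2 * divisor)
            my_sum[rank] += my_sum[rank + divisor];

      printf("global sum = %g\n", my_sum[0]);   /* prints 95 */
      return 0;
   }

In a real parallel program each core would run only its own branch of this pattern, receiving and adding at most once per pass, which is why core 0 performs just three receives and adds with eight cores.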
The first global sum is a fairly obvious generalization of the serial global sum: divide the work of adding among the cores, and after each core has computed its part of the sum, the master core simply repeats the basic serial addition—if there are p cores, then it needs to add p values. The second global sum, on the other hand, bears little relation to the original serial addition.

The point here is that it’s unlikely that a translation program would “discover” the second global sum. Rather, there would more likely be a predefined efficient global sum that the translation program would have access to. It could “recognize” the original serial loop and replace it with a precoded, efficient, parallel global sum.

We might expect that software could be written so that a large number of common serial constructs could be recognized and efficiently parallelized, that is, modified so that they can use multiple cores. However, as we apply this principle to ever more complex serial programs, it becomes more and more difficult to recognize the construct, and it becomes less and less likely that we’ll have a precoded, efficient parallelization.

Thus, we cannot simply continue to write serial programs; we must write parallel programs, programs that exploit the power of multiple processors.

1.4 HOW DO WE WRITE PARALLEL PROGRAMS?

There are a number of possible answers to this question, but most of them depend on the basic idea of partitioning the work to be done among the cores. There are two widely used approaches: task-parallelism and data-parallelism. In task-parallelism, we partition the various tasks carried out in solving the problem among the cores. In data-parallelism, we partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data.

As an example, suppose that Prof P has to teach a section of “Survey of English Literature.” Also suppose that Prof P has one hundred students in her section, so she’s been assigned four teaching assistants (TAs): Mr A, Ms B, Mr C, and Ms D. At last the semester is over, and Prof P makes up a final exam that consists of five questions. In order to grade the exam, she and her TAs might consider the following two options: each of them can grade all one hundred responses to one of the questions; say P grades question 1, A grades question 2, and so on. Alternatively, they can divide the one hundred exams into five piles of twenty exams each, and each of them can grade all the papers in one of the piles; P grades the papers in the first pile, A grades the papers in the second pile, and so on.

In both approaches the “cores” are the professor and her TAs. The first approach might be considered an example of task-parallelism. There are five tasks to be carried out: grading the first question, grading the second question, and so on. Presumably, the graders will be looking for different information in question 1, which is about
Shakespeare, from the information in question 2, which is about Milton, and so on. So the professor and her TAs will be “executing different instructions.”

On the other hand, the second approach might be considered an example of data-parallelism. The “data” are the students’ papers, which are divided among the cores, and each core applies more or less the same grading instructions to each paper.

The first part of the global sum example in Section 1.3 would probably be considered an example of data-parallelism. The data are the values computed by Compute_next_value, and each core carries out roughly the same operations on its assigned elements: it computes the required values by calling Compute_next_value and adds them together. The second part of the first global sum example might be considered an example of task-parallelism. There are two tasks: receiving and adding the cores’ partial sums, which is carried out by the master core, and giving the partial sum to the master core, which is carried out by the other cores.

When the cores can work independently, writing a parallel program is much the same as writing a serial program. Things get a good deal more complex when the cores need to coordinate their work. In the second global sum example, although the tree structure in the diagram is very easy to understand, writing the actual code is relatively complex. See Exercises 1.3 and 1.4. Unfortunately, it’s much more common for the cores to need coordination.

In both global sum examples, the coordination involves communication: one or more cores send their current partial sums to another core. The global sum examples should also involve coordination through load balancing: even though we didn’t give explicit formulas, it’s clear that we want the cores all to be assigned roughly the same number of values to compute. If, for example, one core has to compute most of the values, then the other cores will finish much sooner than the heavily loaded core, and their computational power will be wasted.

A third type of coordination is synchronization. As an example, suppose that instead of computing the values to be added, the values are read from stdin. Say x is an array that is read in by the master core:

   if (I’m the master core)
      for (my_i = 0; my_i < n; my_i++)
         scanf("%lf", &x[my_i]);

In most systems the cores are not automatically synchronized. Rather, each core works at its own pace. In this case, the problem is that we don’t want the other cores to race ahead and start computing their partial sums before the master is done initializing x and making it available to the other cores. That is, the cores need to wait before starting execution of the code:

   for (my_i = my_first_i; my_i < my_last_i; my_i++)
      my_sum += x[my_i];

We need to add in a point of synchronization between the initialization of x and the computation of the partial sums:

   Synchronize_cores();

The idea here is that each core will wait in the function Synchronize_cores until all the cores have entered the function—in particular, until the master core has entered this function.
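Synchronize_cores is deliberately left abstract here. On a shared-memory system, one common way to get this behavior is a barrier. The sketch below is an illustration only, not the book’s code: it uses a Pthreads barrier (available on most POSIX systems) so that no thread proceeds past pthread_barrier_wait until all of the threads have reached it.

   #include <stdio.h>
   #include <pthread.h>

   #define NUM_CORES 4

   pthread_barrier_t barrier;   /* shared by all threads */

   void* Thread_work(void* rank_p) {
      long my_rank = (long) rank_p;

      if (my_rank == 0)
         printf("core 0: initializing the shared array\n");

      /* Every thread blocks here until all NUM_CORES threads have
         arrived, so no thread starts its partial sum before core 0
         has finished its initialization. */
      pthread_barrier_wait(&barrier);

      printf("core %ld: computing its partial sum\n", my_rank);
      return NULL;
   }

   int main(void) {
      pthread_t threads[NUM_CORES];

      pthread_barrier_init(&barrier, NULL, NUM_CORES);
      for (long rank = 0; rank < NUM_CORES; rank++)
         pthread_create(&threads[rank], NULL, Thread_work, (void*) rank);
      for (long rank = 0; rank < NUM_CORES; rank++)
         pthread_join(threads[rank], NULL);
      pthread_barrier_destroy(&barrier);
      return 0;
   }

On a distributed-memory system the same role is played by a collective operation such as MPI_Barrier in MPI.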
Currently, the most powerful parallel programs are written using explicit parallel constructs, that is, they are written using extensions to languages such as C and C++. These programs include explicit instructions for parallelism: core 0 executes task 0, core 1 executes task 1, . . . , all cores synchronize, . . . , and so on, so such programs are often extremely complex. Furthermore, the complexity of modern cores often makes it necessary to use considerable care in writing the code that will be executed by a single core.

There are other options for writing parallel programs—for example, higher level languages—but they tend to sacrifice performance in order to make program development somewhat easier.

1.5 WHAT WE’LL BE DOING

We’ll be focusing on learning to write programs that are explicitly parallel. Our purpose is to learn the basics of programming parallel computers using the C language and three different extensions to C: the Message-Passing Interface or MPI, POSIX threads or Pthreads, and OpenMP. MPI and Pthreads are libraries of type definitions, functions, and macros that can be used in C programs. OpenMP consists of a library and some modifications to the C compiler.

You may well wonder why we’re learning three different extensions to C instead of just one. The answer has to do with both the extensions and parallel systems. There are two main types of parallel systems that we’ll be focusing on: shared-memory systems and distributed-memory systems. In a shared-memory system, the cores can share access to the computer’s memory; in principle, each core can read and write each memory location. In a shared-memory system, we can coordinate the cores by having them examine and update shared-memory locations. In a distributed-memory system, on the other hand, each core has its own, private memory, and the cores must communicate explicitly by doing something like sending messages across a network. Figure 1.2 shows a schematic of the two types of systems. Pthreads and OpenMP were designed for programming shared-memory systems. They provide mechanisms for accessing shared-memory locations. MPI, on the other hand, was designed for programming distributed-memory systems. It provides mechanisms for sending messages.

But why two extensions for shared-memory? OpenMP is a relatively high-level extension to C. For example, it can “parallelize” our addition loop

   sum = 0;
   for (i = 0; i < n; i++) {
      x = Compute_next_value(. . .);
      sum += x;
   }
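As a standalone illustration of what such a parallelization might look like: the parallel for directive and the reduction clause below are standard OpenMP, but this particular approach and the stand-in Compute_next_value are assumptions of this sketch, not necessarily how the text proceeds.

   #include <stdio.h>

   /* Stand-in for the unspecified Compute_next_value, so the sketch is
      self-contained and runnable: it simply returns i as a double. */
   double Compute_next_value(int i) {
      return (double) i;
   }

   int main(void) {
      int n = 1000;
      double sum = 0.0;

      /* Each thread is given a block of iterations and its own private
         partial sum; the reduction clause adds the partial sums together
         when the loop finishes. */
   #pragma omp parallel for reduction(+: sum)
      for (int i = 0; i < n; i++) {
         double x = Compute_next_value(i);
         sum += x;
      }

      printf("sum = %.0f\n", sum);   /* 0 + 1 + ... + 999 = 499500 */
      return 0;
   }

Compiled with an OpenMP-aware compiler (for example, gcc -fopenmp), the loop iterations are divided among threads; compiled without OpenMP support, the pragma is ignored and the code simply runs serially.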