FIGURE 1.2 (a) A shared-memory system and (b) a distributed-memory system

with a single directive, while Pthreads requires that we do something similar to our example. On the other hand, Pthreads provides some coordination constructs that are unavailable in OpenMP. OpenMP allows us to parallelize many programs with relative ease, while Pthreads provides us with some constructs that make other programs easier to parallelize.

1.6 CONCURRENT, PARALLEL, DISTRIBUTED

If you look at some other books on parallel computing or you search the Web for information on parallel computing, you're likely to also run across the terms concurrent computing and distributed computing. Although there isn't complete agreement on the distinction between the terms parallel, distributed, and concurrent, many authors make the following distinctions:

- In concurrent computing, a program is one in which multiple tasks can be in progress at any instant [4].
- In parallel computing, a program is one in which multiple tasks cooperate closely to solve a problem.
- In distributed computing, a program may need to cooperate with other programs to solve a problem.

So parallel and distributed programs are concurrent, but a program such as a multitasking operating system is also concurrent, even when it is run on a machine with only one core, since multiple tasks can be in progress at any instant. There isn't a clear-cut distinction between parallel and distributed programs, but a parallel program usually runs multiple tasks simultaneously on cores that are physically close to each other and that either share the same memory or are connected by a very high-speed
network. On the other hand, distributed programs tend to be more "loosely coupled." The tasks may be executed by multiple computers that are separated by large distances, and the tasks themselves are often executed by programs that were created independently. As examples, our two concurrent addition programs would be considered parallel by most authors, while a Web search program would be considered distributed.

But beware: there isn't general agreement on these terms. For example, many authors consider shared-memory programs to be "parallel" and distributed-memory programs to be "distributed." As our title suggests, we'll be interested in parallel programs, that is, programs in which closely coupled tasks cooperate to solve a problem.

1.7 THE REST OF THE BOOK

How can we use this book to help us write parallel programs?

First, when you're interested in high performance, whether you're writing serial or parallel programs, you need to know a little bit about the systems you're working with, both hardware and software. In Chapter 2, we'll give an overview of parallel hardware and software. In order to understand this discussion, it will be necessary to review some information on serial hardware and software. Much of the material in Chapter 2 won't be needed when we're getting started, so you might want to skim some of it and refer back to it occasionally when you're reading later chapters.

The heart of the book is contained in Chapters 3 through 6. Chapters 3, 4, and 5 provide a very elementary introduction to programming parallel systems in C using MPI, Pthreads, and OpenMP, respectively. The only prerequisite for reading these chapters is a knowledge of C programming. We've tried to make these chapters independent of each other, and you should be able to read them in any order. However, in order to make them independent, we did find it necessary to repeat some material. So if you've read one of the three chapters and you go on to read another, be prepared to skim over some of the material in the new chapter.

Chapter 6 puts together all we've learned in the preceding chapters and develops two fairly large programs in both a shared- and a distributed-memory setting. However, it should be possible to read much of this chapter even if you've only read one of Chapters 3, 4, or 5. The last chapter, Chapter 7, provides a few suggestions for further study on parallel programming.

1.8 A WORD OF WARNING

Before proceeding, a word of warning. It may be tempting to write parallel programs "by the seat of your pants," without taking the trouble to carefully design and incrementally develop your program. This will almost certainly be a mistake. Every
parallel program contains at least one serial program. Since we almost always need to coordinate the actions of multiple cores, writing parallel programs is almost always more complex than writing a serial program that solves the same problem. In fact, it is often far more complex. All the rules about careful design and development are usually far more important for the writing of parallel programs than they are for serial programs.

1.9 TYPOGRAPHICAL CONVENTIONS

We'll make use of the following typefaces in the text:

- Program text, displayed or within running text, will use the following typefaces:

     /* This is a short program */
     #include <stdio.h>

     int main(int argc, char* argv[]) {
        printf("hello, world\n");
        return 0;
     }

- Definitions are given in the body of the text, and the term being defined is printed in boldface type: A parallel program can make use of multiple cores.

- When we need to refer to the environment in which a program is being developed, we'll assume that we're using a UNIX shell, and we'll use a $ to indicate the shell prompt:

     $ gcc -g -Wall -o hello hello.c

- We'll specify the syntax of function calls with fixed argument lists by including a sample argument list. For example, the integer absolute value function, abs, in stdlib might have its syntax specified with

     int abs(int x);   /* Returns absolute value of int x */

  For more complicated syntax, we'll enclose required content in angle brackets <> and optional content in square brackets []. For example, the C if statement might have its syntax specified as follows:

     if ( <expression> )
        <statement1>
     [else
        <statement2>]

  This says that the if statement must include an expression enclosed in parentheses, and the right parenthesis must be followed by a statement. This statement can be followed by an optional else clause. If the else clause is present, it must include a second statement.
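As a concrete illustration of this last convention, an if statement written to match the template above might look like the one in the short program below. This is only an example of the notation, not text from the book: the variable x and the printed messages are made up.

     #include <stdio.h>

     int main(void) {
        int x = -5;   /* made-up value, just for illustration */

        /* The parenthesized expression is required and must be followed by a
           statement; the else clause, with its own statement, is optional. */
        if (x < 0)
           printf("x is negative\n");
        else
           printf("x is nonnegative\n");

        return 0;
     }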
1.10 SUMMARY

For many years we've enjoyed the fruits of ever faster processors. However, because of physical limitations, the rate of performance improvement in conventional processors is decreasing. In order to increase the power of processors, chipmakers have turned to multicore integrated circuits, that is, integrated circuits with multiple conventional processors on a single chip.

Ordinary serial programs, which are programs written for a conventional single-core processor, usually cannot exploit the presence of multiple cores, and it's unlikely that translation programs will be able to shoulder all the work of parallelizing serial programs, that is, converting them into parallel programs that can make use of multiple cores. As software developers, we need to learn to write parallel programs.

When we write parallel programs, we usually need to coordinate the work of the cores. This can involve communication among the cores, load balancing, and synchronization of the cores.

In this book we'll be learning to program parallel systems so that we can maximize their performance. We'll be using the C language with either MPI, Pthreads, or OpenMP. MPI is used for programming distributed-memory systems, and Pthreads and OpenMP are used for programming shared-memory systems. In distributed-memory systems, the cores have their own private memories, while in shared-memory systems, it's possible, in principle, for each core to access each memory location.

Concurrent programs can have multiple tasks in progress at any instant. Parallel and distributed programs usually have tasks that execute simultaneously. There isn't a hard and fast distinction between parallel and distributed programs, although in parallel programs the tasks are usually more tightly coupled.

Parallel programs are usually very complex, so it's even more important to use good program development techniques with parallel programs.

1.11 EXERCISES

1.1 Devise formulas for the functions that calculate my_first_i and my_last_i in the global sum example. Remember that each core should be assigned roughly the same number of computations in the loop. Hint: First consider the case when n is evenly divisible by p.

1.2 We've implicitly assumed that each call to Compute_next_value requires roughly the same amount of work as the other calls. How would you change your answer to the preceding question if call i = k requires k + 1 times as much work as the call with i = 0? So if the first call (i = 0) requires 2 milliseconds, the second call (i = 1) requires 4, the third (i = 2) requires 6, and so on.

1.3 Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, ...).
Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds, while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 - core_difference = 0, so 0 and 1 are paired in the first iteration.

1.4 As an alternative to the approach outlined in the preceding problem, we can use C's bitwise operators to implement the tree-structured global sum. In order to see how this works, it helps to write down the binary (base 2) representation of each of the core ranks and note the pairings during each stage:

     Cores            Stage 1          Stage 2          Stage 3
     0₁₀ = 000₂       1₁₀ = 001₂       2₁₀ = 010₂       4₁₀ = 100₂
     1₁₀ = 001₂       0₁₀ = 000₂       ×                ×
     2₁₀ = 010₂       3₁₀ = 011₂       0₁₀ = 000₂       ×
     3₁₀ = 011₂       2₁₀ = 010₂       ×                ×
     4₁₀ = 100₂       5₁₀ = 101₂       6₁₀ = 110₂       0₁₀ = 000₂
     5₁₀ = 101₂       4₁₀ = 100₂       ×                ×
     6₁₀ = 110₂       7₁₀ = 111₂       4₁₀ = 100₂       ×
     7₁₀ = 111₂       6₁₀ = 110₂       ×                ×

From the table we see that during the first stage each core is paired with the core whose rank differs in the rightmost or first bit. During the second stage, cores that continue are paired with the core whose rank differs in the second bit, and during the third stage, cores are paired with the core whose rank differs in the third bit. Thus, if we have a binary value bitmask that is 001₂ for the first stage, 010₂ for the second, and 100₂ for the third, we can get the rank of the core we're paired with by "inverting" the bit in our rank that is nonzero in bitmask. This can be done using the bitwise exclusive or operator ^ (a small sketch of this pairing computation appears after the exercises). Implement this algorithm in pseudo-code using the bitwise exclusive or and the left-shift operator.

1.5 What happens if your pseudo-code in Exercise 1.3 or Exercise 1.4 is run when the number of cores is not a power of two (e.g., 3, 5, 6, 7)? Can you modify the pseudo-code so that it will work correctly regardless of the number of cores?

1.6 Derive formulas for the number of receives and additions that core 0 carries out using
    a. the original pseudo-code for a global sum, and
    b. the tree-structured global sum.
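To make the pairing rule in Exercise 1.4 concrete, here is a minimal C sketch of just the partner computation. It is not a solution to the exercise: it only computes each core's partner at each stage with the exclusive or and the left shift, and it ignores the question of which cores remain active from one stage to the next. The function name compute_partner, the core count p = 8, and the printf output are made up for the illustration.

     #include <stdio.h>

     /* Partner of core `rank` at a given stage of the tree-structured sum,
        assuming the number of cores is a power of two.  The bitmask is
        001, 010, 100, ... in binary, i.e., 1 left-shifted by the stage. */
     int compute_partner(int rank, int stage) {
        int bitmask = 1 << stage;   /* 1, 2, 4, ... */
        return rank ^ bitmask;      /* flip the bit that is nonzero in bitmask */
     }

     int main(void) {
        int p = 8;                  /* made-up number of cores */
        for (int stage = 0; (1 << stage) < p; stage++)
           for (int rank = 0; rank < p; rank++)
              printf("stage %d: core %d is paired with core %d\n",
                     stage + 1, rank, compute_partner(rank, stage));
        return 0;
     }

Running this for p = 8 reproduces the pairings in the table above; for example, core 0 is paired with cores 1, 2, and 4 in stages 1, 2, and 3, respectively.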