On the other hand, if 99% of the execution time is in the parallel portion, a 100× speedup will reduce the application execution to 1.99% of the original time. This gives the entire application a 50× speedup; therefore, it is very important that an application has the vast majority of its execution in the parallel portion for a massively parallel processor to effectively speed up its execution.
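To spell out the arithmetic, this is an instance of Amdahl's law: if p is the fraction of the original execution time that is parallelizable and s is the speedup of that portion, the overall speedup is

$$S = \frac{1}{(1 - p) + p/s}.$$

With p = 0.99 and s = 100, the remaining execution time is 0.01 + 0.99/100 = 0.0199 of the original, which is an overall speedup of about 50×.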
Researchers have achieved speedups of more than 100× for some applications; however, this is typically achieved only after extensive optimization and tuning, after the algorithms have been enhanced so that more than 99.9% of the application execution time is in parallel execution. In general, straightforward parallelization of applications often saturates the memory (DRAM) bandwidth, resulting in only about a 10× speedup. The trick is to figure out how to get around memory bandwidth limitations, which involves applying one of many transformations to utilize specialized GPU on-chip memories and thus drastically reduce the number of accesses to the DRAM. One must, however, further optimize the code to get around limitations such as limited on-chip memory capacity. An important goal of this book is to help you fully understand these optimizations and become skilled in them.
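To give a flavor of the kind of transformation involved (the techniques themselves are covered in later chapters), the following CUDA sketch is our own illustration rather than an example from those chapters: it computes a 3-point moving average, and each thread block first stages its tile of the input in on-chip shared memory so that each input element is fetched from DRAM roughly once instead of three times. The kernel name, tile size, and zero-padded boundary handling are arbitrary choices.

#include <cuda_runtime.h>

#define TILE 256  /* threads per block; an arbitrary choice for this sketch */

/* Sketch of a DRAM-traffic-reducing transformation (our illustration):
   each block stages its tile of the input, plus one halo element on each
   side, in on-chip shared memory, then computes a 3-point average from
   shared memory. Out-of-range positions are treated as 0. */
__global__ void avg3(const float *in, float *out, int n) {
  __shared__ float tile[TILE + 2];
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  /* Stage the tile and its halo elements from DRAM into shared memory. */
  tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
  if (threadIdx.x == 0)
    tile[0] = (i > 0) ? in[i - 1] : 0.0f;
  if (threadIdx.x == blockDim.x - 1)
    tile[TILE + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
  __syncthreads();  /* wait until the whole tile is in shared memory */

  /* All three reads below hit on-chip shared memory, not DRAM. */
  if (i < n)
    out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}

/* Launch with TILE threads per block, for example:
   avg3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n); */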
Keep in mind that the level of speedup achieved over CPU execution can also reflect the suitability of the CPU to the application. In some applications, CPUs perform very well, making it more difficult to speed up performance using a GPU. Most applications have portions that can be much better executed by the CPU. Thus, one must give the CPU a fair chance to perform and make sure that code is written in such a way that GPUs complement CPU execution, thus properly exploiting the heterogeneous parallel computing capabilities of the combined CPU/GPU system. This is precisely what the CUDA programming model promotes, as we will further explain in the book.

Figure 1.4 illustrates the key parts of a typical application. Much of the code of a real application tends to be sequential. These portions are considered to be the pit area of the peach; trying to apply parallel computing techniques to these portions is like biting into the peach pit, not a good feeling! These portions are very difficult to parallelize, and CPUs tend to do a very good job on them. The good news is that these portions, although they can take up a large portion of the code, tend to account for only a small portion of the execution time of superapplications.

Then come the meat portions of the peach. These portions are easy to parallelize, as were some early graphics applications. For example, most of today's medical imaging applications are still running on combinations of microprocessor clusters and special-purpose hardware. The cost and size benefits of GPUs can drastically improve the quality of these applications. As illustrated in Figure 1.4, early GPGPUs cover only a small portion of the meat section, which corresponds to only a small fraction of the most exciting applications coming in the next 10 years. As we will see, the CUDA programming model is designed to cover a much larger section of the peach meat portions of exciting applications.

FIGURE 1.4 Coverage of sequential and parallel application portions. (Diagram labels: sequential portions, data parallel portions, traditional CPU coverage, GPGPU coverage, obstacles.)
1.4 PARALLEL PROGRAMMING LANGUAGES AND MODELS

Many parallel programming languages and models have been proposed in the past several decades [Mattson 2004]. The most widely used are the Message Passing Interface (MPI) for scalable cluster computing and OpenMP for shared-memory multiprocessor systems. MPI is a model in which the computing nodes in a cluster do not share memory [MPI 2009]; all data sharing and interaction must be done through explicit message passing. MPI has been successful in the high-performance scientific computing domain. Applications written in MPI have been known to run successfully on cluster computing systems with more than 100,000 nodes. The amount of effort required to port an application into MPI, however, can be extremely high due to the lack of shared memory across computing nodes. CUDA, on the other hand, provides shared memory for parallel execution in the GPU to address this difficulty. As for CPU and GPU communication, CUDA currently provides very limited shared memory capability between the CPU and the GPU. Programmers need to manage the data transfer between the CPU and the GPU in a manner similar to "one-sided" message passing, a capability whose absence has historically been considered a major weakness of MPI.
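To make this concrete, here is a minimal sketch of such host-managed transfers using the standard CUDA runtime calls; the buffer names and the element count N are placeholders of our choosing. The host explicitly stages input data into GPU memory before any kernels run and retrieves the results afterward, much as a process would put and get remote data in one-sided message passing.

#include <cuda_runtime.h>
#include <stdlib.h>

#define N 1024                                         /* placeholder element count */

int main(void) {
  float *h_data = (float *)malloc(N * sizeof(float));  /* host (CPU) buffer   */
  float *d_data;                                       /* device (GPU) buffer */
  cudaMalloc((void **)&d_data, N * sizeof(float));     /* allocate GPU memory */

  /* The host explicitly "puts" the data into GPU memory ...             */
  cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
  /* ... kernels launched here would operate on d_data ...               */
  /* ... and the host later "gets" the results back.                     */
  cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_data);
  free(h_data);
  return 0;
}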
OpenMP supports shared memory, so it offers the same advantage as CUDA in programming effort; however, it has not been able to scale beyond a couple hundred computing nodes due to thread management overheads and cache coherence hardware requirements. CUDA achieves much higher scalability with simple, low-overhead thread management and no cache coherence hardware requirements. As we will see, however, CUDA does not support as wide a range of applications as OpenMP due to these scalability tradeoffs. On the other hand, many superapplications fit well into the simple thread management model of CUDA and thus enjoy the scalability and performance.

Aspects of CUDA are similar to both MPI and OpenMP in that the programmer manages the parallel code constructs, although OpenMP compilers do more of the automation in managing parallel execution. Several ongoing research efforts aim at adding more automation of parallelism management and performance optimization to the CUDA tool chain. Developers who are experienced with MPI and OpenMP will find CUDA easy to learn. In particular, many of the performance optimization techniques are common among these models.
More recently, several major industry players, including Apple, Intel, AMD/ATI, and NVIDIA, have jointly developed a standardized programming model called OpenCL [Khronos 2009]. Similar to CUDA, the OpenCL programming model defines language extensions and runtime APIs to allow programmers to manage parallelism and data delivery in massively parallel processors. OpenCL is a standardized programming model in that applications developed in OpenCL can run without modification on all processors that support the OpenCL language extensions and API.

The reader might ask why the book is not based on OpenCL. The main reason is that OpenCL was still in its infancy when this book was written. The programming constructs in OpenCL are still at a lower level than those in CUDA and are much more tedious to use. Also, the speed achieved by an application expressed in OpenCL is still much lower than in CUDA on the platforms that support both. Because programming massively parallel processors is motivated by speed, we expect that most who program massively parallel processors will continue to use CUDA for the foreseeable future. Finally, those who are familiar with both OpenCL and CUDA know that there is a remarkable similarity between the key features of OpenCL and CUDA; that is, a CUDA programmer should be able to learn OpenCL programming with minimal effort. We will give a more detailed analysis of these similarities later in the book.
1.5 OVERARCHING GOALS

Our primary goal is to teach you, the reader, how to program massively parallel processors to achieve high performance, and our approach will not require a great deal of hardware expertise. Someone once said that if you don't care about performance, parallel programming is very easy; you can literally write a parallel program in an hour. But we're going to dedicate many pages to material on how to do high-performance parallel programming, and we believe that it will become easy once you develop the right insight and go about it the right way. In particular, we will focus on computational thinking techniques that will enable you to think about problems in ways that are amenable to high-performance parallel computing.

Note that hardware architecture features impose constraints. High-performance parallel programming on most chips will require some knowledge of how the hardware actually works. It will probably take 10 more years before we can build tools and machines so that most programmers can work without this knowledge. We will not be teaching computer architecture as a separate topic; instead, we will teach the essential computer architecture knowledge as part of our discussions on high-performance parallel programming techniques.
Our second goal is to teach parallel programming for correct functionality and reliability, which constitute a subtle issue in parallel computing. Those who have worked on parallel systems in the past know that achieving initial performance is not enough. The challenge is to achieve it in such a way that you can debug the code and support the users. We will show that with the CUDA programming model, which focuses on data parallelism, one can achieve both high performance and high reliability in one's applications.

Our third goal is to achieve scalability across future hardware generations by exploring approaches to parallel programming such that future machines, which will be more and more parallel, can run your code faster than today's machines. We want to help you master parallel programming so that your programs can scale up to the level of performance of new generations of machines.
Much technical knowledge will be required to achieve these goals, so we will cover quite a few principles and patterns of parallel programming in this book. We cannot guarantee that we will cover all of them, however, so we have selected several of the most useful and well-proven techniques to cover in detail. To complement your knowledge and expertise, we include a list of recommended literature. We are now ready to give you a quick overview of the rest of the book.

1.6 ORGANIZATION OF THE BOOK

Chapter 2 reviews the history of GPU computing. It begins with a brief summary of the evolution of graphics hardware toward greater programmability and then discusses the historical GPGPU movement. Many of the current features and limitations of CUDA GPUs have their roots in these developments, and a good understanding of them will help the reader to better understand the current state and future trends of hardware evolution that will continue to shape the types of applications that benefit from CUDA.
Chapter 3 introduces CUDA programming. This chapter assumes that the reader has prior experience with C programming. It first introduces CUDA as a simple, small extension to C that supports heterogeneous CPU/GPU joint computing and the widely used single-program, multiple-data (SPMD) parallel programming model. It then covers the thought process involved in (1) identifying the part of the application to be parallelized, (2) isolating the data to be used by the parallelized code by using an API function to allocate memory on the parallel computing device, (3) using an API function to transfer data to the parallel computing device, (4) developing a kernel function that will be executed by individual threads in the parallelized part, (5) launching a kernel function for execution by parallel threads, and (6) eventually transferring the data back to the host processor with an API function call. Although the objective of Chapter 3 is to teach enough concepts of the CUDA programming model so that readers can write a simple parallel CUDA program, it actually covers several basic skills needed to develop a parallel application based on any parallel programming model. We use a running example of matrix–matrix multiplication to make this chapter concrete.
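As a small preview of steps (4) and (5) (steps (2), (3), and (6) use the cudaMalloc/cudaMemcpy pattern sketched in Section 1.4), here is a minimal SPMD-style kernel and its launch. The example is our own vector addition rather than the matrix–matrix multiplication developed in Chapter 3, and the function names and block size are arbitrary choices.

#include <cuda_runtime.h>

/* Step (4): a kernel function, executed by many parallel threads (SPMD).
   Each thread computes its own global index and handles one element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique index per thread */
  if (i < n)                                      /* guard against extra threads */
    c[i] = a[i] + b[i];
}

/* Step (5): the host launches the kernel, choosing how many threads run it.
   d_a, d_b, and d_c are device pointers prepared in steps (2) and (3). */
void launch_vecAdd(const float *d_a, const float *d_b, float *d_c, int n) {
  int threadsPerBlock = 256;
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}

Every thread executes the same kernel code but operates on a different index, which is the essence of the SPMD model mentioned above.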