processing power. This switch has exerted a tremendous impact on the software developer community [Sutter 2005]. Traditionally, the vast majority of software applications have been written as sequential programs, as described by von Neumann [1945] in his seminal report. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, computer users have become accustomed to the expectation that these programs run faster with each new generation of microprocessors. Such an expectation is no longer valid. A sequential program will run on only one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, thus reducing the growth opportunities of the entire computer industry.

Rather, the application software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter 2005]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that must be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.

1.1 GPUs AS PARALLEL COMPUTERS

Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu 2008]. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores. The multicores began as two-core processors, with the number of cores approximately doubling with each semiconductor process generation. A current exemplar is the recent Intel Core™ i7 microprocessor,
which has four processor cores, each of which is an out-of-order, multiple-instruction issue processor implementing the full x86 instruction set; the microprocessor supports hyperthreading with two hardware threads and is designed to maximize the execution speed of sequential programs.

In contrast, the many-core trajectory focuses more on the execution throughput of parallel applications. The many-cores began as a large number of much smaller cores, and, once again, the number of cores doubles with each generation. A current exemplar is the NVIDIA GeForce GTX 280 graphics processing unit (GPU) with 240 cores, each of which is a heavily multithreaded, in-order, single-instruction issue processor that shares its control and instruction cache with seven other cores. Many-core processors, especially the GPUs, have led the race of floating-point performance since 2003. This phenomenon is illustrated in Figure 1.1. While the performance improvement of general-purpose microprocessors has slowed significantly, the GPUs have continued to improve relentlessly. As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 to 1. These are not necessarily achievable application speeds but are merely the raw speed that the execution resources can potentially support in these chips: 1 teraflops (1000 gigaflops) versus 100 gigaflops in 2009.

FIGURE 1.1 Enlarging performance gap between GPUs and CPUs. (Peak GFLOPS, 2001 to 2009, for AMD GPUs, NVIDIA GPUs, and Intel CPUs. Courtesy: John Owens.)
Such a large performance gap between parallel and sequential execution has amounted to a significant “electrical potential” buildup, and at some point something will have to give. We have reached that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming: when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.

One might ask why there is such a large performance gap between many-core GPUs and general-purpose multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Figure 1.2. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large, complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2009, the new general-purpose, multicore microprocessors typically have four large processor cores designed to deliver strong sequential code performance.

FIGURE 1.2 CPUs and GPUs have fundamentally different design philosophies. (Block diagram contrasting the Control, Cache, ALU, and DRAM blocks of a CPU with the many-ALU layout of a GPU.)

Memory bandwidth is another important issue. Graphics chips have been operating at approximately 10 times the bandwidth of contemporaneously available CPU chips. In late 2006, the GeForce 8800 GTX, or simply
G80, was capable of moving data at about 85 gigabytes per second (GB/s) in and out of its main dynamic random access memory (DRAM). Because of frame buffer requirements and the relaxed memory model (the way various system software, applications, and input/output [I/O] devices expect their memory accesses to work), general-purpose processors have to satisfy requirements from legacy operating systems, applications, and I/O devices that make memory bandwidth more difficult to increase. In contrast, with simpler memory models and fewer legacy constraints, the GPU designers can more easily achieve higher memory bandwidth. The more recent NVIDIA GT200 chip supports about 150 GB/s. Microprocessor system memory bandwidth will probably not grow beyond 50 GB/s for about 3 years, so CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.

The design philosophy of the GPUs is shaped by the fast-growing video game industry, which exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. This demand motivates the GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution to date is to optimize for the execution throughput of massive numbers of threads. The hardware takes advantage of a large number of execution threads to find work to do when some of them are waiting for long-latency memory accesses, thus minimizing the control logic required for each execution thread. Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads accessing the same memory data do not all need to go to the DRAM. As a result, much more chip area is dedicated to floating-point calculations.

It should be clear now that GPUs are designed as numeric computing engines, and they will not perform well on some tasks on which CPUs are designed to perform well; therefore, one should expect that most applications will use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU. This is why the CUDA™ (Compute Unified Device Architecture) programming model, introduced by NVIDIA in 2007, is designed to support joint CPU/GPU execution of an application.¹

¹See Chapter 2 for more background on the evolution of GPU computing and the creation of CUDA.
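To make this joint CPU/GPU division of labor concrete, the following minimal sketch shows the general shape of a CUDA program: the host (CPU) code performs the sequential work of allocating and initializing data and orchestrating the computation, while a kernel carries out the numerically intensive part across many GPU threads. The kernel name, data size, and scaling computation here are invented purely for illustration; the actual programming details are developed in later chapters.

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard against extra threads in the last block
        data[i] = data[i] * factor;
}

int main(void)
{
    const int n = 1 << 20;                 // 1M elements (illustrative size)
    size_t bytes = n * sizeof(float);

    // Sequential part runs on the CPU: allocate and initialize host data.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Copy data to the GPU, launch the kernel, and copy the result back.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[100] = %f\n", h_data[100]);   // expect 200.0

    cudaFree(d_data);
    free(h_data);
    return 0;
}

The control-heavy, sequential work stays on the latency-optimized CPU, while the simple data-parallel loop body becomes a kernel executed by thousands of lightweight GPU threads, matching the division of labor described above.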
It is also important to note that performance is not the only decision factor when application developers choose the processors for running their applications. Several other factors can be even more important. First and foremost, the processors of choice must have a very large presence in the marketplace, referred to as the installation base of the processor. The reason is very simple. The cost of software development is best justified by a very large customer population. Applications that run on a processor with a small market presence will not have a large customer base. This has been a major problem with traditional parallel computing systems, which have negligible market presence compared to general-purpose microprocessors. Only a few elite applications funded by government and large corporations have been successfully developed on these traditional parallel computing systems. This has changed with the advent of many-core GPUs. Due to their popularity in the PC market, hundreds of millions of GPUs have been sold. Virtually all PCs have GPUs in them. The G80 processors and their successors have shipped more than 200 million units to date. This is the first time that massively parallel computing has been feasible with a mass-market product. Such a large market presence has made these GPUs economically attractive for application developers.

Other important decision factors are practical form factors and easy accessibility. Until 2006, parallel software applications usually ran on data-center servers or departmental clusters, but such execution environments tend to limit the use of these applications. For example, in an application such as medical imaging, it is fine to publish a paper based on a 64-node cluster machine, but actual clinical applications on magnetic resonance imaging (MRI) machines are all based on some combination of a PC and special hardware accelerators. The simple reason is that manufacturers such as GE and Siemens cannot sell MRIs with racks of clusters to clinical settings, whereas this is common in academic departmental settings. In fact, the National Institutes of Health (NIH) refused to fund parallel programming projects for some time; they felt that the impact of parallel software would be limited because huge cluster-based machines would not work in the clinical setting. Today, GE ships MRI products with GPUs, and NIH funds research using GPU computing.

Yet another important consideration in selecting a processor for executing numeric computing applications is support for the Institute of Electrical and Electronics Engineers (IEEE) floating-point standard. The standard makes it possible to have predictable results across processors from different vendors. While support for the IEEE floating-point standard