processing power. This switch has exerted a tremendous impact on the software developer community [Sutter 2005]. Traditionally, the vast majority of software applications have been written as sequential programs, as described by von Neumann [1945] in his seminal report. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, computer users have become accustomed to the expectation that these programs run faster with each new generation of microprocessors. Such an expectation is no longer valid. A sequential program will run on only one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, thus reducing the growth opportunities of the entire computer industry.

Rather, the application software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter 2005]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that must be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.

1.1 GPUs AS PARALLEL COMPUTERS

Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu 2008]. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores. The multicores began as two-core processors, with the number of cores approximately doubling with each semiconductor process generation. A current exemplar is the recent Intel Core™ i7 microprocessor,
which has four processor cores, each of which is an out-of-order, multiple-instruction issue processor implementing the full x86 instruction set; the microprocessor supports hyperthreading with two hardware threads and is designed to maximize the execution speed of sequential programs.

In contrast, the many-core trajectory focuses more on the execution throughput of parallel applications. The many-cores began as a large number of much smaller cores, and, once again, the number of cores doubles with each generation. A current exemplar is the NVIDIA GeForce GTX 280 graphics processing unit (GPU) with 240 cores, each of which is a heavily multithreaded, in-order, single-instruction issue processor that shares its control and instruction cache with seven other cores. Many-core processors, especially the GPUs, have led the race of floating-point performance since 2003. This phenomenon is illustrated in Figure 1.1. While the performance improvement of general-purpose microprocessors has slowed significantly, the GPUs have continued to improve relentlessly. As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 to 1. These are not necessarily achievable application speeds but are merely the raw speed that the execution resources can potentially support in these chips: 1 teraflops (1000 gigaflops) versus 100 gigaflops in 2009.

FIGURE 1.1 Enlarging performance gap between GPUs and CPUs. (Peak GFLOPS, 2001 to 2009, for AMD GPUs, NVIDIA GPUs, and Intel CPUs. Courtesy: John Owens.)
Such a large performance gap between parallel and sequential execution has amounted to a significant “electrical potential” buildup, and at some point something will have to give. We have reached that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming: when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.

One might ask why there is such a large performance gap between many-core GPUs and general-purpose multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Figure 1.2. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large, complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2009, the new general-purpose, multicore microprocessors typically have four large processor cores designed to deliver strong sequential code performance.

FIGURE 1.2 CPUs and GPUs have fundamentally different design philosophies. (Block diagram contrasting the Control, Cache, ALU, and DRAM blocks of a CPU with the many-ALU layout of a GPU.)

Memory bandwidth is another important issue. Graphics chips have been operating at approximately 10 times the bandwidth of contemporaneously available CPU chips. In late 2006, the GeForce 8800 GTX, or simply
G80, was capable of moving data at about 85 gigabytes per second (GB/s) in and out of its main dynamic random access memory (DRAM). Because of frame buffer requirements and the relaxed memory model (the way various system software, applications, and input/output [I/O] devices expect their memory accesses to work), general-purpose processors have to satisfy requirements from legacy operating systems, applications, and I/O devices that make memory bandwidth more difficult to increase. In contrast, with simpler memory models and fewer legacy constraints, the GPU designers can more easily achieve higher memory bandwidth. The more recent NVIDIA GT200 chip supports about 150 GB/s. Microprocessor system memory bandwidth will probably not grow beyond 50 GB/s for about 3 years, so CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.

The design philosophy of the GPUs is shaped by the fast-growing video game industry, which exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. This demand motivates the GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution to date is to optimize for the execution throughput of massive numbers of threads. The hardware takes advantage of a large number of execution threads to find work to do when some of them are waiting for long-latency memory accesses, thus minimizing the control logic required for each execution thread. Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads accessing the same memory data do not all need to go to the DRAM. As a result, much more chip area is dedicated to floating-point calculations.

It should be clear now that GPUs are designed as numeric computing engines, and they will not perform well on some tasks on which CPUs are designed to perform well; therefore, one should expect that most applications will use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU. This is why the CUDA™ (Compute Unified Device Architecture) programming model, introduced by NVIDIA in 2007, is designed to support joint CPU/GPU execution of an application.¹

¹See Chapter 2 for more background on the evolution of GPU computing and the creation of CUDA.
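To make this joint CPU/GPU division of labor concrete, the following minimal sketch shows the general shape of a CUDA program: the host (CPU) code performs the sequential work of allocating and initializing data and orchestrating the computation, while a kernel carries out the numerically intensive part across many GPU threads. The kernel name, data size, and scaling computation here are invented purely for illustration; the actual programming details are developed in later chapters.

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard against extra threads in the last block
        data[i] = data[i] * factor;
}

int main(void)
{
    const int n = 1 << 20;                 // 1M elements (illustrative size)
    size_t bytes = n * sizeof(float);

    // Sequential part runs on the CPU: allocate and initialize host data.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Copy data to the GPU, launch the kernel, and copy the result back.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[100] = %f\n", h_data[100]);   // expect 200.0

    cudaFree(d_data);
    free(h_data);
    return 0;
}

The control-heavy, sequential work stays on the latency-optimized CPU, while the simple data-parallel loop body becomes a kernel executed by thousands of lightweight GPU threads, matching the division of labor described above.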
It is also important to note that performance is not the only decision factor when application developers choose the processors for running their applications. Several other factors can be even more important. First and foremost, the processors of choice must have a very large presence in the marketplace, referred to as the installation base of the processor. The reason is very simple. The cost of software development is best justified by a very large customer population. Applications that run on a processor with a small market presence will not have a large customer base. This has been a major problem with traditional parallel computing systems, which have negligible market presence compared to general-purpose microprocessors. Only a few elite applications funded by government and large corporations have been successfully developed on these traditional parallel computing systems. This has changed with the advent of many-core GPUs. Due to their popularity in the PC market, hundreds of millions of GPUs have been sold. Virtually all PCs have GPUs in them. The G80 processors and their successors have shipped more than 200 million units to date. This is the first time that massively parallel computing has been feasible with a mass-market product. Such a large market presence has made these GPUs economically attractive for application developers.

Other important decision factors are practical form factors and easy accessibility. Until 2006, parallel software applications usually ran on data-center servers or departmental clusters, but such execution environments tend to limit the use of these applications. For example, in an application such as medical imaging, it is fine to publish a paper based on a 64-node cluster machine, but actual clinical applications on magnetic resonance imaging (MRI) machines are all based on some combination of a PC and special hardware accelerators. The simple reason is that manufacturers such as GE and Siemens cannot sell MRIs with racks of clusters to clinical settings, whereas this is common in academic departmental settings. In fact, the National Institutes of Health (NIH) refused to fund parallel programming projects for some time; they felt that the impact of parallel software would be limited because huge cluster-based machines would not work in the clinical setting. Today, GE ships MRI products with GPUs, and NIH funds research using GPU computing.

Yet another important consideration in selecting a processor for executing numeric computing applications is support for the Institute of Electrical and Electronics Engineers (IEEE) floating-point standard. The standard makes it possible to have predictable results across processors from different vendors. While support for the IEEE floating-point standard