was not strong in early GPUs, this has also changed for new generations of GPUs since the introduction of the G80. As we will discuss in Chapter 7, GPU support for the IEEE floating-point standard has become comparable to that of the CPUs. As a result, one can expect that more numerical applications will be ported to GPUs and yield results comparable to those obtained on CPUs. Today, a major remaining issue is that the floating-point arithmetic units of the GPUs are primarily single precision. Applications that truly require double-precision floating point were thus not suitable for GPU execution; however, this has changed with the recent GPUs, whose double-precision execution speed approaches about half that of single precision, a level that high-end CPU cores achieve. This makes the GPUs suitable for even more numerical applications.

Until 2006, graphics chips were very difficult to use because programmers had to use the equivalent of graphics application programming interface (API) functions to access the processor cores, meaning that OpenGL or Direct3D techniques were needed to program these chips. This technique was called GPGPU, short for general-purpose programming using a graphics processing unit. Even with a higher-level programming environment, the underlying code was still limited by the APIs, which restrict the kinds of applications that one can actually write for these chips. That is why only a few people could master the skills necessary to use these chips to achieve performance for a limited number of applications; consequently, GPGPU did not become a widespread programming phenomenon. Nonetheless, this technology was sufficiently exciting to inspire some heroic efforts and excellent results.

Everything changed in 2007 with the release of CUDA [NVIDIA 2007]. NVIDIA actually devoted silicon area to facilitate the ease of parallel programming, so this did not represent a change in software alone; additional hardware was added to the chip. In the G80 and its successor chips for parallel computing, CUDA programs no longer go through the graphics interface at all. Instead, a new general-purpose parallel programming interface on the silicon chip serves the requests of CUDA programs. Moreover, all of the other software layers were redone as well, so that programmers can use the familiar C/C++ programming tools. Some of our students tried to do their lab assignments using the old OpenGL-based programming interface, and their experience helped them to greatly appreciate the improvements that eliminated the need to use the graphics APIs for computing applications.
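To give a concrete flavor of this difference, the following is a minimal sketch of a CUDA computation written with ordinary C tools rather than a graphics API. The kernel name, array size, and launch configuration are illustrative assumptions made for this sketch; they are not taken from the text or from any particular NVIDIA sample.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Illustrative kernel: each GPU thread scales one array element.
   The names and sizes here are assumptions for illustration only. */
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;                  /* one million elements */
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);    /* allocate GPU global memory */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements. */
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("data[0] = %f\n", h_data[0]);    /* expect 2.000000 */

    cudaFree(d_data);
    free(h_data);
    return 0;
}

Note that nothing in this sketch refers to vertices, pixels, or textures; the host-to-device copies simply travel over PCI Express, the communication path discussed in the next section.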
1.2 ARCHITECTURE OF A MODERN GPU

Figure 1.3 shows the architecture of a typical CUDA-capable GPU. It is organized into an array of highly threaded streaming multiprocessors (SMs). In Figure 1.3, two SMs form a building block; however, the number of SMs in a building block can vary from one generation of CUDA GPUs to another. Also, each SM in Figure 1.3 has a number of streaming processors (SPs) that share control logic and an instruction cache. Each GPU currently comes with up to 4 gigabytes of graphics double data rate (GDDR) DRAM, referred to as global memory in Figure 1.3. These GDDR DRAMs differ from the system DRAMs on the CPU motherboard in that they are essentially the frame buffer memory used for graphics. For graphics applications, they hold video images and texture information for three-dimensional (3D) rendering, but for computing they function as very-high-bandwidth, off-chip memory, though with somewhat more latency than typical system memory. For massively parallel applications, the higher bandwidth makes up for the longer latency.

The G80 that introduced the CUDA architecture had 86.4 GB/s of memory bandwidth, plus an 8-GB/s communication bandwidth with the CPU. A CUDA application can transfer data from the system memory at 4 GB/s and at the same time upload data back to the system memory at 4 GB/s, for a combined total of 8 GB/s. The communication bandwidth is much lower than the memory bandwidth and may seem like a limitation; however, the PCI Express bandwidth is comparable to the CPU front-side bus bandwidth to the system memory, so it is really not the limitation it would seem at first. The communication bandwidth is also expected to grow as the CPU bus bandwidth to the system memory grows in the future.

The massively parallel G80 chip has 128 SPs (16 SMs, each with 8 SPs). Each SP has a multiply-add (MAD) unit and an additional multiply unit. With 128 SPs, that is a total of over 500 gigaflops. In addition, special-function units perform floating-point operations such as square root (SQRT), as well as transcendental functions. With 240 SPs, the GT200 exceeds 1 teraflops. Because each SP is massively threaded, it can run thousands of threads per application. A good application typically runs 5,000 to 12,000 threads simultaneously on this chip. For those who are used to simultaneous multithreading, note that Intel CPUs support 2 or 4 threads per core, depending on the machine model. The G80 chip supports up to 768 threads per SM, which adds up to about 12,000 threads for the chip. The more recent GT200 supports 1,024 threads per SM and up to about 30,000 threads for the chip.
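As a rough check of these figures, the arithmetic below counts each SP as performing a 2-flop MAD plus a 1-flop multiply per cycle and assumes the roughly 1.35-GHz shader clock of the original G80-based products; that clock rate is supplied here as an assumption and does not appear in the text.

\[
128~\text{SPs} \times 3~\text{flops/cycle} \times 1.35~\text{GHz} \approx 518~\text{gigaflops}.
\]

The thread counts follow directly from the per-SM limits:

\[
\text{G80: } 16~\text{SMs} \times 768~\text{threads/SM} = 12{,}288 \approx 12{,}000~\text{threads};
\qquad
\text{GT200: } \tfrac{240~\text{SPs}}{8~\text{SPs/SM}} = 30~\text{SMs}, \quad 30 \times 1{,}024 = 30{,}720 \approx 30{,}000~\text{threads}.
\]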
FIGURE 1.3 Architecture of a CUDA-capable GPU. (The block diagram labels are: Host, Input assembler, Thread execution manager, an array of SM building blocks each containing parallel data caches and texture units, Load/store units, and Global memory.)
Thus, the level of parallelism supported by GPU hardware is increasing quickly. It is very important to strive for such levels of parallelism when developing GPU parallel computing applications.

1.3 WHY MORE SPEED OR PARALLELISM?

As we stated in Section 1.1, the main motivation for massively parallel programming is for applications to enjoy a continued increase in speed in future hardware generations. One might ask why applications will continue to demand increased speed. Many applications that we have today seem to be running quite fast enough. As we will discuss in the case study chapters, when an application is suitable for parallel execution, a good implementation on a GPU can achieve more than 100 times (100x) speedup over sequential execution. If the application includes what we call data parallelism, it is often a simple task to achieve a 10x speedup with just a few hours of work. For anything beyond that, we invite you to keep reading!

Despite the myriad computing applications in today's world, many exciting mass-market applications of the future will be what we currently consider to be supercomputing applications, or superapplications. For example, the biology research community is moving more and more into the molecular level. Microscopes, arguably the most important instruments in molecular biology, used to rely on optics or electronic instrumentation, but there are limitations to the molecular-level observations that we can make with these instruments. These limitations can be effectively addressed by incorporating a computational model to simulate the underlying molecular activities, with boundary conditions set by traditional instrumentation. From the simulation we can measure even more details and test more hypotheses than can ever be imagined with traditional instrumentation alone. These simulations will continue to benefit from increasing computing speed in the foreseeable future, in terms of the size of the biological system that can be modeled and the length of reaction time that can be simulated within a tolerable response time. These enhancements will have tremendous implications for science and medicine.

For applications such as video and audio coding and manipulation, consider our satisfaction with digital high-definition television (HDTV) versus older National Television System Committee (NTSC) television. Once we experience the level of detail offered by HDTV, it is very hard to go back to older technology. But consider all the processing that is necessary for that HDTV. It is a very parallel process, as are 3D imaging and
visualization. In the future, new functionalities such as view synthesis and high-resolution display of low-resolution videos will demand that televisions have more computing power.

Among the benefits offered by greater computing speed are much better user interfaces. Consider the Apple iPhone interfaces; the user enjoys a much more natural interface with the touch screen compared to other cell phone devices, even though the iPhone has a limited-size window. Undoubtedly, future versions of these devices will incorporate higher definition, three-dimensional perspectives, and voice and computer vision based interfaces, requiring even more computing speed.

Similar developments are underway in consumer electronic gaming. Imagine driving a car in a game today; the game is, in fact, simply a prearranged set of scenes. If your car bumps into an obstacle, the course of your vehicle does not change; only the game score changes. Your wheels are not bent or damaged, and it is no more difficult to drive, regardless of whether you bumped your wheels or even lost a wheel. With increased computing speed, the games can be based on dynamic simulation rather than prearranged scenes. We can expect to see more of these realistic effects in the future: accidents will damage your wheels, and your online driving experience will be much more realistic. Realistic modeling and simulation of physics effects are known to demand large amounts of computing power.

All of the new applications that we mentioned involve simulating a concurrent world in different ways and at different levels, with tremendous amounts of data being processed. And, with this huge quantity of data, much of the computation can be done on different parts of the data in parallel, although they will have to be reconciled at some point. Techniques for doing so are well known to those who work with such applications on a regular basis. Thus, various granularities of parallelism do exist, but the programming model must not hinder parallel implementation, and the data delivery must be properly managed. CUDA includes such a programming model along with hardware support that facilitates parallel implementation. We aim to teach application developers the fundamental techniques for managing parallel execution and delivering data.

How much speedup can be expected from parallelizing these superapplications? It depends on the portion of the application that can be parallelized. If the percentage of time spent in the part that can be parallelized is 30%, a 100x speedup of the parallel portion will reduce the execution time by 29.7%, and the speedup for the entire application will be only about 1.4x. In fact, even an infinite amount of speedup in the parallel portion can remove no more than 30% of the execution time, achieving a speedup of no more than 1.43x.
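This is an instance of Amdahl's Law. Normalizing the sequential execution time to 1, with a parallelizable fraction of 0.3 and a 100x speedup of that portion, the arithmetic works out as follows:

\[
T_{\text{new}} = (1 - 0.3) + \frac{0.3}{100} = 0.703,
\qquad
1 - 0.703 = 0.297 \;\;(29.7\%~\text{of the time removed}),
\qquad
\text{speedup} = \frac{1}{0.703} \approx 1.42\times.
\]

Even with an infinitely fast parallel portion,

\[
T_{\text{new}} \to 1 - 0.3 = 0.7,
\qquad
\text{speedup} \le \frac{1}{0.7} \approx 1.43\times.
\]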