FIGURE 1-8 [Figure: a bus-based multicore architecture; multiple processors, each with its own cache, connected by a bus to shared memory]

GPUs represent a many-core architecture, and have virtually every type of parallelism described previously: multithreading, MIMD, SIMD, and instruction-level parallelism. NVIDIA coined the phrase Single Instruction, Multiple Thread (SIMT) for this type of architecture.

GPUs and CPUs do not share a common ancestor. Historically, GPUs were graphics accelerators. Only recently have GPUs evolved to be powerful, general-purpose, fully programmable, task and data parallel processors, ideally suited to tackle massively parallel computing problems.

GPU CORE VERSUS CPU CORE

Even though many-core and multicore are used to label GPU and CPU architectures, a GPU core is quite different from a CPU core.

A CPU core, relatively heavy-weight, is designed for very complex control logic, seeking to optimize the execution of sequential programs.

A GPU core, relatively light-weight, is optimized for data-parallel tasks with simpler control logic, focusing on the throughput of parallel programs.

HETEROGENEOUS COMPUTING

In the earliest days, computers contained only central processing units (CPUs) designed to run general programming tasks. Over the last decade, mainstream computers in the high-performance computing community have increasingly incorporated other processing elements. The most prevalent is the GPU, originally designed to perform specialized graphics computations in parallel. Over time, GPUs have become more powerful and more generalized, enabling them to be applied to general-purpose parallel computing tasks with excellent performance and high power efficiency.

Typically, CPUs and GPUs are discrete processing components connected by the PCI-Express bus within a single compute node. In this type of architecture, GPUs are referred to as discrete devices.
The switch from homogeneous systems to heterogeneous systems is a milestone in the history of high-performance computing. Homogeneous computing uses one or more processors of the same architecture to execute an application. Heterogeneous computing instead uses a suite of processor architectures to execute an application, assigning tasks to the architectures to which they are well suited and yielding performance improvements as a result.

Although heterogeneous systems provide significant advantages compared to traditional high-performance computing systems, effective use of such systems is currently limited by the increased application design complexity. While parallel programming has received much recent attention, the inclusion of heterogeneous resources adds complexity.

If you are new to parallel programming, then you can benefit from the performance improvements and advanced software tools now available on heterogeneous architectures. If you are already a good parallel programmer, adapting to parallel programming on heterogeneous architectures is straightforward.

Heterogeneous Architecture

A typical heterogeneous compute node nowadays consists of two multicore CPU sockets and two or more many-core GPUs. A GPU is currently not a standalone platform but a co-processor to a CPU. Therefore, GPUs must operate in conjunction with a CPU-based host through a PCI-Express bus, as shown in Figure 1-9. That is why, in GPU computing terms, the CPU is called the host and the GPU is called the device.

FIGURE 1-9 [Figure: a CPU with complex control logic, a large cache, a few ALUs, and its own DRAM, connected over the PCIe bus to a GPU with many ALUs and its own DRAM]

A heterogeneous application consists of two parts:

➤ Host code
➤ Device code

Host code runs on CPUs and device code runs on GPUs. An application executing on a heterogeneous platform is typically initialized by the CPU. The CPU code is responsible for managing the environment, code, and data for the device before loading compute-intensive tasks on the device. In computationally intensive applications, program sections often exhibit a rich amount of data parallelism, and GPUs are used to accelerate the execution of these sections. When a hardware component that is physically separate from the CPU is used to accelerate computationally intensive sections of an application, it is referred to as a hardware accelerator. GPUs are arguably the most common example of a hardware accelerator.
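To make the host/device split concrete, here is a minimal sketch of a program containing both parts (an illustrative addition, not an example from this chapter): the __global__ kernel is device code that runs on the GPU, while main is host code that runs on the CPU and launches the kernel.

#include <stdio.h>
#include <cuda_runtime.h>

// Device code: runs on the GPU, once per thread.
__global__ void helloFromGPU(void)
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: runs on the CPU, sets up and launches the device code.
int main(void)
{
    printf("Hello from the CPU\n");

    helloFromGPU<<<1, 4>>>();    // launch the kernel with 4 GPU threads
    cudaDeviceSynchronize();     // wait for the device to finish

    return 0;
}

Compile with nvcc (for example, nvcc hello.cu); device-side printf requires a compute capability of 2.0 or higher.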
NVIDIA's GPU computing platform is enabled on the following product families:

➤ Tegra
➤ GeForce
➤ Quadro
➤ Tesla

The Tegra product family is designed for mobile and embedded devices such as tablets and phones, GeForce for consumer graphics, Quadro for professional visualization, and Tesla for datacenter parallel computing. Fermi, the GPU accelerator in the Tesla product family, has recently gained widespread use as a computing accelerator for high-performance computing applications. Fermi, released by NVIDIA in 2010, is the world's first complete GPU computing architecture. Fermi GPU accelerators have already redefined and accelerated high-performance computing capabilities in many areas, such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer-aided engineering, computational fluid dynamics, and data analysis. Kepler, the generation of GPU computing architecture that followed Fermi, was released in the fall of 2012. It offers much higher processing power than the prior generation, provides new methods to optimize and increase parallel workload execution on the GPU, and is expected to further revolutionize high-performance computing. The Tegra K1 contains a Kepler GPU and provides everything you need to unlock the power of the GPU for embedded applications.

There are two important features that describe GPU capability:

➤ Number of CUDA cores
➤ Memory size

Accordingly, there are two different metrics for describing GPU performance:

➤ Peak computational performance
➤ Memory bandwidth

Peak computational performance is a measure of computational capability, usually defined as how many single-precision or double-precision floating-point calculations can be processed per second. Peak performance is usually expressed in gflops (billions of floating-point operations per second) or tflops (trillions of floating-point operations per second). Memory bandwidth is a measure of the rate at which data can be read from or stored to memory, and is usually expressed in gigabytes per second (GB/s). Table 1-1 provides a brief summary of Fermi and Kepler architectural and performance features.

TABLE 1-1: Fermi and Kepler

                      FERMI (TESLA C2050)    KEPLER (TESLA K10)
CUDA Cores            448                    2 × 1536
Memory                6 GB                   8 GB
Peak Performance*     1.03 Tflops            4.58 Tflops
Memory Bandwidth      144 GB/s               320 GB/s

* Peak single-precision floating-point performance
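You can inspect these capability features for the device in your own system through the CUDA runtime. The sketch below (an illustrative addition, not an example from this chapter) prints the memory size and derives the theoretical peak memory bandwidth from the memory clock and bus width reported in cudaDeviceProp; the factor of 2 accounts for double-data-rate memory.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("Global memory: %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));

    // Theoretical peak bandwidth =
    //   2 (double data rate) x memory clock (Hz) x bus width (bytes)
    double peakBW = 2.0 * (prop.memoryClockRate * 1000.0)
                  * (prop.memoryBusWidth / 8.0) / 1.0e9;
    printf("Theoretical peak memory bandwidth: %.0f GB/s\n", peakBW);

    return 0;
}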
Most examples in this book can be run on both Fermi and Kepler GPUs. Some examples require special architectural features only included with Kepler GPUs.

COMPUTE CAPABILITIES

NVIDIA uses a special term, compute capability, to describe hardware versions of GPU accelerators that belong to the entire Tesla product family. The versions of Tesla products are given in Table 1-2. Devices with the same major revision number are of the same core architecture.

➤ Kepler class architecture is major version number 3.
➤ Fermi class architecture is major version number 2.
➤ Tesla class architecture is major version number 1.

The first class of GPUs delivered by NVIDIA carries the same Tesla name as the entire family of Tesla GPU accelerators. All examples in this book require a compute capability of 2.0 or higher. You can query the compute capability of an installed device at run time, as sketched after Table 1-2.

TABLE 1-2: Compute Capabilities of Tesla GPU Computing Products

GPU            COMPUTE CAPABILITY
Tesla K40      3.5
Tesla K20      3.5
Tesla K10      3.0
Tesla C2070    2.0
Tesla C1060    1.3
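A minimal sketch (an illustrative addition, not an example from this chapter) that reports the compute capability of every CUDA device in the system:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor form the compute capability,
        // e.g., 3.5 for a Tesla K40 or 2.0 for a Tesla C2070.
        printf("Device %d (%s): compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}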
Paradigm of Heterogeneous Computing

GPU computing is not meant to replace CPU computing. Each approach has advantages for certain kinds of programs. CPU computing is good for control-intensive tasks, and GPU computing is good for data-parallel, computation-intensive tasks. When CPUs are complemented by GPUs, it makes for a powerful combination. The CPU is optimized for dynamic workloads marked by short sequences of computational operations and unpredictable control flow, while GPUs aim at the other end of the spectrum: workloads that are dominated by computational tasks with simple control flow. As shown in Figure 1-10, there are two dimensions that differentiate the scope of applications for the CPU and the GPU:

➤ Parallelism level
➤ Data size

If a problem has a small data size, sophisticated control logic, and/or a low level of parallelism, the CPU is a good choice because of its ability to handle complex logic and instruction-level parallelism. If the problem at hand instead processes a huge amount of data and exhibits massive data parallelism, the GPU is the right choice because it has a large number of programmable cores, can support massive multithreading, and has a larger peak bandwidth compared to the CPU.

FIGURE 1-10 [Figure: application scope plotted against two axes, data size from small to large and parallelism from low to high; CPU sequential computing occupies the small-data, low-parallelism region, GPU parallel computing the large-data, high-parallelism region, with graphics at the high end]

CPU + GPU heterogeneous parallel computing architectures evolved because the CPU and GPU have complementary attributes that enable applications to perform best using both types of processors. Therefore, for optimal performance you may need to use both the CPU and GPU for your application, executing the sequential parts or task-parallel parts on the CPU and the intensive data-parallel parts on the GPU, as shown in Figure 1-11.
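As a concrete sketch of this division of labor (an illustrative addition, not an example from this chapter), the program below keeps the sequential setup on the CPU and offloads the data-parallel work, an element-wise vector addition, to the GPU:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Data-parallel part: one thread per element, simple control flow.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Sequential part: the CPU prepares the data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Data-parallel part runs on the GPU.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("h_c[100] = %f (expect %f)\n", h_c[100], h_a[100] + h_b[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}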