Figure 2 Memory Bandwidth for the CPU and GPU (theoretical peak GB/s for GeForce GPUs, Tesla GPUs, and Intel CPUs, 2003-2015)

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.

Figure 3 The GPU Devotes More Transistors to Data Processing (schematic CPU and GPU layouts showing Control, Cache, ALU, and DRAM areas)

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations.
Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
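As an illustration of this mapping (not an excerpt from the later chapters of this guide), the following minimal sketch is a CUDA C kernel in which every data element is handled by its own thread; the kernel name saxpy and the array names are assumptions made here for the example.

// Minimal sketch: each thread handles one array element, so the same
// program runs over many data elements in parallel.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    // Compute this thread's global index from its block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against the last block running past the end of the arrays.
    if (i < n)
        y[i] = a * x[i] + y[i];  // 2 arithmetic operations per 3 memory accesses
}

The flow control is trivial (a single bounds check), and while a multiply-add over three memory accesses is a modest arithmetic intensity, fusing more computation into each element's work raises that ratio and gives the hardware more calculations with which to hide memory latency.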
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model

In November 2006, NVIDIA introduced CUDA®, a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

CUDA comes with a software environment that allows developers to use C as a high-level programming language. As illustrated by Figure 4, other languages, application programming interfaces, or directives-based approaches are supported, such as FORTRAN, DirectCompute, OpenACC.
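As a rough sketch of what that C-based environment looks like on the host side, a launch of the illustrative saxpy kernel from the previous sketch might be configured as follows; the block size of 256, the omitted error checking, and the omitted host-to-device copies are simplifying assumptions made here, not prescriptions from this guide.

#include <cuda_runtime.h>
#include <stdio.h>

// Assumed to be defined in the same .cu file, as in the previous sketch.
__global__ void saxpy(int n, float a, const float* x, float* y);

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate device memory (filling it from host arrays is omitted for brevity).
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

    // Wait for the kernel to finish and release device memory.
    cudaDeviceSynchronize();
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

The <<<...>>> execution configuration is the main syntactic extension visible here; everything else is ordinary C calling into the CUDA runtime.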
Figure 4 GPU Computing Applications (libraries and middleware; programming languages including C, C++, Fortran, Java and Python wrappers, DirectCompute, and directives such as OpenACC; CUDA-enabled NVIDIA GPUs across the Kepler, Maxwell, Pascal, and Volta architectures)

CUDA is designed to support various languages and application programming interfaces.

1.3. A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that are simply exposed to the programmer as a minimal set of language extensions.
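To make those three abstractions concrete before they are developed in Programming Model, here is a minimal sketch with an assumed kernel name and an assumed fixed block size of 256 threads (a power of two, as the reduction loop requires): the built-in threadIdx and blockIdx variables expose the thread-group hierarchy, __shared__ declares memory shared by the threads of one block, and __syncthreads() is the barrier.

// Sketch: each block cooperatively sums its 256-element slice of the input.
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    // Shared memory visible to all threads of this thread block.
    __shared__ float tile[256];

    int tid = threadIdx.x;                          // position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // position within the grid

    // Each thread stages one element (or 0 past the end) into shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;

    // Barrier: wait until every thread in the block has written its element.
    __syncthreads();

    // Tree reduction within the block, synchronizing between steps.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // Thread 0 publishes this block's partial sum.
    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];
}

Each block produces its partial sum independently of every other block, which is exactly the coarse-grained, independently schedulable decomposition described next.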
These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 5, and only the runtime system needs to know the physical multiprocessor count.
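As a small hedged illustration of that last point, a program can query the multiprocessor count through the runtime API (cudaGetDeviceProperties and its multiProcessorCount field), but nothing in the earlier launch sketch depended on it; device 0 is assumed here.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // The number of multiprocessors is a property of the installed GPU.
    // A CUDA program can query it, but it does not need to: the grid size
    // in the launch sketch above was computed from the problem size alone,
    // and the runtime schedules those blocks onto however many SMs exist.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0 has %d multiprocessors\n", prop.multiProcessorCount);
    return 0;
}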
This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).
A GPU is built around an array of Streaming Multiprocessors (SMs) (see Hardware Implementation for more details). A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.

Figure 5 Automatic Scalability (the same eight-block multithreaded CUDA program is shown distributed across a GPU with 2 SMs and a GPU with 4 SMs)

1.4. Document Structure

This document is organized into the following chapters:
‣ Chapter Introduction is a general introduction to CUDA.
‣ Chapter Programming Model outlines the CUDA programming model.
‣ Chapter Programming Interface describes the programming interface.
‣ Chapter Hardware Implementation describes the hardware implementation.
‣ Chapter Performance Guidelines gives some guidance on how to achieve maximum performance.
‣ Appendix CUDA-Enabled GPUs lists all CUDA-enabled devices.
‣ Appendix C Language Extensions is a detailed description of all extensions to the C language.