FIGURE 1-11 (application code partitioned into a compute-intensive portion executed on the GPU and a sequential portion executed on the CPU)

Writing code this way ensures that the characteristics of the GPU and CPU complement each other, leading to full utilization of the computational power of the combined CPU + GPU system. To support joint CPU + GPU execution of an application, NVIDIA designed a programming model called CUDA. This new programming model is the focus for the rest of this book.

CPU THREAD VERSUS GPU THREAD

Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches are slow and expensive.

Threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work. If the GPU must wait on one group of threads, it simply begins executing work on another.

CPU cores are designed to minimize latency for one or two threads at a time, whereas GPU cores are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.

Today, a system with four quad-core CPUs can run only 16 threads concurrently, or 32 if the CPUs support hyper-threading.

Modern NVIDIA GPUs can support up to 1,536 active threads concurrently per multiprocessor. On a GPU with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
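To make the partitioning in Figure 1-11 concrete, the following is a minimal sketch, looking ahead to the CUDA C syntax introduced later in this chapter, of how the compute-intensive, data-parallel portion of a program can be expressed as a GPU kernel while the sequential portion remains ordinary CPU code. The function name scaleArray, the array size, and the launch configuration are illustrative choices, not part of any particular application.

#include <cuda_runtime.h>

// Compute-intensive, data-parallel portion: one GPU thread per element.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Sequential portion: setup, the kernel launch, and cleanup stay on the CPU.
    scaleArray<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}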
CUDA: A Platform for Heterogeneous Computing

CUDA is a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way. Using CUDA, you can access the GPU for computation, as has traditionally been done on the CPU.

The CUDA platform is accessible through CUDA-accelerated libraries, compiler directives, application programming interfaces, and extensions to industry-standard programming languages, including C, C++, Fortran, and Python (as illustrated by Figure 1-12). This book focuses on CUDA C programming.

FIGURE 1-12 (GPU computing applications built on the CUDA platform through libraries and middleware such as CUFFT, CUBLAS, CURAND, CUSPARSE, CULA, MAGMA, Thrust, NPP, PhysX, OptiX, iray, MATLAB, Mathematica, VSIPL, SVM, and OpenCurrent; programming languages including C, C++, Fortran, Python, and DirectCompute; and directives such as OpenACC)

CUDA C is an extension of standard ANSI C with a handful of language extensions to enable heterogeneous programming, and also straightforward APIs to manage devices, memory, and other tasks. CUDA is also a scalable programming model that enables programs to transparently scale their parallelism to GPUs with varying numbers of cores, while maintaining a shallow learning curve for programmers familiar with the C programming language.

CUDA provides two API levels for managing the GPU device and organizing threads, as shown in Figure 1-13:

➤ CUDA Driver API
➤ CUDA Runtime API

The driver API is a low-level API and is relatively hard to program, but it provides more control over how the GPU device is used. The runtime API is a higher-level API implemented on top of the driver API. Each function of the runtime API is broken down into more basic operations issued to the driver API.
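To make the distinction between the two API levels concrete, here is a sketch showing the same task, allocating a buffer of device memory, written once with the runtime API and once with the driver API. The buffer size is arbitrary and error checking is omitted for brevity; a real program would use one level or the other, not both.

#include <cuda.h>          /* driver API */
#include <cuda_runtime.h>  /* runtime API */

void allocWithRuntimeAPI(void)
{
    /* Runtime API: one high-level call; device and context management
       happen implicitly behind the scenes. */
    float *d_buf;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    cudaFree(d_buf);
}

void allocWithDriverAPI(void)
{
    /* Driver API: explicit initialization, device selection, and context
       creation are required before memory can be allocated. */
    CUdevice    dev;
    CUcontext   ctx;
    CUdeviceptr d_buf;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_buf, 1024 * sizeof(float));
    cuMemFree(d_buf);
    cuCtxDestroy(ctx);
}

Note that the driver-API version must also be linked against the CUDA driver library (for example, with -lcuda), whereas nvcc links the runtime library automatically.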
FIGURE 1-13 (applications on the CPU reach the GPU through a stack of CUDA libraries, the CUDA runtime, and the CUDA driver)

RUNTIME API VERSUS DRIVER API

There is no noticeable performance difference between the runtime and driver APIs. How your kernels use memory and how you organize your threads on the device have a much more pronounced effect.

These two APIs are mutually exclusive. You must use one or the other; it is not possible to mix function calls from both. All examples throughout this book use the runtime API.

A CUDA program consists of a mixture of the following two parts:

➤ The host code runs on the CPU.
➤ The device code runs on the GPU.

NVIDIA's CUDA nvcc compiler separates the device code from the host code during the compilation process. As shown in Figure 1-14, the host code is standard C code and is further compiled with C compilers. The device code is written using CUDA C extended with keywords for labeling data-parallel functions, called kernels. The device code is further compiled by nvcc. During the link stage, CUDA runtime libraries are added for kernel procedure calls and explicit GPU device manipulation.
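This flow can also be observed from the command line. The following commands are a sketch, assuming a source file named hello.cu; the file name is illustrative, and the exact set of nvcc options may vary slightly between toolkit releases.

$ nvcc --ptx hello.cu -o hello.ptx    # emit only the device code, as PTX assembly
$ nvcc -o hello hello.cu              # compile host and device code, then link
                                      # against the CUDA runtime automatically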
FIGURE 1-14 (the nvcc compilation trajectory: integrated CPU + GPU code is separated into CPU host code, compiled by a host C compiler, and CUDA assembly for computing (PTX), which runs on the GPU through the CUDA driver and runtime together with the debugger and profiler, with CUDA libraries available to both sides)

The CUDA nvcc compiler is based on the widely used LLVM open source compiler infrastructure. You can create or extend programming languages with support for GPU acceleration using the CUDA Compiler SDK, as shown in Figure 1-15.

FIGURE 1-15 (the LLVM compiler for CUDA takes CUDA C, C++, and Fortran, with new language support possible, and targets NVIDIA GPUs and x86 CPUs, with new processor support possible)

The CUDA platform is also a foundation that supports a diverse parallel computing ecosystem, as shown in Figure 1-16. Today, the CUDA ecosystem is growing rapidly as more and more companies provide world-class tools, services, and solutions. If you want to build your applications on GPUs, the easiest way to harness the performance of GPUs is with the CUDA Toolkit (https://developer.nvidia.com/cuda-toolkit), which provides a comprehensive development environment for C and C++ developers. The CUDA Toolkit includes a compiler, math libraries, and tools for debugging and optimizing the performance of your applications. You will also find code samples, programming guides, user manuals, API references, and other documentation to help you get started.

FIGURE 1-16 (the CUDA platform supports an ecosystem of compiler tool chains, programming languages, libraries, and developer tools)
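Once the toolkit is installed, a quick way to confirm which release you have is to ask the bundled compiler itself; the exact output format varies by release, but it reports the version of the installed CUDA compilation tools.

$ nvcc --version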
HELLO WORLD FROM GPU

The best way to learn a new programming language is by writing programs using the new language. In this section, you are going to write your first kernel code running on the GPU. The first program is the same for all languages: Print the string "Hello World."

If this is your first time working with CUDA, you may want to check that the CUDA compiler is installed properly with the following command on a Linux system:

$ which nvcc

A typical response would be:

/usr/local/cuda/bin/nvcc

You also need to check whether a GPU accelerator card is attached to your machine. You can do so with the following command on a Linux system:

$ ls -l /dev/nv*

A typical response would be:

crw-rw-rw- 1 root root 195,   0 Jul  3 13:44 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jul  3 13:44 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Jul  3 13:44 /dev/nvidiactl
crw-rw---- 1 root root  10, 144 Jul  3 13:39 /dev/nvram
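With the tool chain and device confirmed, the program you are about to write looks roughly like the following minimal sketch, assuming a device that supports printing from device code (compute capability 2.0 or higher); the kernel name helloFromGPU and the launch configuration are illustrative.

#include <stdio.h>

// Kernel definition: the __global__ qualifier marks a function that
// runs on the GPU and is launched from the host.
__global__ void helloFromGPU(void)
{
    printf("Hello World from GPU!\n");
}

int main(void)
{
    // Host code: runs on the CPU.
    printf("Hello World from CPU!\n");

    // Launch the kernel with one block of 10 GPU threads.
    helloFromGPU<<<1, 10>>>();

    // Clean up the device; this also flushes buffered device-side printf output.
    cudaDeviceReset();
    return 0;
}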