an overview of nvcc workflow and command options. A complete description can be found in the nvcc user manual.

3.1.1. Compilation Workflow

3.1.1.1. Offline Compilation

Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device). nvcc's basic workflow consists of separating device code from host code and then:
‣ compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),
‣ and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more detail in Execution Configuration) by the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.

The modified host code is output either as C code that is left to be compiled using another tool or as object code directly by letting nvcc invoke the host compiler during the last compilation stage.

Applications can then:
‣ Either link to the compiled host code (this is the most common case),
‣ Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.

3.1.1.2. Just-in-Time Compilation

Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It is also the only way for applications to run on devices that did not exist at the time the application was compiled, as detailed in Application Compatibility.

When the device driver just-in-time compiles some PTX code for some application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache, referred to as the compute cache, is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.

Environment variables are available to control just-in-time compilation as described in CUDA Environment Variables.

3.1.2. Binary Compatibility

Binary code is architecture-specific. A cubin object is generated using the compiler option -code that specifies the targeted architecture: for example, compiling with -code=sm_35 produces binary code for devices of compute capability 3.5.
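As an illustration (a hypothetical invocation, not taken from the manual), an application built only for devices of compute capability 3.5 might be compiled with:

    nvcc x.cu -arch=compute_35 -code=sm_35 -o app

where -arch selects the virtual architecture used to generate PTX and -code selects the real architecture for which binary code is embedded; these options are discussed further in PTX Compatibility and Application Compatibility below.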
Binary compatibility is guaranteed from one minor revision to the next one, but not from one minor revision to the previous one or across major revisions. In other words, a cubin object generated for compute capability X.y will only execute on devices of compute capability X.z where z ≥ y.

3.1.3. PTX Compatibility

Some PTX instructions are only supported on devices of higher compute capabilities. For example, warp shuffle instructions are only supported on devices of compute capability 3.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for example, must be compiled with -arch=sm_30 (or higher).

PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability.

3.1.4. Application Compatibility

To execute code on devices of specific compute capability, an application must load binary or PTX code that is compatible with this compute capability as described in Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on future architectures with higher compute capability (for which no binary code can be generated yet), an application must load PTX code that will be just-in-time compiled for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C application is controlled by the -arch and -code compiler options or the -gencode compiler option as detailed in the nvcc user manual. For example,

    nvcc x.cu -gencode arch=compute_20,code=sm_20
              -gencode arch=compute_30,code=sm_30
              -gencode arch=compute_35,code=\'compute_35,sm_35\'

embeds binary code compatible with compute capability 2.0 and 3.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 3.5 (third -gencode option).

Host code is generated to automatically select at runtime the most appropriate code to load and execute, which, in the above example, will be:
‣ 2.0 binary code for devices with compute capability 2.0 and 2.1,
‣ 3.0 binary code for devices with compute capability 3.0,
‣ 3.5 binary code for devices with compute capability 3.5 and 3.7,
‣ PTX code which is compiled to binary code at runtime for devices with compute capability 5.0 and higher.

x.cu can have an optimized code path that uses warp shuffle operations, for example, which are only supported in devices of compute capability 3.0 and higher. The __CUDA_ARCH__ macro can be used to differentiate various code paths based on compute capability. It is only defined for device code. When compiling with -arch=compute_35, for example, __CUDA_ARCH__ is equal to 350.
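For instance, a minimal sketch of such an architecture-dependent code path, using a hypothetical device function that is not taken from the guide, could look like:

    // Hypothetical device helper: sums a value across the threads of a warp.
    __device__ int warpSum(int v)
    {
    #if __CUDA_ARCH__ >= 300
        // Compute capability 3.0 and higher: use warp shuffle instructions.
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down(v, offset);
    #else
        // Lower compute capabilities would need a shared-memory fallback (omitted here).
    #endif
        return v;
    }

When this file is compiled with -arch=compute_35, the shuffle path is selected because __CUDA_ARCH__ evaluates to 350.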
Applications using the driver API must compile code to separate files and explicitly load and execute the most appropriate file at runtime.

The nvcc user manual lists various shorthands for the -arch, -code, and -gencode compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 -code=compute_35,sm_35 (which is the same as -gencode arch=compute_35,code=\'compute_35,sm_35\').

3.1.5. C/C++ Compatibility

The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in C/C++ Language Support.

3.1.6. 64-Bit Compatibility

The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-bit). Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode.

Similarly, the 32-bit version of nvcc compiles device code in 32-bit mode, and device code compiled in 32-bit mode is only supported with host code compiled in 32-bit mode.

The 32-bit version of nvcc can also compile device code in 64-bit mode using the -m64 compiler option.

The 64-bit version of nvcc can also compile device code in 32-bit mode using the -m32 compiler option.

3.2. CUDA C Runtime

The runtime is implemented in the cudart library, which is linked to the application, either statically via cudart.lib or libcudart.a, or dynamically via cudart.dll or libcudart.so. Applications that require cudart.dll and/or cudart.so for dynamic linking typically include them as part of the application installation package.

All its entry points are prefixed with cuda.

As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Device Memory gives an overview of the runtime functions used to manage device memory.

Shared Memory illustrates the use of shared memory, introduced in Thread Hierarchy, to maximize performance.

Page-Locked Host Memory introduces page-locked host memory that is required to overlap kernel execution with data transfers between host and device memory.

Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system.
Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host.

Error Checking describes how to properly check the errors generated by the runtime.

Call Stack mentions the runtime functions used to manage the CUDA C call stack.

Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware.

Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D.

3.2.1. Initialization

There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically, any function other than functions from the device and version management sections of the reference manual). One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime.

During initialization, the runtime creates a CUDA context for each device in the system (see Context for more details on CUDA contexts). This context is the primary context for this device and it is shared among all the host threads of the application. As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory. This all happens under the hood and the runtime does not expose the primary context to the application.

When a host thread calls cudaDeviceReset(), this destroys the primary context of the device the host thread currently operates on (i.e., the current device as defined in Device Selection). The next runtime function call made by any host thread that has this device as current will create a new primary context for this device.
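As a minimal illustration (a common pattern, not one prescribed by this manual), an application that wants to pay the initialization cost up front and check its error code explicitly can issue a harmless runtime call before any timed work:

    // Sketch: cudaFree(0) is a no-op memory-management call, so it lies outside
    // the device and version management sections and therefore triggers runtime
    // initialization (primary context creation) the first time it is called.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        printf("CUDA runtime initialization failed: %s\n", cudaGetErrorString(err));
    }
    // Subsequent runtime calls and kernel launches no longer include
    // initialization overhead in their timings.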
3.2.2. Device Memory

As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Kernels operate out of device memory, so the runtime provides functions to allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory.

Device memory can be allocated either as linear memory or as CUDA arrays.

CUDA arrays are opaque memory layouts optimized for texture fetching. They are described in Texture and Surface Memory.

Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree.

Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are typically done using cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be copied from host memory to device memory:

    // Device code
    __global__ void VecAdd(float* A, float* B, float* C, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // Host code
    int main()
    {
        int N = ...;
        size_t size = N * sizeof(float);

        // Allocate input vectors h_A and h_B and output vector h_C in host memory
        float* h_A = (float*)malloc(size);
        float* h_B = (float*)malloc(size);
        float* h_C = (float*)malloc(size);

        // Initialize input vectors
        ...

        // Allocate vectors in device memory
        float* d_A;
        cudaMalloc(&d_A, size);
        float* d_B;
        cudaMalloc(&d_B, size);
        float* d_C;
        cudaMalloc(&d_C, size);

        // Copy vectors from host memory to device memory
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Invoke kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

        // Copy result from device memory to host memory
        // h_C contains the result in host memory
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);

        // Free host memory
        ...
    }

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as they make sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements.
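The following is a minimal sketch, assuming a hypothetical kernel name MyKernel and arbitrary 64 x 64 dimensions, of how pitched memory might be allocated and traversed using the returned pitch:

    // Device code: consecutive rows are separated by 'pitch' bytes,
    // not by width * sizeof(float)
    __global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
    {
        for (int r = 0; r < height; ++r) {
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];   // read one element of the 2D array
            }
        }
    }

    // Host code
    int main()
    {
        int width = 64, height = 64;
        float* devPtr;
        size_t pitch;
        cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
        MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
        cudaDeviceSynchronize();
        cudaFree(devPtr);
        return 0;
    }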