NVIDIA
CUDA C PROGRAMMING GUIDE PG-02829-001_v8.0 | June 2017 Design Guide
CHANGES FROM VERSION 7.5
‣ Updates to add compute capabilities 6.0, 6.1 and 6.2, including:
  ‣ Updated Table 13 to mention support of 64-bit floating point atomicAdd on devices of compute capabilities 6.x.
  ‣ Added compute capabilities 6.0, 6.1, and 6.2 to Table 14.
  ‣ Added atomicAdd() support of 64-bit floating point for compute capability 6.x to atomicAdd().
  ‣ Added scoped atomics support for compute capability 6.x to Atomic Functions.
‣ Updated section Unified Memory Programming:
  ‣ Documented new features and behavior of architectures with compute capabilities 6.x.
  ‣ Added new section Performance Tuning.
TABLE OF CONTENTS
Chapter 1. Introduction
  1.1. From Graphics Processing to General Purpose Parallel Computing
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure
Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Compute Capability
Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C/C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA C Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Shared Memory
    3.2.4. Page-Locked Host Memory
      3.2.4.1. Portable Memory
      3.2.4.2. Write-Combining Memory
      3.2.4.3. Mapped Memory
    3.2.5. Asynchronous Concurrent Execution
      3.2.5.1. Concurrent Execution between Host and Device
      3.2.5.2. Concurrent Kernel Execution
      3.2.5.3. Overlap of Data Transfer and Kernel Execution
      3.2.5.4. Concurrent Data Transfers
      3.2.5.5. Streams
      3.2.5.6. Events
      3.2.5.7. Synchronous Calls
    3.2.6. Multi-Device System
      3.2.6.1. Device Enumeration
      3.2.6.2. Device Selection
      3.2.6.3. Stream and Event Behavior
      3.2.6.4. Peer-to-Peer Memory Access
      3.2.6.5. Peer-to-Peer Memory Copy
    3.2.7. Unified Virtual Address Space
    3.2.8. Interprocess Communication
    3.2.9. Error Checking
    3.2.10. Call Stack
    3.2.11. Texture and Surface Memory
      3.2.11.1. Texture Memory
      3.2.11.2. Surface Memory
      3.2.11.3. CUDA Arrays
      3.2.11.4. Read/Write Coherency
    3.2.12. Graphics Interoperability
      3.2.12.1. OpenGL Interoperability
      3.2.12.2. Direct3D Interoperability
      3.2.12.3. SLI Interoperability
  3.3. Versioning and Compatibility
  3.4. Compute Modes
  3.5. Mode Switches
  3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1. SIMT Architecture
  4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1. Overall Performance Optimization Strategies
  5.2. Maximize Utilization
    5.2.1. Application Level
    5.2.2. Device Level
    5.2.3. Multiprocessor Level
      5.2.3.1. Occupancy Calculator
  5.3. Maximize Memory Throughput
    5.3.1. Data Transfer between Host and Device
    5.3.2. Device Memory Accesses
  5.4. Maximize Instruction Throughput
    5.4.1. Arithmetic Instructions
    5.4.2. Control Flow Instructions
    5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
  B.1. Function Type Qualifiers
    B.1.1. __device__
    B.1.2. __global__
    B.1.3. __host__
    B.1.4. __noinline__ and __forceinline__
  B.2. Variable Type Qualifiers
    B.2.1. __device__
    B.2.2. __constant__
    B.2.3. __shared__
    B.2.4. __managed__
    B.2.5. __restrict__
  B.3. Built-in Vector Types
    B.3.1. char, short, int, long, longlong, float, double
    B.3.2. dim3
  B.4. Built-in Variables
    B.4.1. gridDim
    B.4.2. blockIdx
    B.4.3. blockDim
    B.4.4. threadIdx
    B.4.5. warpSize
  B.5. Memory Fence Functions
  B.6. Synchronization Functions
  B.7. Mathematical Functions
  B.8. Texture Functions
    B.8.1. Texture Object API
      B.8.1.1. tex1Dfetch()
      B.8.1.2. tex1D()
      B.8.1.3. tex1DLod()
      B.8.1.4. tex1DGrad()
      B.8.1.5. tex2D()
      B.8.1.6. tex2DLod()
      B.8.1.7. tex2DGrad()
      B.8.1.8. tex3D()
      B.8.1.9. tex3DLod()
      B.8.1.10. tex3DGrad()
      B.8.1.11. tex1DLayered()
      B.8.1.12. tex1DLayeredLod()
      B.8.1.13. tex1DLayeredGrad()
      B.8.1.14. tex2DLayered()
      B.8.1.15. tex2DLayeredLod()
      B.8.1.16. tex2DLayeredGrad()
      B.8.1.17. texCubemap()
      B.8.1.18. texCubemapLod()
      B.8.1.19. texCubemapLayered()
      B.8.1.20. texCubemapLayeredLod()
      B.8.1.21. tex2Dgather()
    B.8.2. Texture Reference API