《并行与分布式程序设计》课程教学参考书：NVIDIA《CUDA C PROGRAMMING GUIDE》（Design Guide，CHANGES FROM VERSION 9.0）.pdf

www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | iii TABLE OF CONTENTS Chapter 1. Introduction.........................................................................................1 1.1. From Graphics Processing to General Purpose Parallel Computing............................... 1 1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model.............3 1.3. A Scalable Programming Model.........................................................................4 1.4. Document Structure...................................................................................... 6 Chapter 2. Programming Model............................................................................... 8 2.1. Kernels......................................................................................................8 2.2. Thread Hierarchy......................................................................................... 9 2.3. Memory Hierarchy....................................................................................... 11 2.4. Heterogeneous Programming.......................................................................... 13 2.5. Compute Capability..................................................................................... 15 Chapter 3. Programming Interface..........................................................................16 3.1. Compilation with NVCC................................................................................ 16 3.1.1. Compilation Workflow.............................................................................17 3.1.1.1. Offline Compilation.......................................................................... 17 3.1.1.2. Just-in-Time Compilation....................................................................17 3.1.2. Binary Compatibility...............................................................................17 3.1.3. PTX Compatibility..................................................................................18 3.1.4. Application Compatibility.........................................................................18 3.1.5. C/C++ Compatibility............................................................................... 19 3.1.6. 64-Bit Compatibility............................................................................... 19 3.2. CUDA C Runtime.........................................................................................19 3.2.1. Initialization.........................................................................................20 3.2.2. Device Memory..................................................................................... 20 3.2.3. Shared Memory..................................................................................... 24 3.2.4. Page-Locked Host Memory........................................................................29 3.2.4.1. Portable Memory..............................................................................30 3.2.4.2. Write-Combining Memory....................................................................30 3.2.4.3. Mapped Memory...............................................................................30 3.2.5. Asynchronous Concurrent Execution............................................................ 31 3.2.5.1. Concurrent Execution between Host and Device........................................32 3.2.5.2. Concurrent Kernel Execution............................................................... 32 3.2.5.3. Overlap of Data Transfer and Kernel Execution......................................... 32 3.2.5.4. Concurrent Data Transfers.................................................................. 33 3.2.5.5. Streams.........................................................................................33 3.2.5.6. Events...........................................................................................37 3.2.5.7. Synchronous Calls.............................................................................38 3.2.6. Multi-Device System............................................................................... 38 3.2.6.1. Device Enumeration.......................................................................... 38 3.2.6.2. Device Selection.............................................................................. 38

www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | iv 3.2.6.3. Stream and Event Behavior................................................................. 39 3.2.6.4. Peer-to-Peer Memory Access................................................................39 3.2.6.5. Peer-to-Peer Memory Copy..................................................................40 3.2.7. Unified Virtual Address Space................................................................... 41 3.2.8. Interprocess Communication..................................................................... 41 3.2.9. Error Checking......................................................................................42 3.2.10. Call Stack.......................................................................................... 42 3.2.11. Texture and Surface Memory................................................................... 42 3.2.11.1. Texture Memory............................................................................. 43 3.2.11.2. Surface Memory............................................................................. 52 3.2.11.3. CUDA Arrays..................................................................................56 3.2.11.4. Read/Write Coherency..................................................................... 56 3.2.12. Graphics Interoperability........................................................................56 3.2.12.1. OpenGL Interoperability................................................................... 57 3.2.12.2. Direct3D Interoperability...................................................................59 3.2.12.3. SLI Interoperability..........................................................................65 3.3. Versioning and Compatibility.......................................................................... 66 3.4. Compute Modes..........................................................................................67 3.5. Mode Switches........................................................................................... 68 3.6. Tesla Compute Cluster Mode for Windows.......................................................... 68 Chapter 4. Hardware Implementation......................................................................70 4.1. SIMT Architecture....................................................................................... 70 4.2. Hardware Multithreading...............................................................................72 Chapter 5. Performance Guidelines........................................................................ 74 5.1. Overall Performance Optimization Strategies...................................................... 74 5.2. Maximize Utilization.................................................................................... 74 5.2.1. Application Level...................................................................................74 5.2.2. Device Level........................................................................................ 75 5.2.3. Multiprocessor Level...............................................................................75 5.2.3.1. Occupancy Calculator........................................................................ 77 5.3. Maximize Memory Throughput........................................................................ 79 5.3.1. Data Transfer between Host and Device....................................................... 80 5.3.2. Device Memory Accesses..........................................................................81 5.4. Maximize Instruction Throughput.....................................................................85 5.4.1. Arithmetic Instructions............................................................................85 5.4.2. Control Flow Instructions......................................................................... 89 5.4.3. Synchronization Instruction.......................................................................90 Appendix A. CUDA-Enabled GPUs........................................................................... 91 Appendix B. C Language Extensions........................................................................92 B.1. Function Execution Space Specifiers.................................................................92 B.1.1. __device__.......................................................................................... 92 B.1.2. __global__...........................................................................................92 B.1.3. __host__............................................................................................. 93

www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | v B.1.4. __noinline__ and __forceinline__............................................................... 93 B.2. Variable Memory Space Specifiers....................................................................93 B.2.1. __device__.......................................................................................... 94 B.2.2. __constant__........................................................................................94 B.2.3. __shared__.......................................................................................... 94 B.2.4. __managed__....................................................................................... 95 B.2.5. __restrict__......................................................................................... 95 B.3. Built-in Vector Types................................................................................... 97 B.3.1. char, short, int, long, longlong, float, double................................................ 97 B.3.2. dim3..................................................................................................98 B.4. Built-in Variables........................................................................................ 98 B.4.1. gridDim.............................................................................................. 98 B.4.2. blockIdx..............................................................................................98 B.4.3. blockDim.............................................................................................98 B.4.4. threadIdx............................................................................................ 99 B.4.5. warpSize............................................................................................. 99 B.5. Memory Fence Functions...............................................................................99 B.6. Synchronization Functions............................................................................102 B.7. Mathematical Functions...............................................................................103 B.8. Texture Functions......................................................................................103 B.8.1. Texture Object API...............................................................................103 B.8.1.1. tex1Dfetch()..................................................................................103 B.8.1.2. tex1D()........................................................................................ 103 B.8.1.3. tex1DLod()....................................................................................103 B.8.1.4. tex1DGrad().................................................................................. 104 B.8.1.5. tex2D()........................................................................................ 104 B.8.1.6. tex2DLod()....................................................................................104 B.8.1.7. tex2DGrad().................................................................................. 104 B.8.1.8. tex3D()........................................................................................ 104 B.8.1.9. tex3DLod()....................................................................................104 B.8.1.10. tex3DGrad().................................................................................105 B.8.1.11. tex1DLayered()............................................................................. 105 B.8.1.12. tex1DLayeredLod().........................................................................105 B.8.1.13. tex1DLayeredGrad()....................................................................... 105 B.8.1.14. tex2DLayered()............................................................................. 105 B.8.1.15. tex2DLayeredLod().........................................................................105 B.8.1.16. tex2DLayeredGrad()....................................................................... 106 B.8.1.17. texCubemap().............................................................................. 106 B.8.1.18. texCubemapLod().......................................................................... 106 B.8.1.19. texCubemapLayered().....................................................................106 B.8.1.20. texCubemapLayeredLod()................................................................ 106 B.8.1.21. tex2Dgather()...............................................................................106 B.8.2. Texture Reference API...........................................................................107