GPU Parallel Programming: course teaching resources (references): NVIDIA CUDA C Programming Guide (Design Guide, June 2017).pdf

www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1. From Graphics Processing to General Purpose Parallel Computing
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure
Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Compute Capability
Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C/C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA C Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Shared Memory
    3.2.4. Page-Locked Host Memory
      3.2.4.1. Portable Memory
      3.2.4.2. Write-Combining Memory
      3.2.4.3. Mapped Memory
    3.2.5. Asynchronous Concurrent Execution
      3.2.5.1. Concurrent Execution between Host and Device
      3.2.5.2. Concurrent Kernel Execution
      3.2.5.3. Overlap of Data Transfer and Kernel Execution
      3.2.5.4. Concurrent Data Transfers
      3.2.5.5. Streams
      3.2.5.6. Events
      3.2.5.7. Synchronous Calls
    3.2.6. Multi-Device System
      3.2.6.1. Device Enumeration
      3.2.6.2. Device Selection
      3.2.6.3. Stream and Event Behavior
      3.2.6.4. Peer-to-Peer Memory Access
      3.2.6.5. Peer-to-Peer Memory Copy
    3.2.7. Unified Virtual Address Space
    3.2.8. Interprocess Communication
    3.2.9. Error Checking
    3.2.10. Call Stack
    3.2.11. Texture and Surface Memory
      3.2.11.1. Texture Memory
      3.2.11.2. Surface Memory
      3.2.11.3. CUDA Arrays
      3.2.11.4. Read/Write Coherency
    3.2.12. Graphics Interoperability
      3.2.12.1. OpenGL Interoperability
      3.2.12.2. Direct3D Interoperability
      3.2.12.3. SLI Interoperability
  3.3. Versioning and Compatibility
  3.4. Compute Modes
  3.5. Mode Switches
  3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1. SIMT Architecture
  4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1. Overall Performance Optimization Strategies
  5.2. Maximize Utilization
    5.2.1. Application Level
    5.2.2. Device Level
    5.2.3. Multiprocessor Level
      5.2.3.1. Occupancy Calculator
  5.3. Maximize Memory Throughput
    5.3.1. Data Transfer between Host and Device
    5.3.2. Device Memory Accesses
  5.4. Maximize Instruction Throughput
    5.4.1. Arithmetic Instructions
    5.4.2. Control Flow Instructions
    5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
  B.1. Function Type Qualifiers
    B.1.1. __device__
    B.1.2. __global__
    B.1.3. __host__
    B.1.4. __noinline__ and __forceinline__
  B.2. Variable Type Qualifiers
    B.2.1. __device__
    B.2.2. __constant__
    B.2.3. __shared__
    B.2.4. __managed__
    B.2.5. __restrict__
  B.3. Built-in Vector Types
    B.3.1. char, short, int, long, longlong, float, double
    B.3.2. dim3
  B.4. Built-in Variables
    B.4.1. gridDim
    B.4.2. blockIdx
    B.4.3. blockDim
    B.4.4. threadIdx
    B.4.5. warpSize
  B.5. Memory Fence Functions
  B.6. Synchronization Functions
  B.7. Mathematical Functions
  B.8. Texture Functions
    B.8.1. Texture Object API
      B.8.1.1. tex1Dfetch()
      B.8.1.2. tex1D()
      B.8.1.3. tex1DLod()
      B.8.1.4. tex1DGrad()
      B.8.1.5. tex2D()
      B.8.1.6. tex2DLod()
      B.8.1.7. tex2DGrad()
      B.8.1.8. tex3D()
      B.8.1.9. tex3DLod()
      B.8.1.10. tex3DGrad()
      B.8.1.11. tex1DLayered()
      B.8.1.12. tex1DLayeredLod()
      B.8.1.13. tex1DLayeredGrad()
      B.8.1.14. tex2DLayered()
      B.8.1.15. tex2DLayeredLod()
      B.8.1.16. tex2DLayeredGrad()
      B.8.1.17. texCubemap()
      B.8.1.18. texCubemapLod()
      B.8.1.19. texCubemapLayered()
      B.8.1.20. texCubemapLayeredLod()
      B.8.1.21. tex2Dgather()
    B.8.2. Texture Reference API