www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | xi
F.6. Code Samples .............................................. 231
F.6.1. Data Aggregation Class ................................. 231
F.6.2. Derived Class .......................................... 231
F.6.3. Class Template ......................................... 232
F.6.4. Function Template ...................................... 232
F.6.5. Functor Class .......................................... 233
Appendix G. Texture Fetching .................................. 234
G.1. Nearest-Point Sampling ................................... 234
G.2. Linear Filtering ......................................... 235
G.3. Table Lookup ............................................. 236
Appendix H. Compute Capabilities .............................. 238
H.1. Features and Technical Specifications .................... 238
H.2. Floating-Point Standard .................................. 242
H.3. Compute Capability 3.x ................................... 243
H.3.1. Architecture ........................................... 243
H.3.2. Global Memory .......................................... 244
H.3.3. Shared Memory .......................................... 246
H.4. Compute Capability 5.x ................................... 247
H.4.1. Architecture ........................................... 247
H.4.2. Global Memory .......................................... 248
H.4.3. Shared Memory .......................................... 248
H.5. Compute Capability 6.x ................................... 252
H.5.1. Architecture ........................................... 252
H.5.2. Global Memory .......................................... 252
H.5.3. Shared Memory .......................................... 252
H.6. Compute Capability 7.x ................................... 253
H.6.1. Architecture ........................................... 253
H.6.2. Independent Thread Scheduling .......................... 253
H.6.3. Global Memory .......................................... 255
H.6.4. Shared Memory .......................................... 256
Appendix I. Driver API ........................................ 258
I.1. Context .................................................. 261
I.2. Module ................................................... 262
I.3. Kernel Execution ......................................... 263
I.4. Interoperability between Runtime and Driver APIs ......... 265
Appendix J. CUDA Environment Variables ........................ 266
Appendix K. Unified Memory Programming ........................ 269
K.1. Unified Memory Introduction .............................. 269
K.1.1. System Requirements .................................... 270
K.1.2. Simplifying GPU Programming ............................ 270
K.1.3. Data Migration and Coherency ........................... 272
K.1.4. GPU Memory Oversubscription ............................ 272
K.1.5. Multi-GPU .............................................. 273
K.1.6. System Allocator ....................................... 273
K.1.7. Hardware Coherency ..................................... 274
K.1.8. Access Counters ........................................ 275
K.2. Programming Model ........................................ 276
K.2.1. Managed Memory Opt In .................................. 276
K.2.1.1. Explicit Allocation Using cudaMallocManaged() ........ 276
K.2.1.2. Global-Scope Managed Variables Using __managed__ ..... 277
K.2.2. Coherency and Concurrency .............................. 278
K.2.2.1. GPU Exclusive Access To Managed Memory ............... 278
K.2.2.2. Explicit Synchronization and Logical GPU Activity .... 279
K.2.2.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams ... 280
K.2.2.4. Stream Association Examples .......................... 281
K.2.2.5. Stream Attach With Multithreaded Host Programs ....... 282
K.2.2.6. Advanced Topic: Modular Programs and Data Access Constraints ... 283
K.2.2.7. Memcpy()/Memset() Behavior With Managed Memory ....... 284
K.2.3. Language Integration ................................... 284
K.2.3.1. Host Program Errors with __managed__ Variables ....... 285
K.2.4. Querying Unified Memory Support ........................ 286
K.2.4.1. Device Properties .................................... 286
K.2.4.2. Pointer Attributes ................................... 286
K.2.5. Advanced Topics ........................................ 286
K.2.5.1. Managed Memory with Multi-GPU Programs on pre-6.x Architectures ... 286
K.2.5.2. Using fork() with Managed Memory ..................... 287
K.3. Performance Tuning ....................................... 287
K.3.1. Data Prefetching ....................................... 288
K.3.2. Data Usage Hints ....................................... 289
K.3.3. Querying Usage Attributes .............................. 290
LIST OF FIGURES
Figure 1 Floating-Point Operations per Second for the CPU and GPU ... 1
Figure 2 Memory Bandwidth for the CPU and GPU ................. 2
Figure 3 The GPU Devotes More Transistors to Data Processing .. 2
Figure 4 GPU Computing Applications ........................... 4
Figure 5 Automatic Scalability ................................ 6
Figure 6 Grid of Thread Blocks ................................ 10
Figure 7 Memory Hierarchy ..................................... 12
Figure 8 Heterogeneous Programming ............................ 14
Figure 9 Matrix Multiplication without Shared Memory .......... 26
Figure 10 Matrix Multiplication with Shared Memory ............ 29
Figure 11 The Driver API Is Backward but Not Forward Compatible ... 67
Figure 12 Parent-Child Launch Nesting ......................... 158
Figure 13 Nearest-Point Sampling Filtering Mode ............... 235
Figure 14 Linear Filtering Mode ............................... 236
Figure 15 One-Dimensional Table Lookup Using Linear Filtering . 237
Figure 16 Examples of Global Memory Accesses .................. 246
Figure 17 Strided Shared Memory Accesses ...................... 250
Figure 18 Irregular Shared Memory Accesses .................... 251
Figure 19 Library Context Management .......................... 262
LIST OF TABLES
Table 1 Cubemap Fetch ......................................... 51
Table 2 Throughput of Native Arithmetic Instructions .......... 85
Table 3 Alignment Requirements ................................ 97
Table 4 New Device-only Launch Implementation Functions ....... 167
Table 5 Supported API Functions ............................... 167
Table 6 Single-Precision Mathematical Standard Library Functions with Maximum ULP Error ... 176
Table 7 Double-Precision Mathematical Standard Library Functions with Maximum ULP Error ... 180
Table 8 Functions Affected by -use_fast_math .................. 184
Table 9 Single-Precision Floating-Point Intrinsic Functions ... 185
Table 10 Double-Precision Floating-Point Intrinsic Functions .. 186
Table 11 C++11 Language Features .............................. 187
Table 12 C++14 Language Features .............................. 190
Table 13 Feature Support per Compute Capability ............... 238
Table 14 Technical Specifications per Compute Capability ...... 239
Table 15 Objects Available in the CUDA Driver API ............. 258
Table 16 CUDA Environment Variables ........................... 266
Chapter 1. INTRODUCTION

1.1. From Graphics Processing to General Purpose Parallel Computing

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processor Unit (GPU) has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

Figure 1 Floating-Point Operations per Second for the CPU and GPU