www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | xi
F.6. Code Samples .............................................. 231
F.6.1. Data Aggregation Class ................................. 231
F.6.2. Derived Class .......................................... 231
F.6.3. Class Template ......................................... 232
F.6.4. Function Template ...................................... 232
F.6.5. Functor Class .......................................... 233
Appendix G. Texture Fetching .................................. 234
G.1. Nearest-Point Sampling ................................... 234
G.2. Linear Filtering ......................................... 235
G.3. Table Lookup ............................................. 236
Appendix H. Compute Capabilities .............................. 238
H.1. Features and Technical Specifications .................... 238
H.2. Floating-Point Standard .................................. 242
H.3. Compute Capability 3.x ................................... 243
H.3.1. Architecture ........................................... 243
H.3.2. Global Memory .......................................... 244
H.3.3. Shared Memory .......................................... 246
H.4. Compute Capability 5.x ................................... 247
H.4.1. Architecture ........................................... 247
H.4.2. Global Memory .......................................... 248
H.4.3. Shared Memory .......................................... 248
H.5. Compute Capability 6.x ................................... 252
H.5.1. Architecture ........................................... 252
H.5.2. Global Memory .......................................... 252
H.5.3. Shared Memory .......................................... 252
H.6. Compute Capability 7.x ................................... 253
H.6.1. Architecture ........................................... 253
H.6.2. Independent Thread Scheduling .......................... 253
H.6.3. Global Memory .......................................... 255
H.6.4. Shared Memory .......................................... 256
Appendix I. Driver API ........................................ 258
I.1. Context .................................................. 261
I.2. Module ................................................... 262
I.3. Kernel Execution ......................................... 263
I.4. Interoperability between Runtime and Driver APIs ......... 265
Appendix J. CUDA Environment Variables ........................ 266
Appendix K. Unified Memory Programming ........................ 269
K.1. Unified Memory Introduction .............................. 269
K.1.1. System Requirements .................................... 270
K.1.2. Simplifying GPU Programming ............................ 270
K.1.3. Data Migration and Coherency ........................... 272
K.1.4. GPU Memory Oversubscription ............................ 272
K.1.5. Multi-GPU .............................................. 273
K.1.6. System Allocator ....................................... 273
K.1.7. Hardware Coherency ..................................... 274
K.1.8. Access Counters ........................................ 275
K.2. Programming Model ........................................ 276
K.2.1. Managed Memory Opt In .................................. 276
K.2.1.1. Explicit Allocation Using cudaMallocManaged() ........ 276
K.2.1.2. Global-Scope Managed Variables Using __managed__ ..... 277
K.2.2. Coherency and Concurrency .............................. 278
K.2.2.1. GPU Exclusive Access To Managed Memory ............... 278
K.2.2.2. Explicit Synchronization and Logical GPU Activity .... 279
K.2.2.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams ... 280
K.2.2.4. Stream Association Examples .......................... 281
K.2.2.5. Stream Attach With Multithreaded Host Programs ....... 282
K.2.2.6. Advanced Topic: Modular Programs and Data Access Constraints ... 283
K.2.2.7. Memcpy()/Memset() Behavior With Managed Memory ....... 284
K.2.3. Language Integration ................................... 284
K.2.3.1. Host Program Errors with __managed__ Variables ....... 285
K.2.4. Querying Unified Memory Support ........................ 286
K.2.4.1. Device Properties .................................... 286
K.2.4.2. Pointer Attributes ................................... 286
K.2.5. Advanced Topics ........................................ 286
K.2.5.1. Managed Memory with Multi-GPU Programs on pre-6.x Architectures ... 286
K.2.5.2. Using fork() with Managed Memory ..................... 287
K.3. Performance Tuning ....................................... 287
K.3.1. Data Prefetching ....................................... 288
K.3.2. Data Usage Hints ....................................... 289
K.3.3. Querying Usage Attributes .............................. 290
LIST OF FIGURES
Figure 1 Floating-Point Operations per Second for the CPU and GPU ... 1
Figure 2 Memory Bandwidth for the CPU and GPU ................. 2
Figure 3 The GPU Devotes More Transistors to Data Processing .. 2
Figure 4 GPU Computing Applications ........................... 4
Figure 5 Automatic Scalability ................................ 6
Figure 6 Grid of Thread Blocks ................................ 10
Figure 7 Memory Hierarchy ..................................... 12
Figure 8 Heterogeneous Programming ............................ 14
Figure 9 Matrix Multiplication without Shared Memory .......... 26
Figure 10 Matrix Multiplication with Shared Memory ............ 29
Figure 11 The Driver API Is Backward but Not Forward Compatible ... 67
Figure 12 Parent-Child Launch Nesting ......................... 158
Figure 13 Nearest-Point Sampling Filtering Mode ............... 235
Figure 14 Linear Filtering Mode ............................... 236
Figure 15 One-Dimensional Table Lookup Using Linear Filtering . 237
Figure 16 Examples of Global Memory Accesses .................. 246
Figure 17 Strided Shared Memory Accesses ...................... 250
Figure 18 Irregular Shared Memory Accesses .................... 251
Figure 19 Library Context Management .......................... 262
LIST OF TABLES
Table 1 Cubemap Fetch ......................................... 51
Table 2 Throughput of Native Arithmetic Instructions .......... 85
Table 3 Alignment Requirements ................................ 97
Table 4 New Device-only Launch Implementation Functions ....... 167
Table 5 Supported API Functions ............................... 167
Table 6 Single-Precision Mathematical Standard Library Functions with Maximum ULP Error ... 176
Table 7 Double-Precision Mathematical Standard Library Functions with Maximum ULP Error ... 180
Table 8 Functions Affected by -use_fast_math .................. 184
Table 9 Single-Precision Floating-Point Intrinsic Functions ... 185
Table 10 Double-Precision Floating-Point Intrinsic Functions .. 186
Table 11 C++11 Language Features .............................. 187
Table 12 C++14 Language Features .............................. 190
Table 13 Feature Support per Compute Capability ............... 238
Table 14 Technical Specifications per Compute Capability ...... 239
Table 15 Objects Available in the CUDA Driver API ............. 258
Table 16 CUDA Environment Variables ........................... 266
Chapter 1. INTRODUCTION

1.1. From Graphics Processing to General Purpose Parallel Computing

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processor Unit (GPU) has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

Figure 1 Floating-Point Operations per Second for the CPU and GPU