www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | vi
B.8.2.1. tex1Dfetch() ..... 107
B.8.2.2. tex1D() ..... 107
B.8.2.3. tex1DLod() ..... 108
B.8.2.4. tex1DGrad() ..... 108
B.8.2.5. tex2D() ..... 108
B.8.2.6. tex2DLod() ..... 108
B.8.2.7. tex2DGrad() ..... 108
B.8.2.8. tex3D() ..... 109
B.8.2.9. tex3DLod() ..... 109
B.8.2.10. tex3DGrad() ..... 109
B.8.2.11. tex1DLayered() ..... 109
B.8.2.12. tex1DLayeredLod() ..... 110
B.8.2.13. tex1DLayeredGrad() ..... 110
B.8.2.14. tex2DLayered() ..... 110
B.8.2.15. tex2DLayeredLod() ..... 110
B.8.2.16. tex2DLayeredGrad() ..... 111
B.8.2.17. texCubemap() ..... 111
B.8.2.18. texCubemapLod() ..... 111
B.8.2.19. texCubemapLayered() ..... 111
B.8.2.20. texCubemapLayeredLod() ..... 111
B.8.2.21. tex2Dgather() ..... 112
B.9. Surface Functions ..... 112
B.9.1. Surface Object API ..... 112
B.9.1.1. surf1Dread() ..... 112
B.9.1.2. surf1Dwrite() ..... 112
B.9.1.3. surf2Dread() ..... 113
B.9.1.4. surf2Dwrite() ..... 113
B.9.1.5. surf3Dread() ..... 113
B.9.1.6. surf3Dwrite() ..... 113
B.9.1.7. surf1DLayeredread() ..... 114
B.9.1.8. surf1DLayeredwrite() ..... 114
B.9.1.9. surf2DLayeredread() ..... 114
B.9.1.10. surf2DLayeredwrite() ..... 114
B.9.1.11. surfCubemapread() ..... 115
B.9.1.12. surfCubemapwrite() ..... 115
B.9.1.13. surfCubemapLayeredread() ..... 115
B.9.1.14. surfCubemapLayeredwrite() ..... 115
B.9.2. Surface Reference API ..... 116
B.9.2.1. surf1Dread() ..... 116
B.9.2.2. surf1Dwrite() ..... 116
B.9.2.3. surf2Dread() ..... 116
B.9.2.4. surf2Dwrite() ..... 116
B.9.2.5. surf3Dread() ..... 117
B.9.2.6. surf3Dwrite() ..... 117
B.9.2.7. surf1DLayeredread() ..... 117
B.9.2.8. surf1DLayeredwrite() ..... 117
B.9.2.9. surf2DLayeredread() ..... 118
B.9.2.10. surf2DLayeredwrite() ..... 118
B.9.2.11. surfCubemapread() ..... 118
B.9.2.12. surfCubemapwrite() ..... 118
B.9.2.13. surfCubemapLayeredread() ..... 119
B.9.2.14. surfCubemapLayeredwrite() ..... 119
B.10. Read-Only Data Cache Load Function ..... 119
B.11. Time Function ..... 119
B.12. Atomic Functions ..... 120
B.12.1. Arithmetic Functions ..... 121
B.12.1.1. atomicAdd() ..... 121
B.12.1.2. atomicSub() ..... 121
B.12.1.3. atomicExch() ..... 122
B.12.1.4. atomicMin() ..... 122
B.12.1.5. atomicMax() ..... 122
B.12.1.6. atomicInc() ..... 122
B.12.1.7. atomicDec() ..... 123
B.12.1.8. atomicCAS() ..... 123
B.12.2. Bitwise Functions ..... 123
B.12.2.1. atomicAnd() ..... 123
B.12.2.2. atomicOr() ..... 123
B.12.2.3. atomicXor() ..... 124
B.13. Warp Vote Functions ..... 124
B.14. Warp Match Functions ..... 125
B.14.1. Synopsis ..... 125
B.14.2. Description ..... 125
B.15. Warp Shuffle Functions ..... 126
B.15.1. Synopsis ..... 126
B.15.2. Description ..... 126
B.15.3. Return Value ..... 127
B.15.4. Notes ..... 128
B.15.5. Examples ..... 128
B.15.5.1. Broadcast of a single value across a warp ..... 128
B.15.5.2. Inclusive plus-scan across sub-partitions of 8 threads ..... 129
B.15.5.3. Reduction across a warp ..... 129
B.16. Warp matrix functions [PREVIEW FEATURE] ..... 129
B.16.1. Description ..... 130
B.16.2. Example ..... 132
B.17. Profiler Counter Function ..... 132
B.18. Assertion ..... 133
B.19. Formatted Output ..... 134
B.19.1. Format Specifiers ..... 134
B.19.2. Limitations ..... 135
B.19.3. Associated Host-Side API ..... 136
B.19.4. Examples ..... 136
B.20. Dynamic Global Memory Allocation and Operations ..... 137
B.20.1. Heap Memory Allocation ..... 138
B.20.2. Interoperability with Host Memory API ..... 138
B.20.3. Examples ..... 138
B.20.3.1. Per Thread Allocation ..... 139
B.20.3.2. Per Thread Block Allocation ..... 140
B.20.3.3. Allocation Persisting Between Kernel Launches ..... 141
B.21. Execution Configuration ..... 142
B.22. Launch Bounds ..... 142
B.23. #pragma unroll ..... 145
B.24. SIMD Video Instructions ..... 145
Appendix C. Cooperative Groups ..... 147
C.1. Introduction ..... 147
C.2. Intra-block Groups ..... 148
C.2.1. Thread Groups and Thread Blocks ..... 148
C.2.2. Tiled Partitions ..... 149
C.2.3. Thread Block Tiles ..... 149
C.2.4. Coalesced Groups ..... 150
C.2.5. Uses of Intra-block Cooperative Groups ..... 150
C.2.5.1. Discovery Pattern ..... 150
C.2.5.2. Warp-Synchronous Code Pattern ..... 151
C.2.5.3. Composition ..... 152
C.3. Grid Synchronization ..... 152
C.4. Multi-Device Synchronization ..... 154
Appendix D. CUDA Dynamic Parallelism ..... 156
D.1. Introduction ..... 156
D.1.1. Overview ..... 156
D.1.2. Glossary ..... 156
D.2. Execution Environment and Memory Model ..... 157
D.2.1. Execution Environment ..... 157
D.2.1.1. Parent and Child Grids ..... 157
D.2.1.2. Scope of CUDA Primitives ..... 158
D.2.1.3. Synchronization ..... 158
D.2.1.4. Streams and Events ..... 158
D.2.1.5. Ordering and Concurrency ..... 159
D.2.1.6. Device Management ..... 159
D.2.2. Memory Model ..... 159
D.2.2.1. Coherence and Consistency ..... 160
D.3. Programming Interface ..... 162
D.3.1. CUDA C/C++ Reference ..... 162
D.3.1.1. Device-Side Kernel Launch ..... 162
D.3.1.2. Streams ..... 163
D.3.1.3. Events ..... 164
D.3.1.4. Synchronization ..... 164
D.3.1.5. Device Management ..... 164
D.3.1.6. Memory Declarations ..... 165
D.3.1.7. API Errors and Launch Failures ..... 166
D.3.1.8. API Reference ..... 167
D.3.2. Device-side Launch from PTX ..... 168
D.3.2.1. Kernel Launch APIs ..... 168
D.3.2.2. Parameter Buffer Layout ..... 170
D.3.3. Toolkit Support for Dynamic Parallelism ..... 170
D.3.3.1. Including Device Runtime API in CUDA Code ..... 170
D.3.3.2. Compiling and Linking ..... 171
D.4. Programming Guidelines ..... 171
D.4.1. Basics ..... 171
D.4.2. Performance ..... 172
D.4.2.1. Synchronization ..... 172
D.4.2.2. Dynamic-parallelism-enabled Kernel Overhead ..... 172
D.4.3. Implementation Restrictions and Limitations ..... 173
D.4.3.1. Runtime ..... 173
Appendix E. Mathematical Functions ..... 176
E.1. Standard Functions ..... 176
E.2. Intrinsic Functions ..... 184
Appendix F. C/C++ Language Support ..... 187
F.1. C++11 Language Features ..... 187
F.2. C++14 Language Features ..... 190
F.3. Restrictions ..... 190
F.3.1. Host Compiler Extensions ..... 190
F.3.2. Preprocessor Symbols ..... 191
F.3.2.1. __CUDA_ARCH__ ..... 191
F.3.3. Qualifiers ..... 192
F.3.3.1. Device Memory Space Specifiers ..... 192
F.3.3.2. __managed__ Memory Space Specifier ..... 193
F.3.3.3. Volatile Qualifier ..... 195
F.3.4. Pointers ..... 196
F.3.5. Operators ..... 196
F.3.5.1. Assignment Operator ..... 196
F.3.5.2. Address Operator ..... 196
F.3.6. Run Time Type Information (RTTI) ..... 196
F.3.7. Exception Handling ..... 196
F.3.8. Standard Library ..... 196
F.3.9. Functions ..... 196
F.3.9.1. External Linkage ..... 197
F.3.9.2. Implicitly-declared and explicitly-defaulted functions ..... 197
F.3.9.3. Function Parameters ..... 198
F.3.9.4. Static Variables within Function ..... 198
F.3.9.5. Function Pointers ..... 199
F.3.9.6. Function Recursion ..... 199
F.3.9.7. Friend Functions ..... 199
F.3.9.8. Operator Function ..... 200
F.3.10. Classes ..... 200
F.3.10.1. Data Members ..... 200
F.3.10.2. Function Members ..... 200
F.3.10.3. Virtual Functions ..... 200
F.3.10.4. Virtual Base Classes ..... 201
F.3.10.5. Anonymous Unions ..... 201
F.3.10.6. Windows-Specific ..... 201
F.3.11. Templates ..... 202
F.3.12. Trigraphs and Digraphs ..... 202
F.3.13. Const-qualified variables ..... 203
F.3.14. Long Double ..... 203
F.3.15. Deprecation Annotation ..... 203
F.3.16. C++11 Features ..... 204
F.3.16.1. Lambda Expressions ..... 204
F.3.16.2. std::initializer_list ..... 205
F.3.16.3. Rvalue references ..... 206
F.3.16.4. Constexpr functions and function templates ..... 206
F.3.16.5. Constexpr variables ..... 206
F.3.16.6. Inline namespaces ..... 207
F.3.16.7. thread_local ..... 208
F.3.16.8. __global__ functions and function templates ..... 208
F.3.16.9. __device__/__constant__/__shared__ variables ..... 210
F.3.16.10. Defaulted functions ..... 210
F.3.17. C++14 Features ..... 211
F.3.17.1. Functions with deduced return type ..... 211
F.3.17.2. Variable templates ..... 212
F.4. Polymorphic Function Wrappers ..... 212
F.5. Experimental Feature: Extended Lambdas ..... 216
F.5.1. Extended Lambda Type Traits ..... 217
F.5.2. Extended Lambda Restrictions ..... 218
F.5.3. Notes on __host__ __device__ lambdas ..... 226
F.5.4. *this Capture By Value ..... 227
F.5.5. Additional Notes ..... 229