B.8.2.1.tex1Dfetch(). .104 B.8.2.2.tex1D0 104 B.8.2.3.tex1DLod(0.… .105 B.8.2.4.tex1DGrad()............... 105 B.8.2.5.tex2D0. 105 B.8.2.6.tex2DLod0.… .105 B.8.2.7.tex2DGrad().......... ...105 B.8.2.8.tex3D0.… …106 B.8.2.9.tex3DLod0… .106 B.8.2.10.tex3DGrad()............ .106 B.8.2.11.tex1DLayered()......... .......106 B.8.2.12.tex1DLayeredLod().. .107 B.8.2.13.tex1DLayeredGrad()..... .107 B.8.2.14.tex2DLayered()...... ,107 B.8.2.15.tex2DLayeredLod()...... .107 B.8.2.16.tex2DLayeredGrad()... 108 B.8.2.17.texCubemap().............. 108 B.8.2.18.texCubemapLod()..... 108 B.8.2.19.texCubemapLayered()............ 108 B.8.2.20.texCubemapLayeredLod(). …108 B.8.2.21.tex2Dgather()............... .109 B.9.Surface Functions............ ..109 B.9.1.Surface object APl............... 109 B.9.1.1.surf1Dread()................ .109 B.9.1.2.surf1Dwrite............. .109 B.9.1.3.surf2Dread()............. .110 B.9.1.4.surf2Dwrite()............ ,110 B.9.1.5.surf3Dread()............ .110 B.9.1.6.surf3Dwrite().... .110 B.9.1.7.surf1DLayeredread().... .111 B.9.1.8.surf1DLayeredwrite()... 111 B.9.1.9.surf2DLayeredread()..... 111 B.9.1.10.surf2DLayeredwrite() 111 B.9.1.11.surfCubemapread()......... .112 B.9.1.12.surfCubemapwrite().... 112 B.9.1.13.surfCubemapLayeredread() .112 B.9.1.14.surfCubemapLayeredwrite() ..112 B.9.2.Surface Reference APl.... .113 B.9.2.1.surf1Dread().......... ..113 B.9.2.2.surf1Dwrite............. …113 B.9.2.3.surf2Dread().......... .113 B.9.2.4.surf2Dwrite()....... .113 B.9.2.5.surf3Dread()....... ...114 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | vi B.8.2.1. tex1Dfetch()..................................................................................104 B.8.2.2. tex1D()........................................................................................ 104 B.8.2.3. tex1DLod()....................................................................................105 B.8.2.4. tex1DGrad().................................................................................. 105 B.8.2.5. tex2D()........................................................................................ 105 B.8.2.6. tex2DLod()....................................................................................105 B.8.2.7. tex2DGrad().................................................................................. 105 B.8.2.8. tex3D()........................................................................................ 106 B.8.2.9. tex3DLod()....................................................................................106 B.8.2.10. tex3DGrad().................................................................................106 B.8.2.11. tex1DLayered()............................................................................. 106 B.8.2.12. tex1DLayeredLod().........................................................................107 B.8.2.13. tex1DLayeredGrad()....................................................................... 107 B.8.2.14. tex2DLayered()............................................................................. 107 B.8.2.15. tex2DLayeredLod().........................................................................107 B.8.2.16. tex2DLayeredGrad()....................................................................... 108 B.8.2.17. texCubemap().............................................................................. 108 B.8.2.18. texCubemapLod().......................................................................... 108 B.8.2.19. texCubemapLayered().....................................................................108 B.8.2.20. texCubemapLayeredLod()................................................................ 108 B.8.2.21. tex2Dgather()...............................................................................109 B.9. Surface Functions...................................................................................... 109 B.9.1. Surface Object API............................................................................... 109 B.9.1.1. surf1Dread()..................................................................................109 B.9.1.2. surf1Dwrite................................................................................... 109 B.9.1.3. surf2Dread()..................................................................................110 B.9.1.4. surf2Dwrite()................................................................................. 110 B.9.1.5. surf3Dread()..................................................................................110 B.9.1.6. surf3Dwrite()................................................................................. 110 B.9.1.7. surf1DLayeredread()........................................................................ 111 B.9.1.8. surf1DLayeredwrite()....................................................................... 111 B.9.1.9. surf2DLayeredread()........................................................................ 111 B.9.1.10. surf2DLayeredwrite()......................................................................111 B.9.1.11. surfCubemapread()........................................................................ 112 B.9.1.12. surfCubemapwrite()....................................................................... 112 B.9.1.13. surfCubemapLayeredread()...............................................................112 B.9.1.14. surfCubemapLayeredwrite()..............................................................112 B.9.2. Surface Reference API........................................................................... 113 B.9.2.1. surf1Dread()..................................................................................113 B.9.2.2. surf1Dwrite................................................................................... 113 B.9.2.3. surf2Dread()..................................................................................113 B.9.2.4. surf2Dwrite()................................................................................. 113 B.9.2.5. surf3Dread()..................................................................................114
B.9.2.6.surf3Dwrite()..... ,114 B.9.2.7.surf1DLayeredread().............. 114 B.9.2.8.surf1DLayeredwrite().... .114 B.9.2.9.surf2DLayeredread()............ .115 B.9.2.10.surf2DLayeredwrite(). .115 B.9.2.11.surfCubemapread()............ .115 B.9.2.12.surfCubemapwrite().......... .…115 B.9.2.13.surfCubemapLayeredread().. .116 B.9.2.14.surfCubemapLayeredwrite(). .116 B.10.Read-Only Data Cache Load Function. .116 B.11.Time Function.......................... ..116 B.12.Atomic Functions........... ...117 B.12.1.Arithmetic Functions......... 118 B.12.1.1.atomicAdd().... ..118 B.12.1.2.atomicSub()............ .118 B.12.1.3.atomicExch()... .119 B.12.1.4.atomicMin()............. .119 B.12.1.5.atomicMax()... .119 B.12.1.6.atomiclnc().............. .119 B.12.1.7.atomicDec()... .120 B.12.1.8.atomiccAS().......... ….120 B.12.2.Bitwise Functions.... 120 B.12.2.1.atomicAnd().......... .120 B.12.2.2.atomic0r0.. 120 B.12.2.3.atomicXor()............ 121 B.13.Warp Vote Functions............ 121 B.14.Warp Shuffle Functions.............. 122 B.14.1.Synopsis..… .122 B.14.2.Description.... 122 B.14.3.Return Value.............. 123 B.14.4.Notes....................... 123 B.14.5.Example5........ .124 B.14.5.1.Broadcast of a single value across a warp........ 124 B.14.5.2.Inclusive plus-scan across sub-partitions of 8 threads.............................124 B.14.5.3.Reduction across a warp.............. 125 B.15.Profiler Counter Function.............................. .125 B.16.Assertion.................. 125 B.17.Formatted Output.............. .126 B.17.1.Format Specifiers............ 127 B.17.2.Limitations........ 127 B.17.3.Associated Host-Side APl.................... .128 B.17.4.Example5..129 B.18.Dynamic Global Memory Allocation and Operations........................130 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0|vi
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | vii B.9.2.6. surf3Dwrite()................................................................................. 114 B.9.2.7. surf1DLayeredread()........................................................................ 114 B.9.2.8. surf1DLayeredwrite()....................................................................... 114 B.9.2.9. surf2DLayeredread()........................................................................ 115 B.9.2.10. surf2DLayeredwrite()......................................................................115 B.9.2.11. surfCubemapread()........................................................................ 115 B.9.2.12. surfCubemapwrite()....................................................................... 115 B.9.2.13. surfCubemapLayeredread()...............................................................116 B.9.2.14. surfCubemapLayeredwrite()..............................................................116 B.10. Read-Only Data Cache Load Function.............................................................116 B.11. Time Function.........................................................................................116 B.12. Atomic Functions..................................................................................... 117 B.12.1. Arithmetic Functions........................................................................... 118 B.12.1.1. atomicAdd().................................................................................118 B.12.1.2. atomicSub()................................................................................. 118 B.12.1.3. atomicExch()................................................................................119 B.12.1.4. atomicMin()................................................................................. 119 B.12.1.5. atomicMax().................................................................................119 B.12.1.6. atomicInc()..................................................................................119 B.12.1.7. atomicDec().................................................................................120 B.12.1.8. atomicCAS().................................................................................120 B.12.2. Bitwise Functions............................................................................... 120 B.12.2.1. atomicAnd().................................................................................120 B.12.2.2. atomicOr().................................................................................. 120 B.12.2.3. atomicXor()................................................................................. 121 B.13. Warp Vote Functions................................................................................. 121 B.14. Warp Shuffle Functions..............................................................................122 B.14.1. Synopsis........................................................................................... 122 B.14.2. Description....................................................................................... 122 B.14.3. Return Value..................................................................................... 123 B.14.4. Notes.............................................................................................. 123 B.14.5. Examples..........................................................................................124 B.14.5.1. Broadcast of a single value across a warp............................................ 124 B.14.5.2. Inclusive plus-scan across sub-partitions of 8 threads............................... 124 B.14.5.3. Reduction across a warp................................................................. 125 B.15. Profiler Counter Function........................................................................... 125 B.16. Assertion............................................................................................... 125 B.17. Formatted Output.................................................................................... 126 B.17.1. Format Specifiers............................................................................... 127 B.17.2. Limitations....................................................................................... 127 B.17.3. Associated Host-Side API.......................................................................128 B.17.4. Examples..........................................................................................129 B.18. Dynamic Global Memory Allocation and Operations............................................ 130
B.18.1.Heap Memory Allocation...... 130 B.18.2.Interoperability with Host Memory APl.......................131 B.18.3.Examples..... .131 B.18.3.1.Per Thread Allocation.........131 B.18.3.2.Per Thread Block Allocation........ …132 B.18.3.3.Allocation Persisting Between Kernel Launches........................133 B.19.Execution Configuration.........134 B.20.Launch Bounds............. …134 B.21.#pragma unroul............. .137 B.22.SIMD Video Instructions...................... .137 Appendix C.CUDA Dynamic Parallelism.............. …139 C.1.Introduction..… .139 C.1.1.0 verview...... .139 C.1.2.Glossary.… 139 C.2.Execution Environment and Memory Model. 140 C.2.1.Execution Environment............. ..140 C.2.1.1.Parent and Child Grids............... .140 C.2.1.2.Scope of CUDA Primitives..... .141 C.2.1.3.Synchronization........ .141 C.2.1.4.Streams and Events......... .141 C.2.1.5.Ordering and Concurrency....... .142 C.2.1.6.Device Management....... .142 C.2.2.Memory Model...… 142 C.2.2.1.Coherence and Consistency.......... .143 C.3.Programming Interface................... .145 C.3.1.CUDA C/C++Reference............... ...145 C.3.1.1.Device-Side Kernel Launch... 145 C.3.1.2.Streams..… .146 C.3.1.3.Events..… ...147 C.3.1.4.Synchronization.............. .147 C.3.1.5.Device Management............. 147 C.3.1.6.Memory Declarations............... …148 C.3.1.7.API Errors and Launch Failures..... 149 C.3.1.8.API Reference........ .150 C.3.2.Device-side Launch from PTX............. 151 C.3.2.1.Kernel Launch APIs................... 151 C.3.2.2.Parameter Buffer Layout.............. 153 C.3.3.Toolkit Support for Dynamic Parallelism.......... …153 C.3.3.1.Including Device Runtime API in CUDA Code. .153 C.3.3.2.Compiling and Linking.............. 154 C.4.Programming Guidelines................ 154 C.4.1.Basic5.… …….154 C.4.2.Performance............. ......155 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01iii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | viii B.18.1. Heap Memory Allocation....................................................................... 130 B.18.2. Interoperability with Host Memory API......................................................131 B.18.3. Examples..........................................................................................131 B.18.3.1. Per Thread Allocation.....................................................................131 B.18.3.2. Per Thread Block Allocation............................................................. 132 B.18.3.3. Allocation Persisting Between Kernel Launches...................................... 133 B.19. Execution Configuration.............................................................................134 B.20. Launch Bounds........................................................................................ 134 B.21. #pragma unroll........................................................................................137 B.22. SIMD Video Instructions..............................................................................137 Appendix C. CUDA Dynamic Parallelism.................................................................. 139 C.1. Introduction.............................................................................................139 C.1.1. Overview........................................................................................... 139 C.1.2. Glossary............................................................................................ 139 C.2. Execution Environment and Memory Model....................................................... 140 C.2.1. Execution Environment.......................................................................... 140 C.2.1.1. Parent and Child Grids..................................................................... 140 C.2.1.2. Scope of CUDA Primitives..................................................................141 C.2.1.3. Synchronization..............................................................................141 C.2.1.4. Streams and Events.........................................................................141 C.2.1.5. Ordering and Concurrency.................................................................142 C.2.1.6. Device Management........................................................................ 142 C.2.2. Memory Model.................................................................................... 142 C.2.2.1. Coherence and Consistency............................................................... 143 C.3. Programming Interface................................................................................145 C.3.1. CUDA C/C++ Reference..........................................................................145 C.3.1.1. Device-Side Kernel Launch................................................................ 145 C.3.1.2. Streams....................................................................................... 146 C.3.1.3. Events......................................................................................... 147 C.3.1.4. Synchronization..............................................................................147 C.3.1.5. Device Management........................................................................ 147 C.3.1.6. Memory Declarations....................................................................... 148 C.3.1.7. API Errors and Launch Failures........................................................... 149 C.3.1.8. API Reference................................................................................150 C.3.2. Device-side Launch from PTX.................................................................. 151 C.3.2.1. Kernel Launch APIs......................................................................... 151 C.3.2.2. Parameter Buffer Layout.................................................................. 153 C.3.3. Toolkit Support for Dynamic Parallelism......................................................153 C.3.3.1. Including Device Runtime API in CUDA Code........................................... 153 C.3.3.2. Compiling and Linking......................................................................154 C.4. Programming Guidelines.............................................................................. 154 C.4.1. Basics............................................................................................... 154 C.4.2. Performance.......................................................................................155
C.4.2.1.Synchronization....... 155 C.4.2.2.Dynamic-parallelism-enabled Kernel Overhead.......................... 155 C.4.3.Implementation Restrictions and Limitations.................................156 C.4.3.1.Runtime..156 Appendix D.Mathematical Functions............. .159 D.1.Standard Functions............................. ….159 D.2.Intrinsic Functions................... ..167 Appendix E.C/C++Language Support............. 170 E.1.C++11 Language Features................. 170 E.2.C++14 Language Features................. 173 E.3.Restrictions..… .173 E.3.1.Host Compiler Extensions..... .173 E.3.2.Preprocessor Symbols............. .174 E.3.2.1._CUDA_ARCH.… ..174 E.3.3.Qualifiers............. .175 E.3.3.1.Device Memory Qualifiers.... .175 E.3.3.2.managed_Qualifier.............. .176 E.3.3.3.Volatile Qualifier......... 178 E.3.4.Pointers.… .179 E.3.5.0 perators.… .179 E.3.5.1.Assignment Operator.............. 179 E.3.5.2.Address Operator........... .179 E.3.6.Run Time Type Information (RTTI)... .179 E.3.7.Exception Handling................. 179 E.3.8.Standard Library............... …179 E.3.9.Functions................................. .180 E.3.9.1.External Linkage........................ …….180 E.3.9.2.Compiler generated functions......... 180 E.3.9.3.Function Parameters...... …180 E.3.9.4.Static Variables within Function... .181 E.3.9.5.Function Pointers............ ..181 E.3.9.6.Function Recursion................. …182 E.3.10.Classes...… 182 E.3.10.1.Data Members................... 182 E.3.10.2.Function Members..... 182 E.3.10.3.Virtual Functions................. .182 E.3.10.4.Virtual Base Classes... 182 E.3.10.5.Anonymous Unions....... …182 E.3.10.6.Windows-Specific...... ....182 E.3.11.Templates.… ….183 E.3.12.Trigraphs and Digraphs..... 183 E.3.13.Const-qualified variables. .184 E.3.14.C++11 Features.....… …184 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01ix
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | ix C.4.2.1. Synchronization..............................................................................155 C.4.2.2. Dynamic-parallelism-enabled Kernel Overhead........................................ 155 C.4.3. Implementation Restrictions and Limitations................................................156 C.4.3.1. Runtime.......................................................................................156 Appendix D. Mathematical Functions..................................................................... 159 D.1. Standard Functions.................................................................................... 159 D.2. Intrinsic Functions..................................................................................... 167 Appendix E. C/C++ Language Support.................................................................... 170 E.1. C++11 Language Features............................................................................ 170 E.2. C++14 Language Features............................................................................ 173 E.3. Restrictions..............................................................................................173 E.3.1. Host Compiler Extensions....................................................................... 173 E.3.2. Preprocessor Symbols............................................................................ 174 E.3.2.1. __CUDA_ARCH__.............................................................................174 E.3.3. Qualifiers...........................................................................................175 E.3.3.1. Device Memory Qualifiers..................................................................175 E.3.3.2. __managed__ Qualifier.....................................................................176 E.3.3.3. Volatile Qualifier............................................................................ 178 E.3.4. Pointers.............................................................................................179 E.3.5. Operators.......................................................................................... 179 E.3.5.1. Assignment Operator....................................................................... 179 E.3.5.2. Address Operator............................................................................179 E.3.6. Run Time Type Information (RTTI).............................................................179 E.3.7. Exception Handling...............................................................................179 E.3.8. Standard Library.................................................................................. 179 E.3.9. Functions...........................................................................................180 E.3.9.1. External Linkage.............................................................................180 E.3.9.2. Compiler generated functions............................................................ 180 E.3.9.3. Function Parameters........................................................................180 E.3.9.4. Static Variables within Function.......................................................... 181 E.3.9.5. Function Pointers............................................................................181 E.3.9.6. Function Recursion..........................................................................182 E.3.10. Classes............................................................................................ 182 E.3.10.1. Data Members.............................................................................. 182 E.3.10.2. Function Members......................................................................... 182 E.3.10.3. Virtual Functions...........................................................................182 E.3.10.4. Virtual Base Classes....................................................................... 182 E.3.10.5. Anonymous Unions......................................................................... 182 E.3.10.6. Windows-Specific...........................................................................182 E.3.11. Templates.........................................................................................183 E.3.12. Trigraphs and Digraphs......................................................................... 183 E.3.13. Const-qualified variables.......................................................................184 E.3.14. C++11 Features.................................................................................. 184
E.3.14.1.Lambda Expressions............ 184 E.3.14.2.std::initializer_list......... 185 E.3.14.3.Rvalue references................. .186 E.3.14.4.Constexpr functions and function templates............................186 E.3.14.5.Constexpr variables.............. .186 E.3.14.6.Intine namespaces.187 E.3.14.7.thread local...........................................188 E.3.14.8.global functions and function templates..... .189 E.3.14.9.device_/constant_/shared_variables......................................190 E.3.14.10.Defautted functions.... 190 E.3.15.C+14 Features.........190 E.3.15.1.Functions with deduced return type... 190 E.3.15.2.Variable templates......................... .191 E.3.15.3.[[deprecated]]attribute........... ..192 E.4.Polymorphic Function Wrappers....................... .192 E.5.Experimental Feature:Extended Lambdas.... 195 E.5.1.Extended Lambda Type Traits.......................... .197 E.5.2.Extended Lambda Restrictions...... 198 E.5.3.Notes onhostdevicelambdas..............205 E.5.4.*this Capture By Value........... .206 E.5.5.Additional Notes.............. 208 E.6.Code Samples............... ...210 E.6.1.Data Aggregation Class.............. 210 E.6.2.Derived class................ .210 E.6.3.Class Template.............. 211 E.6.4.Function Template............... .211 E.6.5.Functor class............ .212 Appendix F.Texture Fetching................ .213 F.1.Nearest-Point Sampling.... ,213 F.2.Linear Filtering................... 214 f.3.Table Lookup..… 215 Appendix G.Compute Capabilities........... .217 G.1.Features and Technical Specifications. 217 G.2.Floating-Point Standard................ .221 G.3.Compute Capability 2.x...... 222 G.3.1.Architecture.................... 222 G.3.2.Global Memory..... 223 G.3.3.Shared Memory.......... 224 G.3.4.Constant Memory...... ....225 G.4.Compute Capability 3.x........... 225 G.4.1.Architecture............. .225 G.4.2.Global Memory........... 227 G.4.3.Shared Memory.......... .228 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01×
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | x E.3.14.1. Lambda Expressions....................................................................... 184 E.3.14.2. std::initializer_list......................................................................... 185 E.3.14.3. Rvalue references..........................................................................186 E.3.14.4. Constexpr functions and function templates..........................................186 E.3.14.5. Constexpr variables........................................................................186 E.3.14.6. Inline namespaces......................................................................... 187 E.3.14.7. thread_local................................................................................ 188 E.3.14.8. __global__ functions and function templates.........................................189 E.3.14.9. __device__/__constant__/__shared__ variables......................................190 E.3.14.10. Defaulted functions...................................................................... 190 E.3.15. C++14 Features.................................................................................. 190 E.3.15.1. Functions with deduced return type................................................... 190 E.3.15.2. Variable templates.........................................................................191 E.3.15.3. [[deprecated]] attribute..................................................................192 E.4. Polymorphic Function Wrappers..................................................................... 192 E.5. Experimental Feature: Extended Lambdas........................................................ 195 E.5.1. Extended Lambda Type Traits.................................................................. 197 E.5.2. Extended Lambda Restrictions................................................................. 198 E.5.3. Notes on __host__ __device__ lambdas...................................................... 205 E.5.4. *this Capture By Value...........................................................................206 E.5.5. Additional Notes.................................................................................. 208 E.6. Code Samples...........................................................................................210 E.6.1. Data Aggregation Class.......................................................................... 210 E.6.2. Derived Class...................................................................................... 210 E.6.3. Class Template.................................................................................... 211 E.6.4. Function Template................................................................................211 E.6.5. Functor Class...................................................................................... 212 Appendix F. Texture Fetching.............................................................................. 213 F.1. Nearest-Point Sampling................................................................................213 F.2. Linear Filtering......................................................................................... 214 F.3. Table Lookup............................................................................................ 215 Appendix G. Compute Capabilities........................................................................ 217 G.1. Features and Technical Specifications............................................................. 217 G.2. Floating-Point Standard...............................................................................221 G.3. Compute Capability 2.x.............................................................................. 222 G.3.1. Architecture.......................................................................................222 G.3.2. Global Memory....................................................................................223 G.3.3. Shared Memory................................................................................... 224 G.3.4. Constant Memory.................................................................................225 G.4. Compute Capability 3.x.............................................................................. 225 G.4.1. Architecture.......................................................................................225 G.4.2. Global Memory....................................................................................227 G.4.3. Shared Memory................................................................................... 228