Data transfers between host memory and device memory are typically done using cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be copied from host memory to device memory:

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B and output vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}
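For brevity, the sample above does not check the error codes returned by the runtime API. Every runtime call returns a cudaError_t, and kernel launch errors can be retrieved with cudaGetLastError(). A minimal sketch of how the calls above could be checked is shown below; the CHECK macro is a hypothetical helper, not part of the CUDA API:

// Hypothetical error-checking helper (requires <stdio.h> and <stdlib.h>)
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Applied to the calls from the sample above:
CHECK(cudaMalloc(&d_A, size));
CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
CHECK(cudaGetLastError());        // reports launch configuration errors
CHECK(cudaDeviceSynchronize());   // reports errors raised during kernel execution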
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as they make sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements.
The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
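A pitched allocation can be copied to or from ordinary, contiguously laid-out host memory with cudaMemcpy2D(), which takes the source and destination pitches explicitly. The sketch below reuses devPtr and pitch from the sample above; the host buffer h_data is an assumption introduced for illustration:

// Contiguous host buffer, i.e. its pitch is width * sizeof(float)
float* h_data = (float*)malloc(width * height * sizeof(float));

// Copy the 2D array from host memory to the pitched device allocation
cudaMemcpy2D(devPtr, pitch,                    // destination and its pitch
             h_data, width * sizeof(float),    // source and its pitch
             width * sizeof(float), height,    // width in bytes, height in rows
             cudaMemcpyHostToDevice);

// Copy the 2D array back from device memory to host memory
cudaMemcpy2D(h_data, width * sizeof(float),
             devPtr, pitch,
             width * sizeof(float), height,
             cudaMemcpyDeviceToHost);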
The following code sample allocates a width x height x depth 3D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

The reference manual lists all the various functions used to copy memory between linear memory allocated with cudaMalloc(), linear memory allocated with cudaMallocPitch() or cudaMalloc3D(), CUDA arrays, and memory allocated for variables declared in global or constant memory space.
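As an illustration of one of these copy functions, the sketch below fills the 3D allocation from the sample above from a contiguous host buffer using cudaMemcpy3D(); the host buffer h_volume is an assumption, while extent and devPitchedPtr are reused from that sample:

// Contiguous host buffer holding width * height * depth floats (hypothetical)
float* h_volume = (float*)malloc(width * height * depth * sizeof(float));

// Describe the copy; the cudaMemcpy3DParms structure should be zero-initialized
cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr = make_cudaPitchedPtr(h_volume,
                                        width * sizeof(float),  // host pitch in bytes
                                        width, height);
copyParams.dstPtr = devPitchedPtr;  // pitched pointer returned by cudaMalloc3D()
copyParams.extent = extent;         // same extent as the allocation (width in bytes)
copyParams.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copyParams);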
The following code sample illustrates various ways of accessing global variables via the runtime API:

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
cudaGetSymbolAddress() is used to retrieve the address pointing to the memory allocated for a variable declared in global memory space. The size of the allocated memory is obtained through cudaGetSymbolSize().
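For example, the device address and size of the devData variable declared above could be queried as in the following sketch (the local variable names are illustrative):

// Retrieve the device address and size of the __device__ variable devData
float* d_devDataPtr;
size_t devDataSize;
cudaGetSymbolAddress((void**)&d_devDataPtr, devData);
cudaGetSymbolSize(&devDataSize, devData);

// The returned address can be used like any other device pointer
cudaMemcpy(d_devDataPtr, &value, devDataSize, cudaMemcpyHostToDevice);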
3.2.3. Shared Memory

As detailed in Variable Memory Space Specifiers, shared memory is allocated using the __shared__ memory space specifier.

Shared memory is expected to be much faster than global memory as mentioned in Thread Hierarchy and detailed in Shared Memory. Any opportunity to replace global memory accesses by shared memory accesses should therefore be exploited as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix multiplication that does not take advantage of shared memory. Each thread reads one row of A and one column of B and computes the corresponding element of C as illustrated in Figure 9. A is therefore read B.width times from global memory and B is read A.height times.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                  * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
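A minimal host-side driver for this sample might look like the following sketch; the dimensions and initialization values are illustrative, and the dimensions are chosen as multiples of BLOCK_SIZE as required above:

// Hypothetical driver code: multiply two 64 x 64 matrices
int n = 4 * BLOCK_SIZE;
Matrix A = { n, n, (float*)malloc(n * n * sizeof(float)) };
Matrix B = { n, n, (float*)malloc(n * n * sizeof(float)) };
Matrix C = { n, n, (float*)malloc(n * n * sizeof(float)) };
for (int i = 0; i < n * n; ++i) {
    A.elements[i] = 1.0f;
    B.elements[i] = 2.0f;
}

MatMul(A, B, C);  // every element of C now equals 2.0f * n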
Figure 9 Matrix Multiplication without Shared Memory
The following code sample is an implementation of matrix multiplication that does take advantage of shared memory. In this implementation, each thread block is responsible for computing one square sub-matrix Csub of C and each thread within the block is responsible for computing one element of Csub. As illustrated in Figure 10, Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (A.width, block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension (block_size, A.width) that has the same column indices as Csub. In order to fit into the device's resources, these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary and Csub is computed as the sum of the products of these square matrices. Each of these products is performed by first loading the two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix, and then by having each thread compute one element of the product. Each thread accumulates the result of each of these products into a register and once done writes the result to global memory.
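A condensed sketch of a kernel following this scheme is shown below; it reuses the Matrix type and BLOCK_SIZE defined earlier, assumes, as before, that all matrix dimensions are multiples of BLOCK_SIZE, and uses an illustrative kernel name. It would be launched with the same dimGrid and dimBlock configuration as MatMulKernel above.

// Tiled matrix multiplication sketch: each block computes one
// BLOCK_SIZE x BLOCK_SIZE sub-matrix Csub of C by accumulating the
// products of square tiles of A and B staged through shared memory.
__global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C)
{
    // Shared memory holding the current tile of A and of B
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int blockRow = blockIdx.y, blockCol = blockIdx.x;
    int row = threadIdx.y, col = threadIdx.x;

    // Each thread accumulates one element of Csub in a register
    float Cvalue = 0;

    // Loop over the tiles of A and B required to compute Csub
    for (int m = 0; m < A.width / BLOCK_SIZE; ++m) {
        // Each thread loads one element of each tile from global memory
        As[row][col] = A.elements[(blockRow * BLOCK_SIZE + row) * A.width
                                  + (m * BLOCK_SIZE + col)];
        Bs[row][col] = B.elements[(m * BLOCK_SIZE + row) * B.width
                                  + (blockCol * BLOCK_SIZE + col)];

        // Make sure both tiles are fully loaded before computing with them
        __syncthreads();

        // Multiply the two tiles together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Make sure the computation is done before the tiles are overwritten
        __syncthreads();
    }

    // Write the element of Csub to global memory
    C.elements[(blockRow * BLOCK_SIZE + row) * C.width
               + (blockCol * BLOCK_SIZE + col)] = Cvalue;
}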