CUDA Device and Threads

• A compute device
  ➢ Is a coprocessor to the CPU, or host
  ➢ Has its own DRAM (device memory)
  ➢ Runs many threads in parallel
  ➢ Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads (see the sketch below)
• Differences between GPU and CPU threads
  ➢ GPU threads are extremely lightweight, with very little creation overhead
  ➢ A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
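To make these points concrete, here is a minimal sketch (not from the slides) of a data-parallel device kernel launched across thousands of lightweight GPU threads; the kernel name square, the element count, and the 256-thread block size are illustrative assumptions.

    // One GPU thread per array element: data-parallel work expressed as a device kernel.
    #include <cuda_runtime.h>

    __global__ void square(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            data[i] = data[i] * data[i];
    }

    int main(void)
    {
        const int n = 1 << 20;                            // ~1M elements -> ~1M threads
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));  // allocate device DRAM
        square<<<(n + 255) / 256, 256>>>(d_data, n);      // 4096 blocks of 256 threads
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }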
C Extension

• Declspecs: __global__, __device__, __shared__, __local__, __constant__
• Keywords: threadIdx, blockIdx
• Intrinsics: __syncthreads
• Runtime API: memory, symbol, and execution management
• Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];

        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);
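The slide's convolve kernel is only a fragment (the "..." lines are elided). Below is a minimal runnable variant, assuming a simple 3-point averaging filter so that the shared-memory load, the __syncthreads() barrier, the cudaMalloc allocation, and the <<<100, 10>>> launch can all be seen end to end; the averaging filter and the cudaMemset initialization are assumptions, not part of the original slide.

    #include <cuda_runtime.h>

    #define M 10   // threads per block, matching the <<<100, 10>>> launch below

    // Each block stages its slice of the image in __shared__ memory, synchronizes,
    // then writes back a 3-point average (a stand-in for the slide's filter).
    __global__ void convolve(float *image)
    {
        __shared__ float region[M];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        region[threadIdx.x] = image[i];

        __syncthreads();   // all shared-memory loads must finish before any thread reads them

        float left   = (threadIdx.x > 0)     ? region[threadIdx.x - 1] : region[threadIdx.x];
        float right  = (threadIdx.x < M - 1) ? region[threadIdx.x + 1] : region[threadIdx.x];
        float result = (left + region[threadIdx.x] + right) / 3.0f;

        image[i] = result;
    }

    int main(void)
    {
        const int n = 100 * M;                      // 100 blocks of 10 threads cover 1000 elements
        size_t bytes = n * sizeof(float);

        float *myimage;
        cudaMalloc((void **)&myimage, bytes);       // Runtime API: allocate GPU memory
        cudaMemset(myimage, 0, bytes);

        convolve<<<100, M>>>(myimage);              // Function launch: 100 blocks, 10 threads per block
        cudaDeviceSynchronize();

        cudaFree(myimage);
        return 0;
    }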
Compilation Flow

Integrated source (foo.cu) is compiled by cudacc, which is built from the EDG C/C++ frontend and the Open64 Global Optimizer. cudacc splits the program into GPU assembly (foo.s) and CPU host code (foo.cpp): the GPU assembly goes through OCG to produce G80 SASS, while the host code is compiled by gcc/cl.

[Diagram; see Mark Murphy, "NVIDIA's Experience with Open64"]
Compilation Flow

A C/C++ CUDA application is compiled by NVCC into CPU code plus virtual PTX code; a PTX-to-target compiler then turns the PTX into physical target code for a particular GPU (e.g., G80). For example, the CUDA statements

    float4 me = gx[gtid];
    me.x += me.y * me.z;

compile to PTX along the lines of

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;
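To see this flow on a real toolchain, the sketch below packages the slide's float4 example as a standalone kernel; the file name, kernel name, and the nvcc commands in the comments are assumptions about a typical setup rather than part of the slides.

    // ptx_demo.cu -- inspect the virtual PTX and the physical target code:
    //   nvcc -ptx ptx_demo.cu -o ptx_demo.ptx    # stop after generating virtual PTX
    //   nvcc -c ptx_demo.cu -arch=sm_70          # PTX-to-target step for a concrete GPU
    __global__ void ptx_demo(float4 *gx)
    {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        float4 me = gx[gtid];     // vector load; shows up in the PTX as ld.global.v4.f32
        me.x += me.y * me.z;      // multiply-add; shows up as mad.f32 (or an fma)
        gx[gtid] = me;            // store the result so the compiler cannot discard the work
    }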
Matrix Multiplication

    void MatrixMultiplication(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

Each element of P is the dot product of a row of M with a column of N. For 1000x1000 matrices that is 1,000,000 independent dot products, each requiring 1000 multiplies and 1000 accumulates.
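Because all 1,000,000 dot products are independent, the natural CUDA mapping is one thread per element of P. The kernel below is a minimal sketch of that idea rather than the course's implementation; the kernel name, the bounds check, and the 16x16 block shape are illustrative assumptions.

    // One thread computes one element of P = M * N (all matrices Width x Width, row-major).
    __global__ void MatrixMulKernel(const float *M, const float *N, float *P, int Width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < Width && col < Width) {
            float sum = 0.0f;
            for (int k = 0; k < Width; ++k)      // one dot product per thread
                sum += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = sum;
        }
    }

    // Host-side launch for Width = 1000 (assuming d_M, d_N, d_P are already allocated and filled):
    //   dim3 block(16, 16);
    //   dim3 grid((Width + 15) / 16, (Width + 15) / 16);
    //   MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, Width);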