Control Divergence Examples
– Divergence can arise when a branch or loop condition is a function of thread indices
– Example kernel statement with divergence:
  – if (threadIdx.x > 2) { }
  – This creates two different control paths for threads in a block
  – Decision granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
– Example without divergence:
  – if (blockIdx.x > 2) { }
  – Decision granularity is a multiple of block size; all threads in any given warp follow the same path
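The two conditions above can be checked with a small host-side model. This is only a sketch in plain C++, not device code: it assumes a 1-D block whose warps are 32 consecutive threads, and the names `warpDiverges`, `threadCond`, and `blockCond` are hypothetical helpers introduced for illustration.

```cpp
// Hypothetical host-side model (not a CUDA API): a warp is assumed to be
// 32 consecutive threads of a 1-D thread block.
constexpr int kWarpSize = 32;

// Returns true if the threads of warp `warp` in block `block` disagree on
// `pred`, i.e. some threads take the branch and some do not.
template <typename Pred>
bool warpDiverges(int block, int warp, int blockDim, Pred pred) {
    int first = warp * kWarpSize;
    int last = (first + kWarpSize < blockDim) ? first + kWarpSize : blockDim;
    bool taken = false;
    bool notTaken = false;
    for (int t = first; t < last; ++t) {
        if (pred(t, block)) taken = true;
        else notTaken = true;
    }
    return taken && notTaken;
}

// The two conditions from the slide, written as functions of
// (threadIdx.x, blockIdx.x).
inline bool threadCond(int threadIdxX, int /*blockIdxX*/) { return threadIdxX > 2; }
inline bool blockCond(int /*threadIdxX*/, int blockIdxX) { return blockIdxX > 2; }
```

Under these assumptions, `warpDiverges(0, 0, 256, threadCond)` is true because threads 0 to 2 and threads 3 to 31 share the first warp, while `blockCond` never splits a warp because every thread in a block sees the same `blockIdx.x`.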
Example: Vector Addition Kernel (Device Code)

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Analysis for vector size of 1,000 elements
– Assume that block size is 256 threads
  – 8 warps in each block
– All threads in Blocks 0, 1, and 2 are within valid range
  – i values from 0 to 767
  – There are 24 warps in these three blocks; none will have control divergence
– Most warps in Block 3 will not have control divergence
  – Threads in warps 0-6 (i values 768-991) are all within valid range, thus no control divergence
– One warp in Block 3 will have control divergence
  – Threads with i values 992-999 will all be within valid range
  – Threads with i values 1000-1023 will be outside valid range
– Effect of serialization due to control divergence will be small
  – 1 out of 32 warps (about 3%) has control divergence
  – The impact on performance will likely be about 3% or less
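The warp count in this analysis can be reproduced with a short host-side calculation. This is a sketch under stated assumptions (1-D grid, 32-thread warps, block size a multiple of 32); `DivergenceStats` and `analyzeBoundaryCheck` are hypothetical names, not part of the original kernel.

```cpp
// Hypothetical helper (not from the original slides): counts the warps in
// which the boundary check `i < n` splits the threads. Assumes a 1-D grid,
// 32-thread warps, and a block size that is a multiple of 32.
struct DivergenceStats {
    int totalWarps;      // warps launched across all blocks
    int divergentWarps;  // warps with threads on both sides of i < n
};

DivergenceStats analyzeBoundaryCheck(int n, int blockSize) {
    const int warpSize = 32;
    int numBlocks = (n + blockSize - 1) / blockSize;  // ceiling division
    int totalThreads = numBlocks * blockSize;
    DivergenceStats stats{totalThreads / warpSize, 0};
    for (int w = 0; w < stats.totalWarps; ++w) {
        int first = w * warpSize;         // smallest global index i in this warp
        int last = first + warpSize - 1;  // largest global index i in this warp
        // Divergent only if the warp straddles the boundary at n.
        if (first < n && last >= n) ++stats.divergentWarps;
    }
    return stats;
}
```

For `analyzeBoundaryCheck(1000, 256)` this gives 32 warps in total with 1 divergent warp (the one covering i values 992-1023), matching the figures above. When n is a multiple of the warp size, such as 1024, no warp diverges.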
Warps and SIMD performance impact of control divergence parallel reduction
Objective
– To learn to analyze the performance impact of control divergence
  – Boundary condition checking
  – Control divergence is data-dependent