G) Divergence Main performance concern with branching is divergence Threads within a single warp take different paths Different execution paths are serialized in G80 The control paths taken by the threads in a warp are traversed one at a time until there is no more A common case: avoid divergence when branch condition is a function of thread iD Example with divergence °工f( threadx,x>2){} This creates two different control paths for threads in a block Branch granularity warp size: threads 0 and l follow different patl than the rest of the threads in the first warp Example without divergence If (threadIdx. x/ WARP SIZE >2)) Also creates two different control paths for threads in a block Branch granularity is a whole multiple of warp size: all threads in any given warp follow the same path
Divergence 11
G)Reduction Given an array of values reduce'' them to a single value in parallel Examples sum reduction: sum of all values in the array Max reduction: maximum of all values in the array Typically parallel implementation Recursively halve threads, add two values per thread Takes log(n) steps for n elements, requires n/2 threads
Reduction 12
O Reduction Program assume we have already loaded array into shared float partialSum unsigned int t= threadIdx xi for (unsigned int stride =1 stride blockDim x; stride *=2) syncthreads ()i if(t(2*stride)==0) partialsum[t] + partialSum[t+strideji
Reduction Program 13
G) Reduction Program Array elements 0123456789101 10+1 2+3 4+5 6+ 8+910+11 1 iterations
Reduction Program 14
G) Divergence in Reduction Thread o Thread 2 read 4 Thread 6 Thread 8 Thread 10 01234567891011 10+1 2+3 4+5 6+7 8+910+11 30.7 iteration
Divergence in Reduction 15