Problem in the Program

- In each iteration, two control flow paths are traversed sequentially for each warp:
  - threads that perform addition and threads that do not
  - threads that do not perform addition may still cost extra cycles, depending on how divergence is implemented
- No more than half of the threads are executing at any time:
  - all odd-index threads are disabled right from the beginning
  - on average, less than 1/4 of the threads are active across all warps over time
  - after the 5th iteration, entire warps in each block are disabled: poor resource utilization, but no divergence in those warps
  - this can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration has only one thread active per remaining warp, until all warps retire
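The thread counts quoted above can be checked with a small host-side model of the interleaved branch condition `t % (2*stride) == 0`. This is only a sketch: the names `statsForStride`, `kBlockSize` and `kWarpSize` are made up for illustration, and it assumes 512 threads per block and 32 threads per warp, as in the slide's arithmetic. It uses no device code, so it builds as an ordinary host program or inside a `.cu` file.

```cuda
#include <cstdio>

// Assumed sizes, taken from the slide's 512/32 = 16 arithmetic.
constexpr unsigned kBlockSize = 512;  // threads per block
constexpr unsigned kWarpSize  = 32;   // threads per warp

struct IterationStats {
    unsigned activeThreads;  // threads that take the addition branch
    unsigned idleWarps;      // warps in which no thread takes the branch
};

// Count, for one reduction step, how many threads satisfy
// t % (2*stride) == 0 and how many warps have no such thread.
IterationStats statsForStride(unsigned stride) {
    IterationStats s{0, 0};
    for (unsigned warp = 0; warp < kBlockSize / kWarpSize; ++warp) {
        unsigned adding = 0;
        for (unsigned lane = 0; lane < kWarpSize; ++lane) {
            unsigned t = warp * kWarpSize + lane;
            if (t % (2 * stride) == 0) ++adding;
        }
        s.activeThreads += adding;
        if (adding == 0) ++s.idleWarps;
    }
    return s;
}

int main() {
    unsigned iteration = 1;
    for (unsigned stride = 1; stride < kBlockSize; stride *= 2, ++iteration) {
        IterationStats s = statsForStride(stride);
        std::printf("iteration %u (stride %3u): %3u active threads, %2u idle warps\n",
                    iteration, stride, s.activeThreads, s.idleWarps);
    }
    return 0;
}
```

Under these assumptions the loop runs 9 times: the first 5 iterations leave every warp with at least one active thread, and the remaining 4 progressively idle whole warps.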
Problem in the Program

    // assume we have already loaded array into
    // __shared__ float partialSum[]
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

BAD: Divergence due to interleaved branch decisions
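Better Implementation

    // assume we have already loaded array into
    // __shared__ float partialSum[] (blockDim.x elements, blockDim.x a power of two)
    unsigned int t = threadIdx.x;
    // start at half the block so that t+stride stays in range,
    // and include the final stride == 1 step
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }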
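To try the contiguous-thread loop end to end, it can be wrapped in a complete kernel with a small host driver. The following is a minimal sketch, not the course's reference code: the names `blockSum`, `sumOnDevice` and `kBlockSize` are hypothetical, a single block of 512 threads is assumed, the input is assumed to hold exactly 512 floats, and CUDA error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumption: one block whose size is a power of two.
constexpr unsigned kBlockSize = 512;

// Sums kBlockSize floats using the "t < stride" pattern, so the active
// threads in every step are the contiguous range 0 .. stride-1.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float partialSum[kBlockSize];
    unsigned int t = threadIdx.x;
    partialSum[t] = in[t];  // load one element per thread

    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
        __syncthreads();  // make the previous step's writes visible
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Thread 0 performed the last addition itself, so it can read the result.
    if (t == 0) out[0] = partialSum[0];
}

// Host helper: copies kBlockSize floats to the device, runs one block,
// and returns the sum. Expects h.size() == kBlockSize.
float sumOnDevice(const std::vector<float>& h) {
    float* dIn = nullptr;
    float* dOut = nullptr;
    float result = 0.0f;
    cudaMalloc(&dIn, kBlockSize * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), kBlockSize * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<1, kBlockSize>>>(dIn, dOut);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main() {
    std::vector<float> h(kBlockSize, 1.0f);  // all ones, so the sum equals the element count
    std::printf("sum = %.1f\n", sumOnDevice(h));
    return 0;
}
```

With this indexing, as long as `stride` is at least 32 each warp is either fully active or fully idle, so the branch only diverges within a warp during the last five steps (`stride` of 16 down to 1).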