Shanghai Jiao Tong University
CS427 Multicore Architecture and Parallel Computing
Lecture 8: CUDA, cont'd
Prof. Xiaoyao Liang
2016/10/26
Register File Limitation
• If each Block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
– Each Block requires 10 * 256 = 2560 registers
– 8192 = 3 * 2560 + change
– So, three Blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by one?
– Each Block now requires 11 * 256 = 2816 registers
– 8192 < 2816 * 3
– Only two Blocks can run on an SM, a 1/3 reduction of parallelism!
Dynamic Partitioning
• Dynamic partitioning gives more flexibility to compilers/programmers
– One can run either a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
– This allows for finer-grain threading than traditional CPU threading models
• The compiler can trade off between instruction-level parallelism and thread-level parallelism
ILP vs. TLP
• Assume a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; 3 Blocks can fit on an SM, and global loads take 400 cycles
– 4 cycles * 4 inst * 24 warps = 384 < 400
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only 2 Blocks can fit
– 4 cycles * 8 inst * 16 warps = 512 > 400, better hiding of memory latency
Memory Coalescing
[Figure: a 4×4 matrix M with elements M0,0 through M3,3, showing the access direction in kernel code; in Time Period 1 the threads access M0,0 M1,0 M2,0 M3,0, and in Time Period 2 they access M0,1 M1,1 M2,1 M3,1.]