Memory Coalescing

[Figure: access direction in kernel code through a 4x4 matrix M (elements M0,0 through M3,3), showing which elements consecutive threads touch in time period 1 and time period 2.]
Memory Coalescing 6
Memory Coalescing

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m * TILE_WIDTH + ty) * Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}
Memory Coalescing 7
Increasing Per Thread Work

Each thread computes two elements of Pd:
- Reduced loads from global memory (Md) to shared memory
- Reduced instruction overhead
- More work done in each iteration

[Figure: TILE_WIDTH-wide tiles, with each thread covering two output tiles.]
Increasing Per Thread Work 8
Double Buffering

One could double buffer the computation, getting a better instruction mix within each thread. This is classic software pipelining in ILP compilers.

Without double buffering:

Loop {
  Load current tile to shared memory
  __syncthreads()
  Compute current tile
  __syncthreads()
}

With double buffering:

Load next tile from global memory
Loop {
  Deposit current tile to shared memory
  __syncthreads()
  Load next tile from global memory
  Compute current tile
  __syncthreads()
}
Double Buffering 9
Double Buffering

- Deposit blue tile from register into shared memory
- __syncthreads()
- Load orange tile into register
- Compute blue tile
- Deposit orange tile into shared memory

[Figure: blue and orange TILE_WIDTH-wide tiles staged through registers into shared memory.]
Double Buffering 10