Matrix Layout

[Figure: a Width x Width matrix M laid out in linear, row-major memory -- elements M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, then M0,2 M1,2 M2,2 M3,2, ... stored row after row.]
Matrix Main Program

int main(void)
{
    // 1. Allocate and initialize the matrices M, N, P
    //    I/O to read the input matrices M and N

    // 2. M * N on the device
    MatrixMultiplication(M, N, P, Width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    return 0;
}
Kernel Program

void MatrixMultiplication(float *M, float *N, float *P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory for M, N, and P
    //    Copy M and N to allocated device memory locations

    // 2. Kernel invocation code - to have the device perform
    //    the actual matrix multiplication

    // 3. Copy P from the device memory
    //    Free device matrices
}
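The three commented steps in the skeleton expand roughly as follows. This is a sketch only (it needs a CUDA toolchain and a GPU to run); the kernel name MatrixMulKernel and the one-block launch configuration are my assumptions, not something this slide defines:

```cuda
// Sketch: fills in the three commented steps of MatrixMultiplication,
// assuming a __global__ kernel MatrixMulKernel (not shown here).
void MatrixMultiplication(float *M, float *N, float *P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy M and N into it
    cudaMalloc((void **)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&Pd, size);

    // 2. Launch the kernel: one Width x Width block of threads
    //    (hypothetical configuration; only works for small Width)
    dim3 dimBlock(Width, Width);
    dim3 dimGrid(1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Copy P back to the host and free device matrices
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
```

The cudaMalloc/cudaMemcpy/cudaFree calls used here are exactly the APIs the next two slides introduce.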
Creating CUDA Memory Space

TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void **)&Md, size);
cudaFree(Md);
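In practice the allocation is usually paired with a return-code check, since cudaMalloc and cudaFree both return a cudaError_t. A hedged sketch of the same pattern with checking added (my addition, not on the slide; needs a CUDA toolchain to build):

```cuda
// Sketch: the slide's allocate/free pattern with error checking added.
#include <stdio.h>

#define TILE_WIDTH 64

void allocate_and_free(void)
{
    float *Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaError_t err = cudaMalloc((void **)&Md, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return;
    }
    /* ... use Md on the device ... */
    cudaFree(Md);
}
```

Note that Md is a device pointer: it must not be dereferenced on the host, only passed to kernels and to cudaMemcpy.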
Memory Copy

[Figure: CUDA device memory model -- a grid of blocks (Block(0,0), Block(1,0)), each with shared memory, per-thread registers, and threads (Thread(0,0), Thread(1,0)).]

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
- Transfer is asynchronous

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
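The two calls above form a round trip: host to device, then device back to host. A sketch that makes the round trip explicit (my own framing; it needs a GPU to actually run, and the second host buffer M_check is hypothetical):

```cuda
// Sketch: round trip of the slide's two cudaMemcpy calls -- copy M up
// to the device, copy it back into a second host buffer, and compare.
#include <string.h>
#include <assert.h>

void round_trip(const float *M, float *M_check, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md;

    cudaMalloc((void **)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);       // host -> device
    cudaMemcpy(M_check, Md, size, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(Md);

    // The copy is byte-for-byte, so M_check now equals M.
    assert(memcmp(M, M_check, size) == 0);
}
```

The direction constant must match the pointer arguments: the destination pointer comes first, the source second, mirroring the order in standard memcpy.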