CUDA Device and Threads

• A compute device
  ➢ Is a coprocessor to the CPU, or host
  ➢ Has its own DRAM (device memory)
  ➢ Runs many threads in parallel
  ➢ Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads (see the sketch below)
• Differences between GPU and CPU threads
  ➢ GPU threads are extremely lightweight, with very little creation overhead
  ➢ A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
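To make these points concrete, here is a minimal sketch (not from the slides) of a data-parallel device kernel launched across thousands of lightweight GPU threads; the kernel name square, the element count, and the 256-thread block size are illustrative assumptions.

    // One GPU thread per array element: data-parallel work expressed as a device kernel.
    #include <cuda_runtime.h>

    __global__ void square(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            data[i] = data[i] * data[i];
    }

    int main(void)
    {
        const int n = 1 << 20;                            // ~1M elements -> ~1M threads
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));  // allocate device DRAM
        square<<<(n + 255) / 256, 256>>>(d_data, n);      // 4096 blocks of 256 threads
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }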
C Extension

• Declspecs: __global__, __device__, __shared__, __local__, __constant__
• Keywords: threadIdx, blockIdx
• Intrinsics: __syncthreads
• Runtime API: memory, symbol, and execution management
• Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];

        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);
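The slide's convolve kernel is only a fragment (the "..." lines are elided). Below is a minimal runnable variant, assuming a simple 3-point averaging filter so that the shared-memory load, the __syncthreads() barrier, the cudaMalloc allocation, and the <<<100, 10>>> launch can all be seen end to end; the averaging filter and the cudaMemset initialization are assumptions, not part of the original slide.

    #include <cuda_runtime.h>

    #define M 10   // threads per block, matching the <<<100, 10>>> launch below

    // Each block stages its slice of the image in __shared__ memory, synchronizes,
    // then writes back a 3-point average (a stand-in for the slide's filter).
    __global__ void convolve(float *image)
    {
        __shared__ float region[M];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        region[threadIdx.x] = image[i];

        __syncthreads();   // all shared-memory loads must finish before any thread reads them

        float left   = (threadIdx.x > 0)     ? region[threadIdx.x - 1] : region[threadIdx.x];
        float right  = (threadIdx.x < M - 1) ? region[threadIdx.x + 1] : region[threadIdx.x];
        float result = (left + region[threadIdx.x] + right) / 3.0f;

        image[i] = result;
    }

    int main(void)
    {
        const int n = 100 * M;                      // 100 blocks of 10 threads cover 1000 elements
        size_t bytes = n * sizeof(float);

        float *myimage;
        cudaMalloc((void **)&myimage, bytes);       // Runtime API: allocate GPU memory
        cudaMemset(myimage, 0, bytes);

        convolve<<<100, M>>>(myimage);              // Function launch: 100 blocks, 10 threads per block
        cudaDeviceSynchronize();

        cudaFree(myimage);
        return 0;
    }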
Compilation Flow

Integrated source (foo.cu) is compiled by cudacc, which is built from the EDG C/C++ frontend and the Open64 Global Optimizer. cudacc splits the program into GPU assembly (foo.s) and CPU host code (foo.cpp): the GPU assembly goes through OCG to produce G80 SASS, while the host code is compiled by gcc/cl.

[Diagram; see Mark Murphy, "NVIDIA's Experience with Open64"]
Compilation Flow

A C/C++ CUDA application is compiled by NVCC into CPU code plus virtual PTX code; a PTX-to-target compiler then turns the PTX into physical target code for a particular GPU (e.g., G80). For example, the CUDA statements

    float4 me = gx[gtid];
    me.x += me.y * me.z;

compile to PTX along the lines of

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;
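To see this flow on a real toolchain, the sketch below packages the slide's float4 example as a standalone kernel; the file name, kernel name, and the nvcc commands in the comments are assumptions about a typical setup rather than part of the slides.

    // ptx_demo.cu -- inspect the virtual PTX and the physical target code:
    //   nvcc -ptx ptx_demo.cu -o ptx_demo.ptx    # stop after generating virtual PTX
    //   nvcc -c ptx_demo.cu -arch=sm_70          # PTX-to-target step for a concrete GPU
    __global__ void ptx_demo(float4 *gx)
    {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        float4 me = gx[gtid];     // vector load; shows up in the PTX as ld.global.v4.f32
        me.x += me.y * me.z;      // multiply-add; shows up as mad.f32 (or an fma)
        gx[gtid] = me;            // store the result so the compiler cannot discard the work
    }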
Matrix Multiplication

    void MatrixMultiplication(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

Each element of P is the dot product of a row of M with a column of N. For 1000x1000 matrices that is 1,000,000 independent dot products, each requiring 1000 multiplies and 1000 accumulates.
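Because all 1,000,000 dot products are independent, the natural CUDA mapping is one thread per element of P. The kernel below is a minimal sketch of that idea rather than the course's implementation; the kernel name, the bounds check, and the 16x16 block shape are illustrative assumptions.

    // One thread computes one element of P = M * N (all matrices Width x Width, row-major).
    __global__ void MatrixMulKernel(const float *M, const float *N, float *P, int Width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < Width && col < Width) {
            float sum = 0.0f;
            for (int k = 0; k < Width; ++k)      // one dot product per thread
                sum += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = sum;
        }
    }

    // Host-side launch for Width = 1000 (assuming d_M, d_N, d_P are already allocated and filled):
    //   dim3 block(16, 16);
    //   dim3 grid((Width + 15) / 16, (Width + 15) / 16);
    //   MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, Width);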