Shanghai Jiao Tong University
CS427 Multicore Architecture and Parallel Computing
Lecture 8: CUDA, cont'd
Prof. Xiaoyao Liang
2016/10/26
Register File Limitation
• If each Block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
– Each Block requires 10 * 256 = 2560 registers
– 8192 = 3 * 2560 + change
– So, three Blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by one?
– Each Block now requires 11 * 256 = 2816 registers
– 8192 < 2816 * 3
– Only two Blocks can run on an SM, a 1/3 reduction of parallelism!
Dynamic Partitioning
• Dynamic partitioning gives more flexibility to compilers/programmers
– One can run either a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
– This allows for finer-grain threading than traditional CPU threading models
• The compiler can trade off between instruction-level parallelism and thread-level parallelism
ILP vs. TLP
• Assume a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; 3 Blocks can fit on an SM, and global loads take 400 cycles
– 4 cycles * 4 inst * 24 warps = 384 < 400
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only 2 Blocks can fit
– 4 cycles * 8 inst * 16 warps = 512 > 400, better hiding of memory latency
Memory Coalescing
[Figure: a 4×4 matrix M with elements M0,0 through M3,3, showing the access direction in kernel code; in Time Period 1 the threads access M0,0 M1,0 M2,0 M3,0, and in Time Period 2 they access M0,1 M1,1 M2,1 M3,1.]