Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing
Lecture 14: GPGPU Architecture and Programming Paradigm
Outline
• GPGPU Architecture Overview
• Core Architecture
• Memory Hierarchy
• Interconnect
• CPU-GPU Interfacing
• Programming Paradigm
Basic Blocks
• Several shader cores / streaming multiprocessors (SMs)
• Interconnection network
• On-chip memory controllers
• On-chip caches (level 1/2)
• Off-chip DRAM
[Figure: GPU block diagram. On-chip area: SMs grouped into TPCs (TPC-0, TPC-1), interconnect network, L2 caches, and DRAM controllers, with the host CPU attached over the PCI Express bus. Off-chip area: DRAM chips.]
Basic Blocks
• Thread batch: the hardware unit of thread execution (warp on NVIDIA, wavefront on ATI)
• Hardware thread scheduling
• Each thread has dedicated registers
• Shared memory is shared among the threads of a thread block
• All threads in a warp share the same PC
• Separate ALU and memory pipelines
[Figure: texture processor clusters (Cluster 0 to Cluster M), each containing several SMs, connected by a high-bandwidth on-chip interconnect to L2 caches and memory controllers (MC0 to MCL), which drive the off-chip DRAM array. Each SM contains SPs, a thread scheduler, instruction cache, decoder, texture cache, constant cache, shared memory, and register file.]
Figure annotations:
GPU kernels launch lightweight threads grouped into thread blocks:
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
Kernels are compiled with the CUDA compiler to PTX assembly:
    ...
    mov.s32 %r14, 15;
    and.b32 %r15, %r13, %r14;
    add.s32 %r16, %r15, %r12;
    shr.s32 %r17, %r16, 4;
    ...
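The grouping of a thread block's threads into warps can be sketched on the host in plain Python. This is only a model of the index arithmetic, not GPU code. It assumes a warp size of 32 (the usual NVIDIA value, not stated on the slide) and a 16x16 thread block for the example; the function names `warp_and_lane` and `warps_per_block` are hypothetical.

```python
WARP_SIZE = 32  # assumption: typical NVIDIA warp size, not given on the slide


def warp_and_lane(thread_idx, warp_size=WARP_SIZE):
    """Map a linear thread index within a block to (warp id, lane id).

    Threads with the same warp id share one PC and are issued together.
    """
    return thread_idx // warp_size, thread_idx % warp_size


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps needed for one thread block (rounded up)."""
    return (threads_per_block + warp_size - 1) // warp_size


if __name__ == "__main__":
    # Assumed example: a 16x16 thread block has 256 threads.
    print(warps_per_block(16 * 16))  # 8
    print(warp_and_lane(37))         # (1, 5)
```

A block whose size is not a multiple of the warp size still occupies a whole number of warps, which is why the count is rounded up: a 33-thread block uses two warps, the second with a single active lane.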
Streaming Multiprocessor
• Multi-thread (MT) unit
• Instruction cache / decoder
• Several streaming processors (SPs)
• Load/store and special function (SFU) units
• Large register file
• Shared memory
• Shared texture caches
• Constant cache
[Figure: a streaming multiprocessor inside a TPC, showing the MT unit, instruction cache and decoder, SPs, register file, shared memory, SFU, load/store unit, and the shared texture and constant caches.]
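Because the register file and shared memory are per-SM resources, they bound how many thread blocks can be resident on one SM at a time. The sketch below is a simplified model of that bound. All capacities in the example are assumed values for illustration, not figures from the lecture or from any specific GPU, and the function name `resident_blocks_per_sm` is hypothetical. Real hardware also applies allocation granularities and a cap on blocks per SM, which this sketch ignores.

```python
def resident_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                           regs_per_sm, smem_per_sm, max_threads_per_sm):
    """Return how many thread blocks fit on one SM at once.

    Each resource gives its own limit; the tightest one wins.
    """
    limits = [
        # Register file: every thread has dedicated registers.
        regs_per_sm // (regs_per_thread * threads_per_block),
        # Thread slots tracked by the hardware scheduler.
        max_threads_per_sm // threads_per_block,
    ]
    if smem_per_block > 0:
        # Shared memory is allocated per thread block.
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


if __name__ == "__main__":
    # Assumed SM: 32768 registers, 48 KiB shared memory, 1536 thread slots.
    blocks = resident_blocks_per_sm(
        threads_per_block=256, regs_per_thread=32, smem_per_block=8192,
        regs_per_sm=32768, smem_per_sm=49152, max_threads_per_sm=1536)
    print(blocks)  # 4: the register file is the limiting resource here
```

In this assumed configuration the register file allows 4 blocks while shared memory and thread slots would each allow 6, so reducing registers per thread is what would raise the number of resident blocks.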