Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing

Outline
• GPGPU Architecture Overview
• Core Architecture
• Memory Hierarchy
• Interconnect
• CPU-GPU Interfacing
• Programming Paradigm
Inside Streaming Multiprocessor
• Streaming Multiprocessor (G80)
  – 8 Streaming Processors (SP)
  – 2 Special Function Units (SFU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Covers the latency of texture/memory loads
• 16 KB shared memory per SM (see the kernel sketch below)
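A minimal CUDA sketch, not taken from the slides, of how a kernel stages data through per-block shared memory. The TILE size, kernel name, and scaling operation are assumptions for illustration; the point is that each block's static __shared__ allocation (here 256 floats = 1 KB) must fit within the SM's 16 KB shared memory.

#include <cuda_runtime.h>

#define TILE 256   // hypothetical block/tile size chosen for this illustration

// Stage a tile of the input through shared memory, then scale it.
// Each block statically allocates TILE * 4 B = 1 KB of the SM's 16 KB.
__global__ void scale_with_shared(const float* in, float* out, int n, float s) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                        // wait for every warp of the block
    if (i < n) out[i] = tile[threadIdx.x] * s;
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;                    // left uninitialized; only the memory path matters here
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    scale_with_shared<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}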
Register File
• 8192 registers in each SM in G80
  – Implementation decision, not part of the programming abstraction
  – Registers are dynamically partitioned across all blocks assigned to the SM (worked example below)
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Threads only access the registers assigned to them
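A back-of-the-envelope sketch of that dynamic partitioning. The 8192-register figure comes from the slide; the registers-per-thread and threads-per-block numbers are assumptions chosen only to make the arithmetic concrete.

#include <cstdio>

int main() {
    const int regs_per_sm       = 8192;  // G80 register file per SM (from the slide)
    const int regs_per_thread   = 10;    // assumed kernel register usage
    const int threads_per_block = 256;   // assumed block size

    int regs_per_block = regs_per_thread * threads_per_block;  // 2560 registers
    int blocks_by_regs = regs_per_sm / regs_per_block;          // 8192 / 2560 = 3 blocks

    printf("registers needed per block: %d\n", regs_per_block);
    printf("blocks per SM limited by registers: %d\n", blocks_by_regs);
    return 0;
}

With these assumed numbers the register file caps the SM at 3 resident blocks; a kernel that uses more registers per thread would lower that count further, independently of the block and thread limits discussed later.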
Thread Dispatch Policy
• Hierarchy: a grid of blocks of threads
• Blocks are serially distributed to SMs
  – Potentially >1 block per SM
• SM launches warps (32 threads each)
  – 2 levels of parallelism (see the sketch below)
• Round-robin, ready-to-execute scheduling policy
Figure source: NVIDIA CUDA Programming Guide 2.3
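A minimal CUDA sketch, my own illustration rather than an excerpt from the Programming Guide, of the two levels of parallelism: the host launches a grid of blocks, and the SM partitions each block into 32-thread warps. The kernel name whoami and the launch dimensions are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// Record, for each thread, which warp of its block it falls into.
__global__ void whoami(int* warp_of_thread) {
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int warp = threadIdx.x / 32;                       // warp index inside the block
    warp_of_thread[gid] = warp;
}

int main() {
    dim3 grid(4);      // level 1: 4 blocks, serially distributed to SMs
    dim3 block(128);   // level 2: 128 threads/block, issued as 4 warps of 32
    int n = grid.x * block.x;

    int* d;
    cudaMalloc(&d, n * sizeof(int));
    whoami<<<grid, block>>>(d);

    int h[512];
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 0 -> warp %d, thread 32 -> warp %d, thread 127 -> warp %d\n",
           h[0], h[32], h[127]);   // prints 0, 1, 3
    cudaFree(d);
    return 0;
}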
Block Execution: Software View
• SM block execution (G80)
  – Assignment in block granularity
• Up to 8 blocks/SM, as resources allow
• An SM in G80 can hold up to 768 threads (arithmetic sketch below)
  – 256 threads/block * 3 blocks
  – Or 128 threads/block * 6 blocks, etc.
• Threads run concurrently
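A small host-side sketch of the slide's arithmetic, combining the 8-block and 768-thread limits quoted above; the helper name blocks_per_sm and the extra 64-thread case are my additions.

#include <cstdio>
#include <algorithm>

// G80 per-SM limits quoted on the slide: at most 8 resident blocks and 768 threads.
int blocks_per_sm(int threads_per_block) {
    const int max_blocks  = 8;
    const int max_threads = 768;
    return std::min(max_blocks, max_threads / threads_per_block);
}

int main() {
    printf("256 threads/block -> %d blocks/SM\n", blocks_per_sm(256)); // 3 (thread limit binds)
    printf("128 threads/block -> %d blocks/SM\n", blocks_per_sm(128)); // 6 (thread limit binds)
    printf(" 64 threads/block -> %d blocks/SM\n", blocks_per_sm(64));  // 8 (block-count limit binds)
    return 0;
}

This ignores the register-file and shared-memory budgets from the earlier slides, either of which can reduce the resident block count further.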