Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing

Examples: G80 and GT200
• MT unit (global block scheduler)
• TPC: texture processor cluster (a group of SMs that share the same texture unit)
• Two GPU generations are shown: G80 and GT200
[Figure: block diagrams of G80 and GT200, each with a global block scheduler feeding TPCs that contain SM controllers and SMs]
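The scheduler/TPC/SM grouping can be made concrete with a small sketch. The cluster sizes used here (8 TPCs of 2 SMs for G80, 10 TPCs of 3 SMs for GT200, 8 scalar cores per SM) are not stated on the slide; they are assumptions taken from NVIDIA's public descriptions of these chips, and they reproduce the 128 and 240 CUDA cores quoted for G80 and GT200. The names `GpuLayout` and `distribute_blocks` are hypothetical, and the round-robin policy is a simplification: a real global block scheduler hands out blocks dynamically as SMs free up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    """Hypothetical description of a GPU as TPCs -> SMs -> scalar cores."""
    name: str
    tpcs: int          # texture processor clusters on the chip
    sms_per_tpc: int   # SMs sharing one texture unit
    sps_per_sm: int    # scalar cores per SM

    @property
    def total_sms(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def total_cores(self) -> int:
        return self.total_sms * self.sps_per_sm


# Assumed cluster sizes (not given on the slide).
G80 = GpuLayout("G80", tpcs=8, sms_per_tpc=2, sps_per_sm=8)
GT200 = GpuLayout("GT200", tpcs=10, sms_per_tpc=3, sps_per_sm=8)


def distribute_blocks(layout: GpuLayout, num_blocks: int) -> dict:
    """Assign thread blocks to (tpc, sm) slots in simple round-robin order.

    This static policy is an illustration only, not the hardware algorithm.
    """
    assignment = {}
    for block in range(num_blocks):
        sm = block % layout.total_sms
        tpc, local_sm = divmod(sm, layout.sms_per_tpc)
        assignment.setdefault((tpc, local_sm), []).append(block)
    return assignment


if __name__ == "__main__":
    for gpu in (G80, GT200):
        print(gpu.name, gpu.total_sms, "SMs,", gpu.total_cores, "cores")
```

Calling `distribute_blocks(G80, 32)` places two blocks on each of the 16 SMs; under this toy policy the SMs of one TPC receive consecutive blocks, so those blocks would share that TPC's texture unit.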

Examples: GT300
• GT300 (Fermi)
• Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
[Figure: Fermi Streaming Multiprocessor (SM), showing the instruction cache, two warp schedulers with their dispatch units, the register file (32,768 x 32-bit), CUDA cores (each with a dispatch port, FP unit, INT unit, and result queue), SFUs, and 64 KB of shared memory / L1 cache]
Source: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Comparison
• G80 vs. GT200 vs. GT300

| GPU | G80 | GT200 | Fermi |
|---|---|---|---|
| Transistors | 681 million | 1.4 billion | 3.0 billion |
| CUDA cores | 128 | 240 | 512 |
| Double-precision floating-point capability | None | 30 FMA ops/clock | 256 FMA ops/clock |
| Single-precision floating-point capability | 128 MAD ops/clock | 240 MAD ops/clock | 512 FMA ops/clock |
| Warp schedulers (per SM) | 1 | 1 | 2 |
| Special function units (SFUs) per SM | 2 | 2 | 4 |
| Shared memory (per SM) | 16 KB | 16 KB | Configurable 48 KB or 16 KB |
| L1 cache (per SM) | None | None | Configurable 16 KB or 48 KB |
| L2 cache (shared by all SMs) | None | None | 768 KB |
| ECC memory support | No | No | Yes |
| Concurrent kernels | No | No | Up to 16 |
| Load/store address width | 32-bit | 32-bit | 64-bit |

Source: http://www.dvhardware.net/article38173.html
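A few ratios make the generational trend in the comparison easier to read. The sketch below uses only the per-clock figures from the comparison; the dictionary layout and the helper names `dp_to_sp_ratio` and `speedup` are hypothetical. Because clock frequencies are not part of the comparison, these are per-clock ratios and not absolute throughput, and the MAD and FMA counts are compared simply as operations per clock.

```python
from fractions import Fraction

# Per-clock figures copied from the G80 / GT200 / Fermi comparison.
TABLE = {
    "G80":   {"cores": 128, "sp_ops": 128, "dp_ops": 0},
    "GT200": {"cores": 240, "sp_ops": 240, "dp_ops": 30},
    "Fermi": {"cores": 512, "sp_ops": 512, "dp_ops": 256},
}


def dp_to_sp_ratio(gpu: str) -> Fraction:
    """Double-precision ops per clock relative to single-precision ops per clock."""
    row = TABLE[gpu]
    return Fraction(row["dp_ops"], row["sp_ops"])


def speedup(newer: str, older: str, key: str):
    """Per-clock ratio newer/older for one table row; None if the older value is 0."""
    old = TABLE[older][key]
    if old == 0:
        return None  # e.g. G80 has no double-precision support at all
    return Fraction(TABLE[newer][key], old)


if __name__ == "__main__":
    for gpu in TABLE:
        print(gpu, "DP:SP per clock =", dp_to_sp_ratio(gpu))
    print("Fermi vs GT200, DP per clock:", float(speedup("Fermi", "GT200", "dp_ops")))
```

Under these figures, double precision moves from absent (G80) to 1/8 of the single-precision rate (GT200) to 1/2 (Fermi), which is a larger jump than the roughly 2x growth in core count between GT200 and Fermi.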

Example: GK110 (Kepler Architecture)
• More power efficient than Fermi
• New SM architecture (SMX)
• Revamped memory architecture
• Hardware support for new programming models
• Capable of Dynamic Parallelism
[Figure: GK110 block diagram (host interface, SMX units, L2 cache) and a Fermi SM vs. Kepler SMX comparison labeled "Kepler: Fast & Efficient", "3x Perf/Watt", and "192 cores"]
Source: http://www.nvidia.com/content/PDF/kepler/NVIDIA-KeplerGK110-Architecture-Whitepaper.pdf

Basic GPGPU Processor Pipeline
• Simple in-order execution in SIMT
  – Single instruction, multiple threads
• The scheduler chooses one of several warps (each with its own PC)
• Fetches one instruction per warp from the I$
• Decodes the instruction, reads the registers, and dispatches
  – The scoreboard maintains dependencies
• A multi-ported register file provides data for all lanes
• Numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds)
• Writeback updates the register file
[Figure: pipeline stages from top to bottom: Schedule Warp and Fetch Instruction (I-cache), Decode + I-Buffer and Scoreboard, Issue Instruction, Register File, execution units (Integer ALU, Float ALU, Special Functions, Load/Store Unit), Register Write Back; shared memory, the D-cache, and off-chip DRAM appear alongside the Load/Store Unit]
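The interaction between the warp scheduler, in-order issue, and the scoreboard can be illustrated with a toy cycle-level model. This is a minimal sketch under stated assumptions, not a description of any real GPU: the latencies in `LATENCY` are invented for illustration, the scheduler is plain round-robin issuing at most one warp instruction per cycle, each warp is modeled as a single unit instead of 32 lanes, and the names `Instr`, `Warp`, and `simulate` are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed latencies in cycles per execution unit (illustrative, not measured).
LATENCY = {"ALU": 1, "FPU": 2, "SFU": 4, "LDST": 8}


@dataclass
class Instr:
    unit: str           # which execution unit the instruction uses
    dst: str            # destination register
    srcs: tuple = ()    # source registers


@dataclass
class Warp:
    wid: int
    program: list
    pc: int = 0
    # Scoreboard: register -> cycle at which its pending writeback completes.
    pending: dict = field(default_factory=dict)

    def done(self) -> bool:
        return self.pc >= len(self.program)

    def ready(self, cycle: int) -> bool:
        """In-order issue: only the instruction at pc is a candidate."""
        if self.done():
            return False
        ins = self.program[self.pc]
        regs = tuple(ins.srcs) + (ins.dst,)
        return all(self.pending.get(r, 0) <= cycle for r in regs)


def simulate(warps, max_cycles=1000):
    """Return a trace of (issue_cycle, warp_id, unit) tuples."""
    trace = []
    last = -1  # index of the warp that issued most recently
    n = len(warps)
    for cycle in range(max_cycles):
        if all(w.done() for w in warps):
            break
        for i in range(1, n + 1):  # round-robin, starting after `last`
            idx = (last + i) % n
            w = warps[idx]
            if w.ready(cycle):
                ins = w.program[w.pc]
                w.pending[ins.dst] = cycle + LATENCY[ins.unit]  # mark dst busy
                w.pc += 1
                trace.append((cycle, w.wid, ins.unit))
                last = idx
                break
    return trace


def load_then_add(wid):
    """Two-instruction program: a load followed by a dependent ALU op."""
    return Warp(wid, [Instr("LDST", "r1"), Instr("ALU", "r2", ("r1",))])


if __name__ == "__main__":
    for event in simulate([load_then_add(0), load_then_add(1)]):
        print(event)
```

With two warps running `load_then_add`, the second warp's load issues in the cycle right after the first warp's load, while the first warp's dependent ALU instruction is held back by the scoreboard. This overlap of one warp's memory latency with another warp's work is the reason the scheduler keeps several warps available.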