Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing
Lecture 12: Shared-Memory Multi-Processors

Shared-Memory Multiprocessors
• Multiple threads use shared memory (address space)
  – "SysV Shared Memory" or "Threads" in software
• Communication implicit via loads and stores
  – Opposite of explicit message-passing multiprocessors
• Theoretical foundation: PRAM model
[Figure: processors P1, P2, P3, P4 connected to a shared memory system]
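To make "communication implicit via loads and stores" concrete, here is a minimal sketch using Python threads, which share one address space. The names `mailbox`, `ready`, and `run` are hypothetical and chosen only for illustration. The producer communicates by writing an ordinary variable and the consumer by reading it; no send or receive call carries the data, and the `Event` is used only for ordering.

```python
import threading

def run():
    # Shared state: both threads see the same objects (one address space).
    mailbox = [None]            # a single shared "memory location"
    ready = threading.Event()   # used only for ordering, not to carry data
    result = []

    def producer():
        mailbox[0] = 42         # plain store into shared memory
        ready.set()             # signal that the store has happened

    def consumer():
        ready.wait()            # wait until the producer has written
        result.append(mailbox[0])   # plain load from shared memory

    t_cons = threading.Thread(target=consumer)
    t_prod = threading.Thread(target=producer)
    t_cons.start()
    t_prod.start()
    t_prod.join()
    t_cons.join()
    return result[0]

if __name__ == "__main__":
    print(run())
```

In a message-passing design the value itself would travel through an explicit channel (for example a queue); here only the "it is ready" signal is explicit.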

Why Shared Memory?
• Pluses
  – App sees multitasking uniprocessor
  – OS needs only evolutionary extensions
  – Communication happens without OS
• Minuses
  – Synchronization is complex
  – Communication is implicit (hard to optimize)
  – Hard to implement (in hardware)
• Result
  – SMPs and CMPs are most successful machines to date
  – First with multi-billion-dollar markets
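The "synchronization is complex" minus can be illustrated with a small sketch: several threads update one shared counter, and each read-modify-write has to be protected by a lock so that no update is lost. The function name `count_with_lock` and its parameters are hypothetical, and this is one possible way to synchronize, not the method the lecture prescribes.

```python
import threading

def count_with_lock(num_threads=4, increments=10_000):
    counter = 0                 # shared variable updated by every thread
    lock = threading.Lock()     # guards the read-modify-write on counter

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # without the lock, concurrent updates could be lost
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

if __name__ == "__main__":
    print(count_with_lock())
```

The application code looks almost like a uniprocessor program (the plus), but the programmer still has to find every shared update and protect it (the minus).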

Paired vs. Separate Processor/Memory?
• Separate CPU/memory
  – Uniform memory access (UMA)
    • Equal latency to memory
  – Low peak performance
• Paired CPU/memory
  – Non-uniform memory access (NUMA)
    • Faster local memory
    • Data placement matters
  – High peak performance
[Figure: left, CPUs with caches ($) connected to separate memories (UMA); right, each CPU($) paired with a local memory and a router R (NUMA)]
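Why "data placement matters" under NUMA can be seen with a simple weighted-average model. The sketch below is illustrative only: the function name and the default latencies (80 ns local, 200 ns remote) are assumed values, not figures from the lecture.

```python
def average_latency_ns(local_fraction, local_ns=80.0, remote_ns=200.0):
    """Average memory latency when a fraction of accesses hit local memory.

    local_ns and remote_ns are hypothetical latencies chosen for illustration.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

if __name__ == "__main__":
    for fraction in (0.0, 0.5, 0.9, 1.0):
        print(fraction, average_latency_ns(fraction))
```

With good placement (most accesses local) the average approaches the fast local latency; with poor placement it approaches the remote latency. In a UMA machine the two latencies are equal, so placement has no effect in this model.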

Shared vs. Point-to-Point Networks
• Shared network
  – Example: bus
  – Low latency
  – Low bandwidth
    • Doesn't scale >~16 cores
  – Simple cache coherence
• Point-to-point network
  – Example: mesh, ring
  – High latency (many "hops")
  – Higher bandwidth
    • Scales to 1000s of cores
  – Complex cache coherence
[Figure: left, CPU($)/Mem nodes on a shared bus; right, CPU($)/Mem nodes each attached to a router R in a point-to-point network]
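The "many hops" latency cost of point-to-point networks can be estimated from the worst-case hop count (network diameter). The sketch below assumes a bidirectional ring and a square 2D mesh without wraparound links, and treats a bus as a single hop; these topology details and the function names are assumptions for illustration, not taken from the slides.

```python
import math

def bus_max_hops(n):
    # Modeling assumption: every node reaches every other in one bus transaction.
    return 1

def ring_max_hops(n):
    # Bidirectional ring: the farthest node is halfway around.
    return n // 2

def mesh_max_hops(n):
    # Square k x k mesh without wraparound: corner to opposite corner.
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("n must be a perfect square for a square mesh")
    return 2 * (k - 1)

if __name__ == "__main__":
    for n in (16, 64, 1024):
        print(n, bus_max_hops(n), ring_max_hops(n), mesh_max_hops(n))
```

The hop count grows with node count for the ring and mesh, which is the latency cost; the bus stays at one hop but all nodes contend for that single shared medium, which is why its bandwidth does not scale.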