Computer Architecture

Caveats of Parallelism (II)

• Amdahl's Law
  – f: Parallelizable fraction of a program
  – N: Number of processors

        Speedup = 1 / ((1 - f) + f/N)

  – Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
• Maximum speedup limited by serial portion: Serial bottleneck
• Parallel portion is usually not perfectly parallel
  – Synchronization overhead (e.g., updates to shared data)
  – Load imbalance overhead (imperfect parallelization)
  – Resource sharing overhead (contention among N processors)
Sequential Bottleneck

[Figure: Speedup (0 to 200) vs. f, the parallel fraction (0 to 1), plotted for N = 10, N = 100, and N = 1000.]
Why the Sequential Bottleneck?

• Parallel machines still have the sequential bottleneck
• Main cause: non-parallelizable operations on data (e.g., non-parallelizable loops)

        for (i = 1; i < N; i++)
            A[i] = (A[i] + A[i-1]) / 2;

• Single thread prepares data and spawns parallel tasks (usually sequential)
Another Example of Sequential Bottleneck

    InitPriorityQueue(PQ);                  // A
    SpawnThreads();
    ForEach Thread:
        while (problem not solved)
            Lock(X)
            SubProblem = PQ.remove();       // C1
            Unlock(X);
            Solve(SubProblem);              // B
            If (problem solved) break;
            NewSubProblems = Partition(SubProblem);
            Lock(X)
            PQ.insert(NewSubProblems);      // C2
            Unlock(X)
    PrintSolution();                        // E

LEGEND
  A, E: Amdahl's serial part
  B: Parallel portion
  C1, C2: Critical sections
  D: Outside critical section (D1, D2)
Bottlenecks in Parallel Portion

• Synchronization: Operations manipulating shared data cannot be parallelized
  – Locks, mutual exclusion, barrier synchronization
  – Communication: Tasks may need values from each other
  – Causes thread serialization when shared data is contended
• Load Imbalance: Parallel tasks may have different lengths
  – Due to imperfect parallelization or microarchitectural effects
  – Reduces speedup in parallel portion
• Resource Contention: Parallel tasks can share hardware resources, delaying each other
  – Replicating all resources (e.g., memory) expensive
  – Additional latency not present when each task runs alone