State Machine for 2-bit Saturating CounterCounterusing saturatingarithmeticThereisasymbolformaximumandminimumvaluesactuallyactuallytakenpredpred!takentakentaken1011actuallytakenactuallyactuallytaken!takenactuallypredpred!taken!taken!takenactually0100actually!takentakenComputerArchitecture21
Computer Architecture State Machine for 2-bit Saturating Counter • Counter using saturating arithmetic – There is a symbol for maximum and minimum values pred taken 11 pred taken 10 pred !taken 01 pred !taken 00 actually taken actually taken actually !taken actually !taken actually !taken actually !taken actually taken actually taken 21
Hysteresis Using a 2-bit Counteractuallyactuallyweakly!takentakentaken“"stronglypredpredtakentakentakenactuallytakenactuallyactually!takentaken"stronglyactually!taken!takenpredpred"weakly!taken!takenactuallyOO7!taken!takenactuallytakenChangepredictionafter2consecutivemistakesComputerArchitecture22
Computer Architecture Hysteresis Using a 2-bit Counter 22 pred taken pred taken pred !taken pred !taken actually taken actually taken actually !taken actually !taken actually !taken actually !taken actually taken actually taken Change predicLon a_er 2 consecuLve mistakes “weakly taken” “strongly taken” “weakly !taken” “strongly !taken
Is This Enough?: ~85-90% accuracy for many programs with 2-bitcounter based prediction (also called bimodalprediction): Is this good enough?. How big is the branch problem?ComputerArchitecture23
Computer Architecture Is This Enough? • ~85-90% accuracy for many programs with 2-bit counter based prediction (also called bimodal prediction) • Is this good enough? • How big is the branch problem? 23
Rethinking the The Branch Problem: Control flow instructions (branches) are frequent15-25%ofallinstructionsProblem:Nextfetchaddressafteracontrol-flowinstruction is not determined after N cycles in apipelinedprocessor N cycles: (minimum) branch resolution latency- Stalling on a branch wastes instruction processing bandwidth(i.e. reduces IPC): NxIW instruction slots are wasted (IW:issue width)How do wekeep the pipeline full after a branch?Problem:Need to determine the next fetch addresswhenthebranchisfetched(toavoidapipelinebubble)ComputerArchitecture24
Computer Architecture Rethinking the The Branch Problem • Control flow instructions (branches) are frequent – 15-25% of all instructions • Problem: Next fetch address after a control-flow instruction is not determined after N cycles in a pipelined processor – N cycles: (minimum) branch resolution latency – Stalling on a branch wastes instruction processing bandwidth (i.e. reduces IPC) • N x IW instruction slots are wasted (IW: issue width) • How do we keep the pipeline full after a branch? • Problem: Need to determine the next fetch address when the branch is fetched (to avoid a pipeline bubble) 24
Importance of The Branch ProblemAssumea5-widesuperscalarpipelinewith20-cyclebranchresolutionlatencyHow long does it take to fetch 5oo instructions?-Assumenofetchbreaksand1outof5instructionsisabranch-100%accuracy.1o0cycles(all instructionsfetchedonthecorrectpath).Nowastedwork- 99% accuracy.100(correctpath)+20(wrongpath)=120cycles20%extrainstructionsfetched-98%accuracy:100 (correct path)+20*2(wrongpath)= 140 cycles40%extrainstructionsfetched-95%accuracy。100 (correct path)+ 20 * 5 (wrong path) = 200 cycles100%extrainstructionsfetchedComputerArchitecture25
Computer Architecture Importance of The Branch Problem • Assume a 5-wide superscalar pipeline with 20-cycle branch resolution latency • How long does it take to fetch 500 instructions? – Assume no fetch breaks and 1 out of 5 instructions is a branch – 100% accuracy • 100 cycles (all instructions fetched on the correct path) • No wasted work – 99% accuracy • 100 (correct path) + 20 (wrong path) = 120 cycles • 20% extra instructions fetched – 98% accuracy • 100 (correct path) + 20 * 2 (wrong path) = 140 cycles • 40% extra instructions fetched – 95% accuracy • 100 (correct path) + 20 * 5 (wrong path) = 200 cycles • 100% extra instructions fetched 25