Part.1, Qiang Xu, CUHK, Fall 2012

Seriously, Why Is Fault-Tolerance Coming Back?
◆ Simply put, the comeback is technology-driven.
◆ Today's chips are extremely complex (billions of transistors running with less noise margin) and run much hotter.
◆ We cannot afford heavyweight, macro-scale redundancy for commodity computing systems.
[Figure: cost versus time under technology scaling, with curves for transistor cost, reliability cost, and total cost; shown next to a process roadmap for 2005 to 2011 that includes the 32nm and 22nm processes.]

The Impact of Technology Scaling
◆ More leakage
◆ More process variability
◆ Smaller critical charges
◆ Weaker transistors and wires
[Figure: bathtub curve of observed failure rate versus time, with three regions: decreasing failure rate (infant mortality failures), constant failure rate (random failures), and increasing failure rate (wear-out failures). Annotations for scaled technologies: burn-in test less effective, higher random failure rate, faster wear-out.]

What Can We Do when Confronting Enemies?
◆ Surrender, but don't become a traitor
  » Fail, but fail safely, i.e., don't corrupt anything (e.g., an ATM)
  » Not as easy as you may think: you have to detect faults!
◆ Weaken the enemies
  Fault-avoidance and fault-removal
  » Process improvement to reduce threats
  » Testing and DfT (design for testability) to remove defective circuits
  » Careful design reviews to remove design bugs
  » More training to reduce operator errors
  There are always some faults that cannot be avoided or removed completely
◆ Make yourself stronger
  Fault-tolerance
  » Adding redundancy to detect, diagnose, confine, mask, compensate for, and recover from faults
  » Mind the cost in terms of hardware, power, and performance
  Fault-evasion (a.k.a. fault-prediction)
  » Observe, learn, and take pre-emptive steps to stop faults from occurring

A Motivating Case Study
◆ Data availability and integrity concerns
  » Distributed DB system with 5 sites
  » Full connectivity, dedicated links
  » Only direct communication allowed
  » Sites and links may malfunction
  » Redundancy improves availability
◆ Notation
  » $S$: probability of a site being available
  » $L$: probability of a link being available
◆ Single-copy availability $= SL$
  » Unavailability $= 1 - SL = 1 - 0.99 \times 0.95 = 5.95\%$
◆ Data replication methods, and a challenge
  » File duplication: home / mirror sites
  » File triplication: home / backup 1 / backup 2
  » Are there availability improvement methods with less redundancy?
[Figure: five sites S0 to S4, fully connected by ten dedicated links L0 to L9, with the user and the file $F_i$ marked.]
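
The single-copy figure can be checked with a few lines of Python. This is only a sketch: the function name `single_copy_availability` is made up for illustration, and the product form assumes that site and link failures are independent, which is what the slide's $SL$ expression implies.

```python
def single_copy_availability(site: float, link: float) -> float:
    """Probability that the file is reachable when only one copy exists.

    Assumption (implied by the slide's S*L product): the site holding the
    file and the direct link to it fail independently of each other.
    """
    return site * link


# Example values taken from the slides: S = 99%, L = 95%.
S = 0.99
L = 0.95

availability = single_copy_availability(S, L)
unavailability = 1.0 - availability

print(f"single-copy availability:   {availability:.2%}")
print(f"single-copy unavailability: {unavailability:.2%}")
```

With the slide's values this should print roughly 94.05% and 5.95%, matching the unavailability quoted above.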

Data Duplication: Home and Mirror Sites
◆ Notation
  » $S$: site availability, e.g., 99%
  » $L$: link availability, e.g., 95%
◆ $A = SL + (1 - SL)\,SL$
  » First term $SL$: primary site can be reached
  » Second term $(1 - SL)\,SL$: primary site inaccessible, mirror site can be reached
◆ Duplicated availability $= 2SL - (SL)^2$
  » Unavailability $= 1 - 2SL + (SL)^2 = (1 - SL)^2 = 0.35\%$
◆ Data unavailability reduced from 5.95% to 0.35%
◆ Availability improved from about 94% to 99.65%
[Figure: the same five-site network (S0 to S4, links L0 to L9), with the home copy and the mirror copy of $F_i$ placed at two different sites and the user marked.]
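
As a sanity check on the duplication formula, the sketch below computes the availability two ways: by the slide's case analysis (primary reachable, or primary unreachable and mirror reachable) and by the closed form $1 - (1 - SL)^2$. The function names and the `copies` parameter are hypothetical, introduced only for illustration; the code assumes each copy sits on a different site, is reached over its own direct link, and fails independently of the others.

```python
def duplicated_by_cases(site: float, link: float) -> float:
    """Slide's case analysis: A = SL + (1 - SL) * SL."""
    sl = site * link
    primary_reachable = sl
    mirror_saves_us = (1.0 - sl) * sl  # primary inaccessible, mirror reachable
    return primary_reachable + mirror_saves_us


def replicated_availability(site: float, link: float, copies: int = 2) -> float:
    """Closed form 1 - (1 - SL)^copies.

    Assumption (not stated on the slide for copies > 2): every copy is on a
    distinct site with its own direct link, and all failures are independent.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - site * link) ** copies


# Example values taken from the slides: S = 99%, L = 95%.
S = 0.99
L = 0.95

by_cases = duplicated_by_cases(S, L)
closed_form = replicated_availability(S, L, copies=2)

print(f"duplicated availability (case analysis): {by_cases:.2%}")
print(f"duplicated availability (closed form):   {closed_form:.2%}")
print(f"duplicated unavailability:               {1.0 - closed_form:.2%}")
```

Both routes should agree at about 99.65% availability, i.e. 0.35% unavailability, in line with the slide. Setting `copies=1` reduces the closed form to the single-copy value $SL$.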