Garofalakis, Gehrke, Rastogi, VLDB'02 # 11

Probabilistic Guarantees
• Example: Actual answer is within 5 ± 1 with prob 0.9
• Use Tail Inequalities to give probabilistic bounds on returned answer
  – Markov Inequality
  – Chebyshev's Inequality
  – Hoeffding's Inequality
  – Chernoff Bound
Garofalakis, Gehrke, Rastogi, VLDB'02 # 12

Tail Inequalities
• General bounds on tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
[Figure: probability distribution with shaded tail probability]
• Basic Inequalities: Let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:
  – Markov:    Pr(X >= ε) <= μ / ε
  – Chebyshev: Pr(|X - μ| >= ε) <= Var[X] / ε²
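The two basic inequalities above are easy to evaluate numerically. A minimal sketch (the function names and the example mean/variance values are illustrative, not taken from the slides):

```python
def markov_bound(mu, eps):
    """Markov: for a nonnegative X with E[X] = mu, Pr(X >= eps) <= mu / eps."""
    return min(1.0, mu / eps)  # clamp: a probability bound above 1 is vacuous

def chebyshev_bound(var, eps):
    """Chebyshev: Pr(|X - mu| >= eps) <= Var[X] / eps**2."""
    return min(1.0, var / eps ** 2)

# Hypothetical X with mean 5 and variance 1:
print(markov_bound(5, 10))    # Pr(X >= 10) is at most 0.5
print(chebyshev_bound(1, 2))  # Pr(|X - 5| >= 2) is at most 0.25
```

Note that Chebyshev uses the variance and bounds two-sided deviation from the mean, which is why it is usually much tighter than Markov for concentrated distributions.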
Garofalakis, Gehrke, Rastogi, VLDB'02 # 13

Tail Inequalities for Sums
• Possible to derive stronger bounds on tail probabilities for the sum of independent random variables
• Hoeffding's Inequality: Let X1, ..., Xm be independent random variables with 0 <= Xi <= r. Let X = (1/m) Σi Xi and μ be the expectation of X. Then, for any ε > 0,
    Pr(|X - μ| >= ε) <= 2 exp(-2mε² / r²)
• Application to avg queries:
  – m is size of subset of sample S satisfying predicate (3 in example)
  – r is range of element values in sample (8 in example)
• Application to count queries:
  – m is size of sample S (4 in example)
  – r is number of elements n in stream (12 in example)
• More details in [HHW97]
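The Hoeffding bound above can be evaluated for the slide's example parameters; the error tolerance ε in the calls below is hypothetical, chosen only to make the bound non-vacuous for such tiny samples:

```python
import math

def hoeffding_bound(m, r, eps):
    """Hoeffding: Pr(|X - mu| >= eps) <= 2 exp(-2 m eps**2 / r**2),
    for the average X of m independent variables in [0, r]."""
    return min(1.0, 2 * math.exp(-2 * m * eps ** 2 / r ** 2))

# Avg-query setting from the slide: m = 3 sampled elements satisfy the
# predicate, values range over r = 8; eps = 4 is a hypothetical tolerance.
print(hoeffding_bound(3, 8, eps=4.0))

# Count-query setting from the slide: m = 4, r = n = 12; eps = 6 hypothetical.
print(hoeffding_bound(4, 12, eps=6.0))
```

With m this small the bound is close to 1 unless ε is a large fraction of r, which is exactly why larger samples are needed for tight guarantees: the failure probability decays exponentially in m.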
Garofalakis, Gehrke, Rastogi, VLDB'02 # 14

Tail Inequalities for Sums (Contd.)
• Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
• Chernoff Bound: Let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (Pr[Xi = 0] = 1 - p). Let X = Σi Xi and μ = mp be the expectation of X. Then, for any ε > 0,
    Pr(|X - μ| >= εμ) <= 2 exp(-με² / 2)
• Application to count queries:
  – m is size of sample S (4 in example)
  – p is fraction of odd elements in stream (2/3 in example)
• Remark: Chernoff bound results in tighter bounds for count queries compared to Hoeffding's inequality
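The Chernoff bound can likewise be evaluated for the slide's count-query example. A minimal sketch, assuming the standard two-sided form Pr(|X - μ| >= εμ) <= 2 exp(-με²/2) reconstructed above; the relative error ε = 1 in the call is a hypothetical choice, not from the slides:

```python
import math

def chernoff_bound(m, p, eps):
    """Chernoff: for X the sum of m Bernoulli(p) trials with mu = m * p,
    Pr(|X - mu| >= eps * mu) <= 2 exp(-mu * eps**2 / 2)."""
    mu = m * p
    return min(1.0, 2 * math.exp(-mu * eps ** 2 / 2))

# Count-query setting from the slide: sample of m = 4 elements, a fraction
# p = 2/3 of the stream elements are odd; eps is a hypothetical relative error.
print(chernoff_bound(4, 2 / 3, eps=1.0))
```

The key structural difference from Hoeffding is that the deviation is measured relative to μ and the exponent scales with μ = mp rather than with the worst-case range r, which is what yields the tighter count-query bounds the slide remarks on.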
Garofalakis, Gehrke, Rastogi, VLDB'02 # 15

Computing Stream Sample
• Reservoir Sampling [Vit85]: Maintains a sample S of a fixed size M
  – Add each new element to S with probability M/n, where n is the current number of stream elements
  – If an element is added, evict a random element from S
  – Instead of flipping a coin for each element, determine the number of elements to skip before the next one to be added to S
• Concise Sampling [GM98]: Duplicates in sample S stored as <value, count> pairs (thus, potentially boosting actual sample size)
  – Add each new element to S with probability 1/T (simply increment count if element already in S)
  – If sample size exceeds M:
    • Select new threshold T' > T
    • Evict each element (decrement count) from S with probability 1 - T/T'
  – Add subsequent elements to S with probability 1/T'
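The basic coin-flip variant of reservoir sampling described above (without the skip-counting optimization of [Vit85]) can be sketched as follows; the function name and signature are illustrative:

```python
import random

def reservoir_sample(stream, M, rng=random):
    """Maintain a uniform sample S of fixed size M over a one-pass stream.
    The n-th element (1-based) enters S with probability M/n, evicting a
    uniformly random current member."""
    S = []
    for n, x in enumerate(stream, start=1):
        if n <= M:
            S.append(x)           # first M elements fill the reservoir
        else:
            j = rng.randrange(n)  # uniform in [0, n)
            if j < M:             # happens with probability M/n
                S[j] = x          # evict a random element, add the new one
    return S

sample = reservoir_sample(range(1000), M=10)
print(len(sample))  # always 10 once the stream has at least M elements
```

Picking one random index j serves double duty: the event j < M is the M/n admission coin flip, and conditioned on admission, j is the uniformly chosen victim slot.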