Counting Samples Gm98 Effective for answering hot list queries(k most frequent values Sample s is a set of <value, count> pairs For each new stream element If element value in s, increment its count Otherwise, add to S with probability 1/T If size of sample s exceeds M, select new thresholds>t For each value(with count C)in S, decrement count in repeated tries until c tries or a try in which count is not decremented First try, decrement count with probability 1-T/T Subsequent tries, decrement count with probability 1-1/T Sub ject each subsequent stream element to higher threshold t' Estimate of frequency for value in S: count in S+0.418*T Garofalakis, Gehrke, Rastogi, VLDB02 #16
Garofalakis, Gehrke, Rastogi, VLDB’02 # 16 Counting Samples [GM98] • Effective for answering hot list queries (k most frequent values) – Sample S is a set of <value, count> pairs – For each new stream element • If element value in S, increment its count • Otherwise, add to S with probability 1/T – If size of sample S exceeds M, select new threshold T’ > T • For each value (with count C) in S, decrement count in repeated tries until C tries or a try in which count is not decremented – First try, decrement count with probability 1- T/T’ – Subsequent tries, decrement count with probability 1-1/T’ – Subject each subsequent stream element to higher threshold T’ • Estimate of frequency for value in S: count in S + 0.418*T
Histograms Histograms approximate the frequency distribution of element values in a stream A histogram(typically) consists of A partitioning of element domain values into buckets A count Cb per bucket b(of the number of elements in B) Long history of use for selectivity estimation within a query optimizer [Koo80][PSC84],etc PIH96][Poo97] introduced a taxonomy, algorithms, etc Garofalakis, Gehrke, Rastogi, VLDB02 #17
Garofalakis, Gehrke, Rastogi, VLDB’02 # 17 Histograms • Histograms approximate the frequency distribution of element values in a stream • A histogram (typically) consists of – A partitioning of element domain values into buckets – A count per bucket B (of the number of elements in B) • Long history of use for selectivity estimation within a query optimizer [Koo80], [PSC84], etc. • [PIH96] [Poo97] introduced a taxonomy, algorithms, etc. CB
Types of histograms Equi-Depth histograms Idea: Select buckets such that counts per bucket are equal Count for bucket 1234567891011121314151617181920 Domain values V-Optimal Histograms [IP95][JKM98 Idea: Select buckets to minimize frequency variance within buckets minimize ∑。∑=2 B Count for bucket 1234567891011121314151617181920 Domain values Garofalakis, Gehrke, Rastogi, VLDB02 #18
Garofalakis, Gehrke, Rastogi, VLDB’02 # 18 Types of Histograms • Equi-Depth Histograms – Idea: Select buckets such that counts per bucket are equal • V-Optimal Histograms [IP95] [JKM98] – Idea: Select buckets to minimize frequency variance within buckets Count for bucket 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Domain values Count for bucket 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Domain values 2 minimize ( ) B B B v B v V C f −
Answering queries using histograms mn temebegeO Coe P99 (Implicitly)map the histogram back to an approximate relation, apply the query to the approximate relation Example: select count(*)from r where 4<Re<= 15 Count spread evenly among 1234567891011121314151617181920 bucket values 4≤Re≤15 answer: 3.5* CB For equi-depth histograms,ma× mum error:±2*CB Garofalakis, Gehrke, Rastogi, VLDB02 #19
Garofalakis, Gehrke, Rastogi, VLDB’02 # 19 Answering Queries using Histograms [IP99] • (Implicitly) map the histogram back to an approximate relation, & apply the query to the approximate relation • Example: select count(*) from R where 4 <= R.e <= 15 • For equi-depth histograms, maximum error: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Count spread evenly among bucket values 4 R.e 15 answer: 3.5 * CB 2*CB
Equi-Depth Histogram Construction "um stm:O o For histogram with b buckets, compute elements with rank n/b, 2n/b (b-1)n/b Example: (n=12, b=4) Data stream: 9 3 5 27 1658491 After sort12|345567899 rank 3 rank= 9 ( 25-quantile)I(75-quantile) rank =6 (.5-quantile) Garofalakis, Gehrke, Rastogi, VLDB02 #20
Garofalakis, Gehrke, Rastogi, VLDB’02 # 20 Equi-Depth Histogram Construction • For histogram with b buckets, compute elements with rank n/b, 2n/b, ..., (b-1)n/b • Example: (n=12, b=4) Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 After sort: 1 1 2 3 4 5 5 6 7 8 9 9 rank = 3 (.25-quantile) rank = 6 (.5-quantile) rank = 9 (.75-quantile)