IP Network Measurement Data O Po IP session data (collected using Cisco NetFlow Source Destination Duration Bytes Protocol 10.1.0.2 16.2.3.7 12 20K http 18.6.7.1 124.0.3 24K http 13.94.3 116.8.2 15 20K 152.2.9 17.1.2.1 9 40K http 12.4.38 14.8.7.4 26 58K http 10.5.1.3 13.0.0.1 27 100K ftp 11.1.0.6 10.3.4.5 32 300K ftp 19.7.1.2 16.5.5.8 18 80K ft a t&t collects 100 GBs of net flow data each day! Garofalakis, Gehrke, Rastogi, VLDB02 #6
Garofalakis, Gehrke, Rastogi, VLDB’02 # 6 IP Network Measurement Data • IP session data (collected using Cisco NetFlow) • AT&T collects 100 GBs of NetFlow data each day! Source Destination Duration Bytes Protocol 10.1.0.2 16.2.3.7 12 20K http 18.6.7.1 12.4.0.3 16 24K http 13.9.4.3 11.6.8.2 15 20K http 15.2.2.9 17.1.2.1 19 40K http 12.4.3.8 14.8.7.4 26 58K http 10.5.1.3 13.0.0.1 27 100K ftp 11.1.0.6 10.3.4.5 32 300K ftp 19.7.1.2 16.5.5.8 18 80K ftp
Network Data Processing Traffic estimation How many bytes were sent between a pair of Ip addresses? What fraction network ip addresses are active? List the top 100 iP addresses in terms of traffic Traffic analysis What is the average duration of an IP session? What is the median of the number of bytes in each IP session? raud detection List all sessions that transmitted more than 1000 bytes Identify all sessions whose duration was more than twice the normal Security/Denial of Service List all IP addresses that have witnessed a sudden spike in traffic Identify ip addresses involved in more than 1000 sessions Garofalakis, Gehrke, Rastogi, VLDB'02 #7
Garofalakis, Gehrke, Rastogi, VLDB’02 # 7 Network Data Processing • Traffic estimation – How many bytes were sent between a pair of IP addresses? – What fraction network IP addresses are active? – List the top 100 IP addresses in terms of traffic • Traffic analysis – What is the average duration of an IP session? – What is the median of the number of bytes in each IP session? • Fraud detection – List all sessions that transmitted more than 1000 bytes – Identify all sessions whose duration was more than twice the normal • Security/Denial of Service – List all IP addresses that have witnessed a sudden spike in traffic – Identify IP addresses involved in more than 1000 sessions
Data Stream Processing Algorithms Generally, algorithms compute approximate answers Difficult to compute answers accurately with limited memory Approximate answers-Deterministic bounds Algorithms only compute an approximate answer, but bounds on error Approximate answers-Probabilistic bounds Algorithms compute an approximate answer with high probability With probability at least 1-8, the computed answer is within a factor f of the actual answer Single-pass algorithms for processing streams also pplicable to(massive) terabyte databases Garofalakis, Gehrke, Rastogi, VLDB02 #8
Garofalakis, Gehrke, Rastogi, VLDB’02 # 8 Data Stream Processing Algorithms • Generally, algorithms compute approximate answers – Difficult to compute answers accurately with limited memory • Approximate answers - Deterministic bounds – Algorithms only compute an approximate answer, but bounds on error • Approximate answers - Probabilistic bounds – Algorithms compute an approximate answer with high probability • With probability at least , the computed answer is within a factor of the actual answer • Single-pass algorithms for processing streams also applicable to (massive) terabyte databases! 1−
Ou uTIne Introduction Motivation Basic stream synopses computation Samples: Answering queries using samples, Reservoir sampling Histograms: Equi-depth histograms, On-line quantile computation Wavelets: Haar-wavelet histogram construction maintenance Mining data streams Sketch-based computation techniques Advanced techniques Future directions conclusions Garofalakis, Gehrke, Rastogi, VLDB02 #9
Garofalakis, Gehrke, Rastogi, VLDB’02 # 9 Outline • Introduction & Motivation • Basic stream synopses computation – Samples: Answering queries using samples, Reservoir sampling – Histograms: Equi-depth histograms, On-line quantile computation – Wavelets: Haar-wavelet histogram construction & maintenance • Mining data streams • Sketch-based computation techniques • Advanced techniques • Future directions & Conclusions
Sampling: Basics Idea: A small random sample s of the data often well represents all the data For a fast approx answer, apply"modified"query to s Example: select ggq from r where R e is odd Data stream: 9 3 5 2 7 165849 Sample s: 9 5 1 8 If agg is avg, return average of odd elements in s answer: 5 If agg is count, return average over all elements e in S of n if e is odd answer:12*3/4=9 o if e is even Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i. e, the expected value of the answer is the actual answer Garofalakis, Gehrke, Rastogi, VLDB02 #10
Garofalakis, Gehrke, Rastogi, VLDB’02 # 10 Sampling: Basics • Idea: A small random sample S of the data often wellrepresents all the data – For a fast approx answer, apply “modified” query to S – Example: select agg from R where R.e is odd (n=12) – If agg is avg, return average of odd elements in S – If agg is count, return average over all elements e in S of • n if e is odd • 0 if e is even Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 Sample S: 9 5 1 8 answer: 5 answer: 12*3/4 =9