Lucent Technologies: Querying and Mining Data Streams: You Only Get One Look

• Introduction & Motivation
  – Stream computation model, Applications
• Basic stream synopses computation
  – Samples, Equi-depth histograms, Wavelets
• Mining data streams
  – Decision trees, clustering, association rules
• Sketch-based computation techniques
  – Self-joins, Joins, Wavelets, V-optimal histograms
• Advanced techniques
  – Sliding windows, Distinct values, Hot lists
• Future directions & Conclusions

Garofalakis, Gehrke, Rastogi, VLDB'02

IP Network Measurement Data
• IP session data (collected using Cisco NetFlow)
• AT&T collects 100 GBs of NetFlow data each day!

Source      Destination   Duration   Bytes   Protocol
10.1.0.2    16.2.3.7      12         20K     http
18.6.7.1    12.4.0.3      16         24K     http
13.9.4.3    11.6.8.2      15         20K     http
15.2.2.9    17.1.2.1      19         40K     http
12.4.3.8    14.8.7.4      26         58K     http
10.5.1.3    13.0.0.1      27         100K    ftp
11.1.0.6    10.3.4.5      32         300K    ftp
19.7.1.2    16.5.5.8      18         80K     ftp

Network Data Processing
• Traffic estimation
  – How many bytes were sent between a pair of IP addresses?
  – What fraction of network IP addresses are active?
  – List the top 100 IP addresses in terms of traffic
• Traffic analysis
  – What is the average duration of an IP session?
  – What is the median of the number of bytes in each IP session?
• Fraud detection
  – List all sessions that transmitted more than 1000 bytes
  – Identify all sessions whose duration was more than twice the normal
• Security/Denial of Service
  – List all IP addresses that have witnessed a sudden spike in traffic
  – Identify IP addresses involved in more than 1000 sessions
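To make a few of these queries concrete, here is a minimal sketch that answers three of them exactly over the eight sessions in the sample NetFlow table. It stores every record, so it is only a baseline for what a stream algorithm would approximate, not a streaming method. The tuple layout, the variable names, and reading "20K" as 20 kilobytes are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median

# (source, destination, duration, kilobytes, protocol), copied from the sample table.
# Reading "20K" as 20 kilobytes is an assumption.
sessions = [
    ("10.1.0.2", "16.2.3.7", 12, 20, "http"),
    ("18.6.7.1", "12.4.0.3", 16, 24, "http"),
    ("13.9.4.3", "11.6.8.2", 15, 20, "http"),
    ("15.2.2.9", "17.1.2.1", 19, 40, "http"),
    ("12.4.3.8", "14.8.7.4", 26, 58, "http"),
    ("10.5.1.3", "13.0.0.1", 27, 100, "ftp"),
    ("11.1.0.6", "10.3.4.5", 32, 300, "ftp"),
    ("19.7.1.2", "16.5.5.8", 18, 80, "ftp"),
]

# Traffic analysis: average session duration and median bytes per session.
avg_duration = mean(s[2] for s in sessions)
median_kbytes = median(s[3] for s in sessions)

# Traffic estimation: top source addresses by total traffic (top 2 here, not top 100).
traffic = Counter()
for src, _dst, _duration, kbytes, _proto in sessions:
    traffic[src] += kbytes
top2 = traffic.most_common(2)

print(avg_duration, median_kbytes, top2)
```

On a real stream the per-address counter and the sorted list behind the median are exactly the state that no longer fits in memory, which is why the rest of the tutorial turns to synopses.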

Data Stream Processing Algorithms
• Generally, algorithms compute approximate answers
  – It is difficult to compute answers accurately with limited memory
• Approximate answers: deterministic bounds
  – Algorithms compute only an approximate answer, but give bounds on the error
• Approximate answers: probabilistic bounds
  – Algorithms compute an approximate answer with high probability
    • With probability at least 1 − δ, the computed answer is within a factor ε of the actual answer
• Single-pass algorithms for processing streams are also applicable to (massive) terabyte databases!
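One common way to write the probabilistic guarantee, assuming "within a factor" means relative error (the slide does not spell out the exact form): let X be the exact answer, X̂ the computed estimate, ε the error factor, and δ the allowed failure probability. These symbol names are chosen here for illustration.

```latex
\Pr\left[\, \lvert \hat{X} - X \rvert \le \epsilon X \,\right] \ge 1 - \delta
```

A deterministic bound corresponds to the special case δ = 0, where the error bound holds on every run.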

Outline
• Introduction & Motivation
• Basic stream synopses computation
  – Samples: Answering queries using samples, Reservoir sampling
  – Histograms: Equi-depth histograms, On-line quantile computation
  – Wavelets: Haar-wavelet histogram construction & maintenance
• Mining data streams
• Sketch-based computation techniques
• Advanced techniques
• Future directions & Conclusions

Sampling: Basics
• Idea: A small random sample S of the data often represents all the data well
  – For a fast approximate answer, apply a "modified" query to S
  – Example: select agg from R where R.e is odd (n = 12)
    Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
    Sample S: 9 5 1 8
  – If agg is avg, return the average of the odd elements in S (answer: 5)
  – If agg is count, return the average over all elements e in S of
    • n if e is odd
    • 0 if e is even
    (answer: 12*3/4 = 9)
• Unbiased: For expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer
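The two estimators in the example can be sketched in a few lines. This is a minimal illustration using the slide's stream and sample; the function names `estimate_avg` and `estimate_count` are hypothetical, and the sample is fixed rather than drawn at random so the numbers match the slide.

```python
from fractions import Fraction

def estimate_avg(sample, predicate):
    """Average of the sampled elements that satisfy the predicate."""
    matching = [e for e in sample if predicate(e)]
    if not matching:
        return None  # no sampled element qualifies, so there is no estimate
    return Fraction(sum(matching), len(matching))

def estimate_count(sample, predicate, n):
    """Average over the sample of n (predicate holds) or 0 (it does not)."""
    return Fraction(sum(n if predicate(e) else 0 for e in sample), len(sample))

def is_odd(e):
    return e % 2 == 1

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]  # n = 12
sample = [9, 5, 1, 8]                          # the sample S from the slide

avg_est = estimate_avg(sample, is_odd)                   # (9 + 5 + 1) / 3
count_est = estimate_count(sample, is_odd, len(stream))  # 12 * 3 / 4
true_count = sum(1 for e in stream if is_odd(e))

print(avg_est, count_est)  # prints: 5 9
```

The exact count of odd elements in this stream is 8, so this particular sample overestimates by one. Unbiasedness is a statement about the expected value over all random samples, not about any single sample.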
