Tongji University: Big Data Analysis and Mining, course teaching resources (PPT lecture slides): Getting to Know Your Data

Data Objects and Attribute Types; Basic Statistical Descriptions of Data; Data Visualization; Measuring Data Similarity and Dissimilarity; Summary

Basic Statistical Descriptions of Data
◼ Motivation
  ◆ To better understand the data: central tendency, variation, and spread
◼ Data dispersion characteristics
  ◆ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◆ Data dispersion: analyzed with multiple granularities of precision
  ◆ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
  ◆ Folding measures into numerical dimensions
  ◆ Boxplot or quantile analysis on the transformed cube

School of Software Engineering, Tongji University

Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is the sample size and N is the population size.
  ◆ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◆ Trimmed mean: the mean after chopping off extreme values
◼ Median:
  ◆ Middle value if there is an odd number of values, or the average of the middle two values otherwise
  ◆ Estimated by interpolation (for grouped data):
    $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
    where $L_1$ is the lower boundary of the median interval, n is the number of values, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.

    Interval   Frequency
    1-5        200
    6-15       450
    16-20      300
    21-50      1500
    51-80      700
    81-110     44

◼ Mode
  ◆ Value that occurs most frequently in the data
  ◆ Unimodal, bimodal, trimodal
  ◆ Empirical formula (approximately valid for moderately skewed unimodal data): $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
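The measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and the sample `values` are hypothetical, and the grouped-median helper assumes that L1 is the lower bound of the median interval and that width is the upper bound minus the lower bound. Other boundary conventions give slightly different estimates.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def grouped_median(intervals):
    """Interpolated median for grouped data.

    intervals: list of (lower, upper, freq) tuples sorted by lower bound.
    Assumption: L1 = lower bound of the median interval, width = upper - lower.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:  # the median falls in this interval
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("intervals must not be empty")

# Hypothetical sample, not taken from the slides.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)       # every most-frequent value
trimmed = trimmed_mean(values, 1)          # drop one value from each end
weighted = weighted_mean([1, 2, 3], [1, 1, 2])

# Frequency table from the slide.
table = [(1, 5, 200), (6, 15, 450), (16, 20, 300),
         (21, 50, 1500), (51, 80, 700), (81, 110, 44)]
approx_median = grouped_median(table)
print(mean, median, modes, trimmed, weighted, approx_median)
```

The sample has two modes, so it is bimodal. Dropping one value from each end moves the mean toward the median, which is the intended effect of trimming.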

Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves (symmetric, positively skewed, negatively skewed) with the positions of the mean, median, and mode marked]

February 9, 2021 Data Mining: Concepts and Techniques
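As a rough illustration of how the mean and median relate under skew, the sketch below compares them on small made-up samples. The function name `skew_direction` and the sample lists are hypothetical, and comparing mean with median is only a heuristic, not a formal skewness test.

```python
import statistics

def skew_direction(values):
    """Heuristic: a long right tail pulls the mean above the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positively skewed"
    if mean < median:
        return "negatively skewed"
    return "symmetric"

# Hypothetical samples, chosen only for illustration.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
left_tailed = [15 - x for x in right_tailed]  # mirror image of the first sample
balanced = [1, 2, 3, 4, 5]

for sample in (right_tailed, left_tailed, balanced):
    print(statistics.mode(sample), statistics.median(sample),
          statistics.mean(sample), skew_direction(sample))
```

In the right-tailed sample the order is mode < median < mean, and the mirrored sample reverses it, matching the ordering the slide describes for positively and negatively skewed data.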

Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
  ◆ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◆ Inter-quartile range: IQR = Q3 - Q1
  ◆ Five-number summary: min, Q1, median, Q3, max
  ◆ Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  ◆ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◆ Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  ◆ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
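The two forms of the variance formula can be checked against each other numerically. The sketch below is one possible reading of "scalable computation": the second form needs only the running count, sum, and sum of squares, so it can be accumulated in a single pass. The function names and the sample `values` are hypothetical, and the one-pass form can lose precision in floating point when the mean is large relative to the spread.

```python
import math
import statistics

def sample_variance_two_pass(xs):
    """s^2 = 1/(n-1) * sum((x_i - mean)^2); needs the mean first."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """s^2 = 1/(n-1) * [sum(x_i^2) - (sum(x_i))^2 / n]; a single scan."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

def population_variance(xs):
    """sigma^2 = 1/N * sum(x_i^2) - mu^2."""
    size = len(xs)
    mu = sum(xs) / size
    return sum(x * x for x in xs) / size - mu * mu

# Hypothetical sample, not taken from the slides.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
s2_two = sample_variance_two_pass(values)
s2_one = sample_variance_one_pass(values)
sigma2 = population_variance(values)
s = math.sqrt(s2_two)        # sample standard deviation
sigma = math.sqrt(sigma2)    # population standard deviation
print(s2_two, s2_one, sigma2, s, sigma)
```

Both sample-variance functions agree with `statistics.variance`, and the population form agrees with `statistics.pvariance`, up to floating-point rounding.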

Boxplot Analysis
◼ Five-number summary of a distribution
  ◆ Minimum, Q1, Median, Q3, Maximum
◼ Boxplot
  ◆ Data is represented with a box
  ◆ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  ◆ The median is marked by a line within the box
  ◆ Whiskers: two lines outside the box extended to Minimum and Maximum
  ◆ Outliers: points beyond a specified outlier threshold, plotted individually
[Figure: boxplot on a 0-100 scale with the lower extreme, lower quartile, median, upper quartile, and upper extreme labeled]
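The quantities a boxplot draws can be computed directly. The sketch below uses `statistics.quantiles` from the Python standard library with `method="inclusive"`; quartile conventions differ between tools, so other software may report slightly different Q1 and Q3. The function names, the sample `values`, and the default threshold of 1.5 × IQR are assumptions made for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def find_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]

# Hypothetical sample, not taken from the slides.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(values)
outliers = find_outliers(values)
# When outliers are plotted individually, many plotting tools end the
# whiskers at the most extreme values that are not outliers.
inliers = [x for x in values if x not in outliers]
whiskers = (min(inliers), max(inliers))
print(summary, outliers, whiskers)
```

Here the largest value lies above the upper fence and is reported as an outlier, so the upper whisker stops at the next largest value rather than at the maximum.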
