Basic Statistical Descriptions of Data
◼ Motivation
  ◆ To better understand the data: central tendency, variation, and spread
◼ Data dispersion characteristics
  ◆ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◆ Data dispersion: analyzed with multiple granularities of precision
  ◆ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
  ◆ Folding measures into numerical dimensions
  ◆ Boxplot or quantile analysis on the transformed cube
School of Software Engineering, Tongji University

Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is the sample size and N is the population size.
  ◆ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◆ Trimmed mean: the mean computed after chopping off extreme values at both ends
◼ Median:
  ◆ Middle value if there is an odd number of values, or the average of the middle two values otherwise
  ◆ Estimated by interpolation (for grouped data):
    $\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
    where $L_1$ is the lower boundary of the median interval, $n$ is the total number of values, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
    Example grouped data:
      interval   frequency
      1-5        200
      6-15       450
      16-20      300
      21-50      1500
      51-80      700
      81-110     44
◼ Mode
  ◆ Value that occurs most frequently in the data
  ◆ Unimodal, bimodal, trimodal
  ◆ Empirical formula (for moderately skewed unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

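The measures on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `salaries` sample and the helper names (`weighted_mean`, `trimmed_mean`, `modes`, `grouped_median`) are hypothetical, and the grouped median assumes integer-valued intervals, so that an interval such as 21-50 has lower boundary 21 and width 30. Other boundary conventions (for example 20 or 20.5) shift the result slightly.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away every value")
    return mean(ordered[k:len(ordered) - k])


def modes(values):
    """All values tied for the highest frequency (one entry if unimodal)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def grouped_median(intervals):
    """Interpolated median for (lower, upper, freq) rows sorted ascending.

    Assumption: bounds are inclusive integers, so width = upper - lower + 1.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:  # this is the median interval
            width = upper - lower + 1
            return lower + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("empty frequency table")


# Hypothetical sample, used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
# Frequency table from the slide: (lower, upper, frequency).
freq_table = [(1, 5, 200), (6, 15, 450), (16, 20, 300),
              (21, 50, 1500), (51, 80, 700), (81, 110, 44)]

print(mean(salaries), median(salaries), modes(salaries))
print(trimmed_mean(salaries, 1))
print(weighted_mean([70, 80, 90], [1, 2, 1]))
print(grouped_median(freq_table))
```

With this sample the mean (58) sits above the median (54) because of the single large value 110, and the sample is bimodal (52 and 70).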
February 9, 2021, Data Mining: Concepts and Techniques
Symmetric vs. Skewed Data
◼ Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three frequency curves (symmetric, positively skewed, negatively skewed), each with the mean, median, and mode marked]

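A rough way to see the three cases numerically is to compare the mean with the median. The sketch below is a rule of thumb rather than a formal skewness test, and the function name `skew_direction` and the three tiny samples are hypothetical.

```python
from statistics import mean, median


def skew_direction(values, tol=1e-9):
    """Heuristic: a mean above the median suggests positive (right) skew,
    a mean below the median suggests negative (left) skew."""
    diff = mean(values) - median(values)
    if abs(diff) <= tol:
        return "symmetric"  # only means mean == median, not a proof of symmetry
    return "positive" if diff > 0 else "negative"


# Hypothetical samples, used only for illustration.
print(skew_direction([1, 2, 3, 4, 5]))    # mean 3.0, median 3
print(skew_direction([1, 2, 2, 3, 10]))   # long right tail pulls the mean up
print(skew_direction([1, 8, 9, 9, 10]))   # long left tail pulls the mean down
```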
Measuring the Dispersion of Data
◼ Quartiles, outliers, and boxplots
  ◆ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◆ Inter-quartile range: $IQR = Q_3 - Q_1$
  ◆ Five-number summary: min, Q1, median, Q3, max
  ◆ Boxplot: ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  ◆ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◆ Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
  ◆ Standard deviation s (or σ) is the square root of the variance $s^2$ (or $\sigma^2$)

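The quartile-based outlier rule and the one-pass ("scalable") form of the variance can be sketched as follows. The `salaries` sample and the helper names are hypothetical. Quartile conventions differ between tools; this sketch assumes the `"inclusive"` interpolation method of Python's `statistics.quantiles`, so other software may report slightly different Q1 and Q3.

```python
from math import sqrt
from statistics import pvariance, quantiles, variance


def iqr_outliers(values, k=1.5):
    """Return Q1, Q3, IQR and the values more than k * IQR outside the box."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return q1, q3, iqr, [x for x in values if x < low or x > high]


def one_pass_variance(values, sample=True):
    """Variance from running sums of x and x^2 (a single pass over the data).

    Note: this form can lose precision in floating point when the mean is
    large relative to the spread.
    """
    n = sum_x = sum_x2 = 0
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    centered = sum_x2 - sum_x ** 2 / n
    return centered / (n - 1 if sample else n)


# Hypothetical sample, used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

q1, q3, iqr, outliers = iqr_outliers(salaries)
print(q1, q3, iqr, outliers)
print(one_pass_variance(salaries), variance(salaries))                  # sample s^2
print(one_pass_variance(salaries, sample=False), pvariance(salaries))   # population
print(sqrt(one_pass_variance(salaries)))                                # sample s
```

Because only the count, the sum, and the sum of squares are kept, partial results from separate data partitions can be added together before the final division.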
Boxplot Analysis
◼ Five-number summary of a distribution
  ◆ Minimum, Q1, Median, Q3, Maximum
◼ Boxplot
  ◆ Data is represented with a box
  ◆ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  ◆ The median is marked by a line within the box
  ◆ Whiskers: two lines outside the box extended to the Minimum and Maximum (when outliers are plotted separately, to the most extreme values within the outlier threshold)
  ◆ Outliers: points beyond a specified outlier threshold, plotted individually
[Figure: boxplot on a 0-100 axis, labeled Lower Extreme, Lower Quartile, Median, Upper Quartile, Upper Extreme]
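The numbers a boxplot draws can be computed without any plotting library. The sketch below is a minimal illustration, assuming the common 1.5 × IQR outlier threshold and the `"inclusive"` quartile method of `statistics.quantiles`; the function name `boxplot_stats` and the `salaries` sample are hypothetical.

```python
from statistics import median, quantiles


def boxplot_stats(values, k=1.5):
    """Five-number summary plus whisker ends and outliers for one boxplot."""
    ordered = sorted(values)
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "max": ordered[-1],
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


# Hypothetical sample, used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
stats = boxplot_stats(salaries)
for name, value in stats.items():
    print(f"{name:>12}: {value}")
```

In this sample the maximum (110) lies beyond the upper fence, so the upper whisker ends at 70 and 110 is reported as an individual outlier.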