Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$
  Note: n is sample size and N is population size.
  ◼ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◼ Trimmed mean: chopping extreme values
◼ Median:
  ◼ Middle value if odd number of values, or average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
    where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
    Example grouped data:
    age      frequency
    1-5      200
    6-15     450
    16-20    300
    21-50    1500
    51-80    700
    81-110
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
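The measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `sample` values, the trimming parameter `k`, and the `(lower_boundary, width, frequency)` layout for grouped data are all assumptions made for the example.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values.

    k is an illustrative parameter; the slide does not fix how much to trim.
    """
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away every value")
    return mean(ordered[k:len(ordered) - k])


def grouped_median(intervals):
    """Median estimated by interpolation for grouped data.

    intervals: list of (lower_boundary, width, frequency) tuples,
    sorted by lower boundary (an assumed layout for this sketch).
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, width, freq in intervals:
        if below + freq >= n / 2:
            # median = L1 + ((n/2 - (sum freq)_l) / freq_median) * width
            return lower + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("no intervals given")


# Hypothetical sample, used only for illustration.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(sample))
print("median:", median(sample))
print("modes:", multimode(sample))  # more than one entry means multimodal
print("trimmed mean (k=1):", trimmed_mean(sample, 1))
```

Note how the single large value pulls the mean above the median, while trimming one value from each end brings the mean back toward the center of the data.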
January 30, 2021 Data Mining: Concepts and Techniques
Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves (symmetric, positively skewed, negatively skewed) with the positions of the mean, median, and mode marked]
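A rough way to see the difference between the three shapes is to compare the mean with the median. The sketch below is a heuristic only, not a formal skewness test; the function name `skew_direction` and the sample lists are hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Guess the skew direction from the order of mean and median.

    Heuristic: a long right tail pulls the mean above the median,
    a long left tail pulls it below. Equal mean and median suggest,
    but do not prove, symmetry.
    """
    m, med = mean(values), median(values)
    if m > med:
        return "positively skewed"
    if m < med:
        return "negatively skewed"
    return "roughly symmetric"


# Hypothetical samples, one per shape.
print(skew_direction([1, 2, 2, 3, 3, 3, 4, 4, 5]))
print(skew_direction([1, 1, 1, 2, 2, 3, 10]))
print(skew_direction([1, 8, 9, 9, 10, 10, 10]))
```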
Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
  ◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◼ Inter-quartile range: IQR = Q3 − Q1
  ◼ Five-number summary: min, Q1, median, Q3, max
  ◼ Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
  ◼ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◼ Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  ◼ Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
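The IQR outlier rule and the "scalable" form of the sample variance can be sketched as follows. Assumptions for this example: the `sample` values are made up, and quartiles are taken with the `inclusive` method of Python's `statistics.quantiles`; other quartile conventions give slightly different cut points.

```python
import math
from statistics import quantiles, variance


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (k = 1.5 is the usual choice)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def one_pass_variance(values):
    """Sample variance from running sums, in a single pass over the data.

    Uses s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1), so only n, sum(x),
    and sum(x^2) need to be kept, and partial sums from separate chunks
    can simply be added together.
    """
    n, total, total_sq = 0, 0.0, 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("need at least two values")
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical sample, used only for illustration.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("outliers:", iqr_outliers(sample))
print("variance (one pass):", one_pass_variance(sample))
print("variance (statistics):", variance(sample))
print("standard deviation:", math.sqrt(one_pass_variance(sample)))
```

One caveat on the design: with floating-point data whose mean is large relative to its spread, the running-sums form can lose precision, so the two-pass definition is the safer choice when the data fits in memory.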
Boxplot Analysis
◼ Five-number summary of a distribution
  ◼ Minimum, Q1, Median, Q3, Maximum
◼ Boxplot
  ◼ Data is represented with a box
  ◼ The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
  ◼ The median is marked by a line within the box
  ◼ Whiskers: two lines outside the box extended to Minimum and Maximum
  ◼ Outliers: points beyond a specified outlier threshold, plotted individually
[Figure: boxplot annotated with its quartiles, median, and extremes, on an axis from 10 to 100]
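The numbers a boxplot is drawn from can be computed without any plotting library. In this sketch the function name `boxplot_stats`, the 1.5 × IQR threshold, and the `sample` data are assumptions. It also assumes that when outliers are plotted individually, each whisker stops at the most extreme value still inside the threshold; with no outliers that is simply the minimum and maximum.

```python
from statistics import quantiles


def boxplot_stats(values, k=1.5):
    """Five-number summary, whisker ends, and outliers for one boxplot."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "five_number": (ordered[0], q1, med, q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


# Hypothetical sample, used only for illustration.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

for name, value in boxplot_stats(sample).items():
    print(name, value)
```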
Visualization of Data Dispersion: 3-D Boxplots
[Figure: 3-D boxplots; visible axis labels include revenue and cost]