Data Mining: Concepts and Techniques (February 9, 2021)
Visualization of Data Dispersion: 3-D Boxplots
[Figure: 3-D boxplots; one axis is labeled "revenue"]
Properties of Normal Distribution Curve
◼ The normal (distribution) curve
◆ From μ − σ to μ + σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
◆ From μ − 2σ to μ + 2σ: contains about 95% of them
◆ From μ − 3σ to μ + 3σ: contains about 99.7% of them
[Figure: normal curve with the 68%, 95%, and 99.7% regions marked]
School of Software Engineering, Tongji University (同济大学软件学院)
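The three percentages above can be checked numerically. The following is a minimal sketch, assuming the standard identity P(|X − μ| < kσ) = erf(k/√2) for a normal variable; the function name `normal_coverage` is a hypothetical helper introduced here for illustration.

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable falls within k standard
    deviations of its mean: P(|X - mu| < k * sigma) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

# Coverage for the 1-, 2-, and 3-sigma intervals listed on the slide.
coverage = {k: normal_coverage(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"mu +/- {k} sigma covers {100 * p:.1f}% of the measurements")
```

The result does not depend on μ or σ, which is why the rule can be stated once for every normal curve.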
Graphic Displays of Basic Statistical Descriptions
◼ Boxplot: graphic display of the five-number summary
◼ Histogram: the x-axis shows values, the y-axis represents frequencies
◼ Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
◼ Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
◼ Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
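As a rough illustration of the first and third items, the sketch below computes a five-number summary and the (f_i, x_i) pairs of a quantile plot. Two choices here are assumptions, not taken from the slide: the quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles`, and f_i is taken as (i − 0.5)/N, one common convention. The sample data and function names are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a boxplot draws.
    Assumption: quartiles use the 'inclusive' interpolation method."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def quantile_pairs(values):
    """Pair each sorted value x_i with f_i, so that roughly 100 * f_i %
    of the data are <= x_i. Assumption: f_i = (i - 0.5) / N."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

sample = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # hypothetical data
summary = five_number_summary(sample)
pairs = quantile_pairs(sample)
print(summary)
print(pairs)
```

Plotting the x_i values against the f_i values gives the quantile plot; plotting the quantiles of two such samples against each other gives a q-q plot.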
Histogram Analysis
◼ Histogram: graph display of tabulated frequencies, shown as bars
◼ It shows what proportion of cases fall into each of several categories
◼ Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
◼ The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis ticks from 10000 to 90000, y-axis ticks from 0 to 40]
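The point about area versus height can be made concrete with bins of unequal width. This is a minimal sketch with hypothetical data and bin edges: each bar's height is set to count / (N × width), so that the bar's area, not its height, equals the proportion of cases in the bin.

```python
def histogram_density(values, edges):
    """Return bar heights for adjacent, non-overlapping bins such that
    height * width equals the fraction of values falling in each bin."""
    n = len(values)
    heights = []
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        is_last = i == len(edges) - 2
        # Bins are half-open [lo, hi); the last bin also includes its right edge.
        count = sum(1 for v in values if lo <= v < hi or (is_last and v == hi))
        heights.append(count / (n * (hi - lo)))
    return heights

values = [1, 2, 2, 3, 5, 8, 12, 18]  # hypothetical data
edges = [0, 4, 10, 20]               # bins of width 4, 6, and 10

heights = histogram_density(values, edges)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]
print(heights)
print(areas)
```

In this example the last two bins hold the same number of values, yet the wider bin gets the shorter bar; reading the heights alone, as one would on a bar chart, would misstate the proportions.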
Histograms Often Tell More than Boxplots
◼ The two histograms shown on the left may have the same boxplot representation
◼ The same values for: min, Q1, median, Q3, max
◼ But they have rather different data distributions
[Figure: two histograms]
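A small constructed example shows how this can happen. The two data sets below are hypothetical, chosen so that their five-number summaries coincide while their bin counts differ; quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is an assumption rather than a convention stated on the slide.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using 'inclusive' quartiles."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def bin_counts(values, lo, hi, bins):
    """Count values in equal-width bins over [lo, hi]; hi goes in the last bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical data: a clusters at the ends and the middle, b near the quartiles.
a = [0, 1, 25, 49, 50, 51, 75, 99, 100]
b = [0, 24, 25, 26, 50, 74, 75, 76, 100]

summary_a = five_number_summary(a)
summary_b = five_number_summary(b)
counts_a = bin_counts(a, 0, 100, 5)
counts_b = bin_counts(b, 0, 100, 5)

print(summary_a, summary_b)
print(counts_a, counts_b)
```

Both data sets would be drawn as the same boxplot, but their histograms have different shapes: one peaks in the center and at the extremes, the other around the two quartiles.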