Attribute Types
◼ Nominal: categories, states, or “names of things”
    ◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
    ◼ Marital status, occupation, ID numbers, zip codes
◼ Binary
    ◼ Nominal attribute with only 2 states (0 and 1)
    ◼ Symmetric binary: both outcomes equally important
        ◼ e.g., gender
    ◼ Asymmetric binary: outcomes not equally important
        ◼ e.g., medical test (positive vs. negative)
        ◼ Convention: assign 1 to the most important outcome (e.g., HIV positive)
◼ Ordinal
    ◼ Values have a meaningful order (ranking), but the magnitude between successive values is not known
    ◼ Size = {small, medium, large}, grades, army rankings
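As a rough illustration of how these three attribute types might be handled in code, the sketch below stores a nominal attribute as an unordered set, an ordinal attribute as an ordered list, and an asymmetric binary attribute as 0/1. The names `HAIR_COLORS`, `SIZE_ORDER`, `size_rank`, and `encode_asymmetric` are hypothetical and chosen only for this example.

```python
# Nominal: a set of labels with no order; only equality tests are meaningful.
HAIR_COLORS = {"auburn", "black", "blond", "brown", "grey", "red", "white"}

# Ordinal: the list position encodes rank, but the gap between ranks
# has no numeric meaning.
SIZE_ORDER = ["small", "medium", "large"]


def size_rank(size):
    """Return the rank of an ordinal size label (0 = smallest)."""
    return SIZE_ORDER.index(size)


def encode_asymmetric(test_result):
    """Asymmetric binary: map the more important outcome to 1."""
    return 1 if test_result == "positive" else 0


shirts = ["large", "small", "medium", "small"]
print(sorted(shirts, key=size_rank))   # ['small', 'small', 'medium', 'large']
print("black" in HAIR_COLORS)          # True
print(encode_asymmetric("positive"))   # 1
```

Sorting by rank is valid for ordinal data, whereas subtracting ranks (for example "large minus small") would suggest a magnitude that the attribute does not carry.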
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
    ◼ Measured on a scale of equal-sized units
    ◼ Values have order
    ◼ E.g., temperature in °C or °F, calendar dates
    ◼ No true zero-point
◼ Ratio
    ◼ Inherent zero-point
    ◼ We can speak of one value as being a multiple (or ratio) of another (10 K is twice as high as 5 K)
    ◼ E.g., temperature in Kelvin, length, counts, monetary quantities
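The difference between interval and ratio scales can be checked with a small calculation. The sketch below is only illustrative (the helper name `celsius_to_kelvin` is an assumption): dividing two Celsius readings gives a number with no physical meaning, while dividing the same readings in Kelvin gives a true ratio.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale reading (Celsius) to a ratio-scale one (Kelvin)."""
    return celsius + 273.15


# Naive ratio on the interval scale: suggests 10 °C is "twice as hot" as 5 °C.
naive_ratio = 10.0 / 5.0

# Ratio on the Kelvin scale, which has a true zero-point.
true_ratio = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # about 1.018

# Differences, by contrast, are meaningful on both scales.
print(10.0 - 5.0)            # 5.0 degrees in either unit
```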
Discrete vs. Continuous Attributes
◼ Discrete Attribute
    ◼ Has only a finite or countably infinite set of values
    ◼ E.g., zip codes, profession, or the set of words in a collection of documents
    ◼ Sometimes represented as integer variables
    ◼ Note: Binary attributes are a special case of discrete attributes
◼ Continuous Attribute
    ◼ Has real numbers as attribute values
    ◼ E.g., temperature, height, or weight
    ◼ Practically, real values can only be measured and represented using a finite number of digits
    ◼ Continuous attributes are typically represented as floating-point variables
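A short sketch of both cases, using made-up data: the set of words in a tiny document collection is a discrete attribute domain, and a floating-point sum shows what "a finite number of digits" means in practice. The variable names (`documents`, `vocabulary`, `total`) are illustrative only.

```python
import math

# Discrete: the set of distinct words in a (toy) document collection is finite.
documents = ["data mining finds patterns", "mining data at scale"]
vocabulary = {word for doc in documents for word in doc.split()}
print(len(vocabulary))            # 6 distinct words

# Continuous: real values are stored with finite precision, so exact
# equality tests on floating-point results can fail.
total = 0.1 + 0.2
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True
```

Comparing continuous attributes with a tolerance, as `math.isclose` does, is usually safer than testing for exact equality.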
Basic Statistical Descriptions of Data
◼ Motivation
    ◼ To better understand the data: central tendency, variation, and spread
◼ Data dispersion characteristics
    ◼ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
    ◼ Data dispersion: analyzed with multiple granularities of precision
    ◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
    ◼ Folding measures into numerical dimensions
    ◼ Boxplot or quantile analysis on the transformed cube
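The dispersion characteristics listed above (min, quartiles, max, outliers) can be sketched with Python's standard `statistics` module. The data values are invented for illustration, and flagging outliers at 1.5 × IQR beyond the quartiles is a common boxplot convention assumed here, not something the slide specifies. The helper names `five_number_summary` and `iqr_outliers` are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values further than k * IQR outside the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [2, 4, 4, 5, 7, 8, 9, 11, 30]   # illustrative values only
print(five_number_summary(data))        # (2, 4.0, 7.0, 9.0, 30)
print(iqr_outliers(data))               # [30]
```

Different quantile definitions give slightly different quartiles on small samples; `method="inclusive"` is one choice among several.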
Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
    ◼ Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    ◼ Population mean: $\mu = \frac{\sum x}{N}$
    ◼ Note: n is sample size and N is population size.
    ◼ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
    ◼ Trimmed mean: chopping extreme values
◼ Median:
    ◼ Middle value if odd number of values, or average of the middle two values otherwise
    ◼ Estimated by interpolation (for grouped data):
      $median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width$
      where $L_1$ is the lower boundary of the median interval, $n$ is the total number of values, $(\sum freq)_l$ is the sum of the frequencies of all intervals below the median interval, $freq_{median}$ is the frequency of the median interval, and $width$ is its width.
    ◼ Example grouped data:

      Interval    Frequency
      1-5         200
      6-15        450
      16-20       300
      21-50       1500
      51-80       700
      81-110      44

◼ Mode
    ◼ Value that occurs most frequently in the data
    ◼ Unimodal, bimodal, trimodal
    ◼ Empirical formula (holds approximately for moderately skewed unimodal data): mean − mode ≈ 3 × (mean − median)
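The measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and the small `values` list are made up for the example, and the grouped median assumes integer-valued intervals, taking the lower bound of the median interval as $L_1$ and `upper - lower + 1` as its width. The frequencies are the ones from the slide's grouped-data table.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k removes every value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


def grouped_median(intervals):
    """Interpolated median for (lower, upper, freq) rows sorted by lower bound.

    Assumption: integer-valued intervals, so width = upper - lower + 1.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            width = upper - lower + 1
            return lower + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("empty frequency table")


values = [1, 2, 2, 3, 4, 100]           # illustrative data with one extreme value
print(mean(values))                      # pulled upward by the value 100
print(trimmed_mean(values, 1))           # 2.75
print(median(values))                    # 2.5
print(multimode(values))                 # [2]
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25

# Frequency table from the slide.
TABLE = [(1, 5, 200), (6, 15, 450), (16, 20, 300),
         (21, 50, 1500), (51, 80, 700), (81, 110, 44)]
print(round(grouped_median(TABLE), 2))   # 33.94
```

With n = 3194, half the count (1597) falls in the 21-50 interval, which has 950 values below it, so the estimate is 21 + (1597 − 950) / 1500 × 30 ≈ 33.94. Using class boundaries such as 20.5 instead of 21 would shift the result slightly.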