Static Discretization of Quantitative Attributes
◼ Quantitative attributes are discretized prior to mining using a concept hierarchy
  ◼ Numeric values are replaced by ranges
◼ In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
◼ A data cube is well suited for mining
  ◼ The cells of an n-dimensional cuboid correspond to the predicate sets
  ◼ Mining from data cubes can be much faster
[Figure: lattice of cuboids: (); (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys)]
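The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bin boundaries, the labels, the sample rows, and the helper names `discretize` and `cuboid_counts` are all hypothetical, and a plain in-memory counter stands in for a real data cube. Each numeric value is replaced by a predefined range, and one pass over the rows then fills the counts for every cuboid in the (age, income, buys) lattice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept hierarchy: fixed integer ranges chosen before mining.
AGE_BINS = [(0, 29, "age:<30"), (30, 49, "age:30-49"), (50, 200, "age:50+")]
INCOME_BINS = [
    (0, 39_999, "income:<40K"),
    (40_000, 79_999, "income:40K-79K"),
    (80_000, 10**9, "income:80K+"),
]


def discretize(value, bins):
    """Replace an integer value by the label of the range that contains it."""
    for low, high, label in bins:
        if low <= value <= high:
            return label
    raise ValueError(f"value {value} falls outside every bin")


def cuboid_counts(rows):
    """Count every non-empty predicate set in a single pass over the rows."""
    counts = Counter()
    for age, income, buys in rows:
        predicates = (
            discretize(age, AGE_BINS),
            discretize(income, INCOME_BINS),
            f"buys:{buys}",
        )
        # Each subset of the three predicates is one cell of one cuboid.
        for k in range(1, len(predicates) + 1):
            for subset in combinations(predicates, k):
                counts[subset] += 1
    return counts


# Hypothetical sample data: (age, income, item bought).
rows = [
    (25, 30_000, "laptop"),
    (27, 35_000, "laptop"),
    (45, 90_000, "tv"),
    (52, 85_000, "tv"),
    (26, 32_000, "phone"),
]
counts = cuboid_counts(rows)
min_count = 2  # hypothetical minimum support count
frequent = {ps: c for ps, c in counts.items() if c >= min_count}
```

Once the counts are materialized, frequent predicate sets are read directly from them, with no further scans of the base table.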
Quantitative Association Rules Based on Statistical Inference Theory [Aumann and Lindell@DMKD'03]
◼ Finding extraordinary and therefore interesting phenomena, e.g., (Sex = female) => Wage: mean = $7/hr (overall mean = $9/hr)
  ◼ LHS: a subset of the population
  ◼ RHS: an extraordinary behavior of this subset
◼ The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence
◼ Subrule: highlights the extraordinary behavior of a subset of the population of the super rule
  ◼ E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr
◼ Two forms of rules
  ◼ Categorical => quantitative rules, or Quantitative => quantitative rules
  ◼ E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr
◼ Open problem: Efficient methods for LHS containing two or more quantitative attributes
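The acceptance step can be illustrated with a simple one-sample Z-test. This is a sketch under assumptions, not the procedure from the cited paper: it treats the overall mean and standard deviation as known, and the wage sample, the standard deviation of $3/hr, the significance level, and the function name `z_test_rule` are all hypothetical.

```python
import math
from statistics import NormalDist, mean


def z_test_rule(subset_values, population_mean, population_std, alpha=0.05):
    """Two-sided one-sample Z-test of the subset mean against the overall mean.

    Returns (subset_mean, z, p_value, accepted). The rule is accepted when the
    subset mean differs significantly from the overall mean at level alpha.
    """
    n = len(subset_values)
    subset_mean = mean(subset_values)
    standard_error = population_std / math.sqrt(n)
    z = (subset_mean - population_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return subset_mean, z, p_value, p_value < alpha


# Hypothetical hourly wages for the subset (Sex = female); their mean is $7/hr.
wages = [6.0, 7.0, 8.0] * 8 + [7.0]

# Overall mean of $9/hr as on the slide; the $3/hr standard deviation is assumed.
subset_mean, z, p_value, accepted = z_test_rule(wages, 9.0, 3.0)
```

With 25 observations the standard error is 0.6, so the subset mean lies more than three standard errors below the overall mean and the rule is accepted at the assumed 5% level.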
Negative and Rare Patterns
◼ Rare patterns: Very low support but interesting
  ◼ E.g., buying Rolex watches
  ◼ Mining: Setting individual-based or special group-based support thresholds for valuable items
◼ Negative patterns
  ◼ Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns
◼ Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent
Defining Negative Correlated Patterns (I)
◼ Definition 1 (support-based)
  ◼ If itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) < sup(X) * sup(Y)
  ◼ Then X and Y are negatively correlated
◼ Problem: A store sold two needle packages A and B 100 times each, with only one transaction containing both A and B
  ◼ When there are 200 transactions in total, we have s(A ∪ B) = 1/200 = 0.005 and s(A) * s(B) = 0.5 * 0.5 = 0.25, so s(A ∪ B) < s(A) * s(B)
  ◼ When there are 10^5 transactions, we have s(A ∪ B) = 1/10^5 and s(A) * s(B) = 1/10^3 * 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) * s(B)
◼ Where is the problem? Null transactions: the support-based definition is not null-invariant!
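The needle package arithmetic can be checked with exact fractions. The sketch below reads the two transaction totals as 200 and 10^5 (an assumption consistent with the 1/10^3 supports in the example) and takes each package as sold 100 times with one joint sale; the function name `support_check` is hypothetical.

```python
from fractions import Fraction


def support_check(n_transactions, count_a=100, count_b=100, count_both=1):
    """Return (s(A ∪ B), s(A) * s(B), verdict of the support-based test)."""
    s_a = Fraction(count_a, n_transactions)
    s_b = Fraction(count_b, n_transactions)
    s_ab = Fraction(count_both, n_transactions)
    product = s_a * s_b
    # Definition 1: negatively correlated when s(A ∪ B) < s(A) * s(B).
    return s_ab, product, s_ab < product


for n in (200, 10**5):
    s_ab, product, negative = support_check(n)
    print(f"N={n}: s(A ∪ B)={s_ab}, s(A)*s(B)={product}, negative={negative}")
```

The sales of A and B are identical in both runs; only the number of transactions containing neither item changes, yet the verdict flips. That is the sense in which the definition is not null-invariant.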
Defining Negative Correlated Patterns (II)
◼ Definition 2 (negative itemset-based)
  ◼ X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
  ◼ Itemset X is negatively correlated if s(X) < ∏ s(x_i), where x_i ∈ X and s(x_i) is the support of x_i
  ◼ This definition suffers from a similar null-invariance problem
◼ Definition 3 (Kulczynski measure-based)
  ◼ If itemsets X and Y are frequent, but (P(X|Y) + P(Y|X))/2 < є, where є is a negative pattern threshold, then X and Y are negatively correlated
  ◼ Ex. For the same needle package problem, whether there are 200 or 10^5 transactions, (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 = 0.01, so A and B are negatively correlated for any threshold є > 0.01 (at є = 0.01 exactly, the strict inequality is not satisfied)
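A short sketch of the Kulczynski-based test for the needle package example follows. The function names and the threshold є = 0.02 are hypothetical choices for illustration. The total number of transactions never enters the computation, which is why the measure is null-invariant.

```python
def kulczynski(count_a, count_b, count_both):
    """Average of the two conditional probabilities P(A|B) and P(B|A)."""
    p_a_given_b = count_both / count_b
    p_b_given_a = count_both / count_a
    return (p_a_given_b + p_b_given_a) / 2


def negatively_correlated(count_a, count_b, count_both, epsilon):
    """Definition 3: negatively correlated when the measure is below epsilon."""
    return kulczynski(count_a, count_b, count_both) < epsilon


# Needle packages: A and B each sold 100 times, one transaction with both.
value = kulczynski(100, 100, 1)

# Hypothetical threshold of 0.02; any value above 0.01 gives the same verdict.
verdict = negatively_correlated(100, 100, 1, epsilon=0.02)
```

Because the measure evaluates to exactly 0.01 here, a threshold of є = 0.01 does not satisfy the strict inequality, while any larger threshold classifies A and B as negatively correlated.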