Data mining Association Analysis: Basic Concepts and algorithms Lecture Notes for Chapter 6 Introduction to Data Mining Tan, steinbach Kumar O Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules Td tems Bread. milk Diaper}→{Beer MMilk, Bread >Eggs, Coke, Bread, Diaper, Beer, Eggs Beer, Bread}→{Mk Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Implication means co-occurrence, Bread, Milk, Diaper, Coke not causality! n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Association Rules {Diaper} → {Beer}, {Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk}, Implication means co-occurrence, not causality!
Definition: Frequent Itemset Item set a collection of one or more items Example: Milk, Bread, Diaper] K-itemset TD ltems e An itemset that contains k items Bread, milk Support count(o) Bread, diaper, beer, eggs Frequency of occurrence of an itemset 234 Milk, Diaper, Beer, Coke E.g. o(Milk, Bread, Diaper )=2 Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Support Fraction of transactions that contain an itemset E.g. s(Milk, Bread, Diaper) )=2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Definition: Frequent Itemset Itemset – A collection of one or more items ◆ Example: {Milk, Bread, Diaper} – k-itemset ◆ An itemset that contains k items Support count () – Frequency of occurrence of an itemset – E.g. ({Milk, Bread,Diaper}) = 2 Support – Fraction of transactions that contain an itemset – E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset – An itemset whose support is greater than or equal to a minsup threshold TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke
Definition: association rule Association rule TD tems An implication expression of the form X.. where x and y are itemsets Bread. milk Bread, Diaper, Beer, eggs Example: {Mik, Diaper}→{Ber 2345 Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Rule evaluation metrics Support (s) e fraction of transactions that contain Example both x and y Mik, Diaper}→Beer Confidence(c) e Measures how often items in y o Milk, Diaper, Beer)2 =0.4 appear in transactions that T contain x C O(Milk, Di Diaper, Beer) 2 0.67 o Milk, Diaper) 3 n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Definition: Association Rule Example: {Milk ,Diaper } Beer 0.4 5 2 | T | (Milk,Diaper,Beer) = = = s 0.67 3 2 (Milk,Diaper) (Milk,Diaper,Beer) = = = c Association Rule – An implication expression of the form X → Y, where X and Y are itemsets – Example: {Milk, Diaper} → {Beer} Rule Evaluation Metrics – Support (s) ◆ Fraction of transactions that contain both X and Y – Confidence (c) ◆ Measures how often items in Y appear in transactions that contain X TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke
Another Example Customer Customer buys both Customer buys beer Transaction ID Items bought Let minimum support 50%, and 2000 AB c minimum confidence 50%. we 1000AC have 4000 A d A→C(50%,66.6% 5000 B.E.F C→A(50%,100%) n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Another Example Transaction ID Items Bought 2000 A,B,C 1000 A,C 4000 A,D 5000 B,E,F Let minimum support 50%, and minimum confidence 50%, we have – A C (50%, 66.6%) – C A (50%, 100%) Customer buys diaper Customer buys both Customer buys beer