The Apriori algorithm Example Database d itemset sup I itemset sup TID Items {1} 100134 {2} 200235 Scan d {2} {3} 23313 {3} 3001235 2333 5 40025 5 2 itemset sup c, Litemset L ,itemset sup {12} Scan d(1 2 {13}2 {13}2 {13} {23}2 {15}1 {15} 25}3 {23}2 {23} 35}2 {25}3 {25} 35}2 35 C3 itemset ScanD L3 litemset su 2351 23 35}2 O Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› The Apriori Algorithm — Example TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 Database D itemset sup. {1} 2 {2} 3 {3} 3 {4} 1 {5} 3 itemset sup. {1} 2 {2} 3 {3} 3 {5} 3 Scan D C1 L1 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2 itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2 L2 C2 C2 Scan D C3 L3 itemset {2 3 5} Scan D itemset sup {2 3 5} 2
Reducing Number of comparisons Candidate counting Scan the database of transactions to determine the support of each candidate itemset o reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets Transactions Hash Structure TID Items Bread. milk 2 Bread. Diaper. beer, eggs Milk, Diaper, Beer, Coke k 4Bread, Milk, Diaper, Beer 5Bread, Milk, Diaper,Coke Buckets n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Reducing Number of Comparisons Candidate counting: – Scan the database of transactions to determine the support of each candidate itemset – To reduce the number of comparisons, store the candidates in a hash structure ◆ Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke N Transactions Hash Structure k Buckets
Generate Hash tree Suppose you have 15 candidate itemsets of length 3: {145}{124},{457{125},{458}1159}{136}{234}{567{45} 356}{357},{689},{367}{368} You need · Hash function Max leaf size: max number of itemsets stored in a leaf node(if number of candidate itemsets exceeds max leaf size, split the node Hash function 3.6 567 145 345356367 136 2.5.8 357368 124 689 457 25159 458 n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Generate Hash Tree 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1,4,7 2,5,8 3,6,9 Hash function Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: • Hash function • Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
Association Rule Discovery: Hash tree Hash function Candidate Hash tree 1.4.7 3,6,9 2,5,8 234 56 145 136 345 356 367 Hash on 357 368 1.40r7 124 125 15 689 457458 n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Association Rule Discovery: Hash tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function Candidate Hash Tree Hash on 1, 4 or 7
Association Rule Discovery: Hash tree Hash function Candidate Hash tree 1,4,7 3,6,9 2,5,8 234 145 136 345 356 367 Hash on 357 368 2.508 124 689 125 457458 n Steinbach. Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Association Rule Discovery: Hash tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function Candidate Hash Tree Hash on 2, 5 or 8