Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
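To make the baseline cost concrete, here is a minimal brute-force sketch (the three transactions are a made-up toy example, not from the slides): all M = 2^d - 1 non-empty candidates are enumerated and each is matched against every one of the N transactions, which is exactly the O(NM) work the three strategies above try to reduce.

```python
from itertools import combinations

# Hypothetical toy database of N = 3 transactions.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
]
items = sorted(set().union(*transactions))  # the d distinct items

support = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        c = set(candidate)
        # every candidate is matched against every transaction: O(NM)
        support[candidate] = sum(c <= t for t in transactions)

print(len(support))  # 2**len(items) - 1 = 63 candidates for d = 6
```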
Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
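The property is easy to check empirically. A minimal sketch, assuming a small hypothetical dataset: for every itemset Y and each of its immediate subsets X, the support count of X is at least that of Y, because every transaction containing Y necessarily contains X.

```python
from itertools import combinations

# Hypothetical transactions for the check.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer'},
    {'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper'},
]

def support(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
# For every itemset Y, every immediate subset X satisfies s(X) >= s(Y).
for k in range(2, len(items) + 1):
    for y in combinations(items, k):
        for x in combinations(y, k - 1):
            assert support(set(x)) >= support(set(y))
```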
Illustrating Apriori Principle

[Figure: itemset lattice over items A, B, C, D, E. The 2-itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space without being counted.]
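A quick way to quantify the figure: with d = 5 items, pruning the proper supersets of a single infrequent 2-itemset removes 7 of the 31 non-empty itemsets without counting them. A sketch (the items and the choice of AB follow the figure; the code itself is illustrative):

```python
from itertools import combinations

items = ['A', 'B', 'C', 'D', 'E']
infrequent = {'A', 'B'}  # AB was found to be infrequent

# Every proper superset of AB can be pruned without counting it.
pruned = [c for k in range(1, len(items) + 1)
          for c in combinations(items, k)
          if infrequent < set(c)]

print(len(pruned))  # 7 of the 31 non-empty itemsets are pruned
print(pruned)       # ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE
```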
Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

  Item     Count
  Bread    4
  Coke     2
  Milk     4
  Beer     3
  Diaper   4
  Eggs     1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:

  Itemset           Count
  {Bread, Milk}     3
  {Bread, Beer}     2
  {Bread, Diaper}   3
  {Milk, Beer}      2
  {Milk, Diaper}    3
  {Beer, Diaper}    3

Triplets (3-itemsets):

  Itemset                  Count
  {Bread, Milk, Diaper}    3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
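The first two levels of this example can be reproduced in a few lines. A sketch, assuming the five-transaction market-basket dataset used earlier in the chapter (the transactions themselves are not on this slide, so treat them as an assumption; only the counting logic is the point):

```python
from collections import Counter
from itertools import combinations

transactions = [                       # assumed, not shown on the slide
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]
minsup = 3

# 1-itemsets: count every item, keep those with support >= minsup
c1 = Counter(item for t in transactions for item in t)
f1 = sorted(item for item, n in c1.items() if n >= minsup)
print(f1)  # Coke (2) and Eggs (1) are eliminated

# 2-itemsets: candidates built only from the 4 frequent items, so just
# C(4,2) = 6 pairs are counted instead of C(6,2) = 15
c2 = {pair: sum(set(pair) <= t for t in transactions)
      for pair in combinations(f1, 2)}
print(c2)
```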
Apriori Algorithm

Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
  ◆ Generate length-(k+1) candidate itemsets from length-k frequent itemsets if their first k-1 items are identical
  ◆ Prune candidate itemsets containing subsets of length k that are infrequent
  ◆ Count the support of each candidate by scanning the DB
  ◆ Eliminate candidates that are infrequent, leaving only those that are frequent
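A compact sketch of this method (illustrative, not the book's reference implementation): itemsets are kept as sorted tuples, so the "first k-1 items identical" merge and the subset-pruning test are direct to express.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Apriori as outlined above: merge frequent k-itemsets that share
    their first k-1 items, prune candidates with an infrequent subset,
    then count support by scanning the DB."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c for c, n in counts.items() if n >= minsup}
    all_frequent = set(frequent)

    k = 1
    while frequent:
        prev = sorted(frequent)
        candidates = set()
        # Generate length-(k+1) candidates from length-k frequent
        # itemsets whose first k-1 items are identical
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:
                    cand = prev[i] + (prev[j][-1],)
                    # Prune: every length-k subset must be frequent
                    if all(sub in frequent
                           for sub in combinations(cand, k)):
                        candidates.add(cand)
        # Count support by scanning the DB, eliminate the infrequent
        frequent = {c for c in candidates
                    if sum(set(c) <= t for t in transactions) >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

Calling apriori(transactions, 3) on a list of transaction sets returns every frequent itemset as a sorted tuple; keeping the tuples sorted is what makes the prefix-based merge generate each candidate exactly once.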