Efficient Implementation of Apriori in SQL Hard to get good performance out of pure SQL (SQL-92) based approaches alone Make use of object-relational extensions like UDFS BLOBs Table functions etc Get orders of magnitude improvement S Sarawagi, s. Thomas, andR. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD98 February 4, 2021 Data Mining: Concepts and Techniques 18
February 4, 2021 Data Mining: Concepts and Techniques 18 Efficient Implementation of Apriori in SQL ◼ Hard to get good performance out of pure SQL (SQL-92) based approaches alone ◼ Make use of object-relational extensions like UDFs, BLOBs, Table functions etc. ◼ Get orders of magnitude improvement ◼ S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD’98
Challenges of frequent Pattern Mining Challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates February 4, 2021 Data Mining: Concepts and Techniques 19
February 4, 2021 Data Mining: Concepts and Techniques 19 Challenges of Frequent Pattern Mining ◼ Challenges ◼ Multiple scans of transaction database ◼ Huge number of candidates ◼ Tedious workload of support counting for candidates ◼ Improving Apriori: general ideas ◼ Reduce passes of transaction database scans ◼ Shrink number of candidates ◼ Facilitate support counting of candidates
Partition: Scan Database Only Twice Any itemset that is potentially frequent in db must be frequent in at least one of the partitions of db Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. omiecinski, and s navathe. an efficient algorithm for mining association in large databases. In VLDB95 February 4, 2021 Data Mining: Concepts and Techniques
February 4, 2021 Data Mining: Concepts and Techniques 20 Partition: Scan Database Only Twice ◼ Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB ◼ Scan 1: partition database and find local frequent patterns ◼ Scan 2: consolidate global frequent patterns ◼ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95
DHP: Reduce the number of candidates A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent Candidates: a, b, c,d,e Hash entries: ab, ad ae tbd, be de] Frequent 1-itemset: a, b d, e ab is not a candidate 2-itemset if the sum of count of fab, ad ae is below support threshold 1. Park, m, chen and p yu, an effective hash-based algorithm for mining association rules. In SIGMOD95 February 4, 2021 Data Mining: Concepts and Techniques 21
February 4, 2021 Data Mining: Concepts and Techniques 21 DHP: Reduce the Number of Candidates ◼ A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent ◼ Candidates: a, b, c, d, e ◼ Hash entries: {ab, ad, ae} {bd, be, de} … ◼ Frequent 1-itemset: a, b, d, e ◼ ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold ◼ J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
Sampling for Frequent Patterns Select a sample of original database mine frequent patterns within sample using apriori Scan database once to verify frequent itemsets found in sample only borders of closure of frequent patterns are checked Example: check abcd instead of ab ac,., ei Scan database again to find missed frequent patterns H. Toivonen Sampling large databases for association rules. In VLDB96 February 4, 2021 Data Mining: Concepts and Techniques 22
February 4, 2021 Data Mining: Concepts and Techniques 22 Sampling for Frequent Patterns ◼ Select a sample of original database, mine frequent patterns within sample using Apriori ◼ Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked ◼ Example: check abcd instead of ab, ac, …, etc. ◼ Scan database again to find missed frequent patterns ◼ H. Toivonen. Sampling large databases for association rules. In VLDB’96