Bloom Filters A bloom filter is a probabilistic data structure used to check membership of a value in a set May return true(with low probability)even if an element is not present But never returns false if an element is present Used to filter out irrelevant sets Key data structure is a single bitmap For a set with n elements,typical bitmap size is 10n Uses multiple independent hash functions With a single hash function h()with range=number of bits in bitmap: For each element s in set S compute h(s)and set bit h(s) To query an element v compute h(v),and check if bit h(v)is set Problem with single hash function:significant chance of false positive due to hash collision 10%chance with 10n bits Database System Concepts-7th Edition 24.2 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 24.2 ©Silberschatz, Korth and Sudarshan th Edition Bloom Filters ▪ A bloom filteris a probabilistic data structure used to check membership of a value in a set • May return true (with low probability) even if an element is not present • But never returns false if an element is present • Used to filter out irrelevant sets ▪ Key data structure is a single bitmap • For a set with n elements, typical bitmap size is 10n ▪ Uses multiple independent hash functions ▪ With a single hash function h() with range=number of bits in bitmap: • For each element s in set S compute h(s) and set bit h(s) • To query an element v compute h(v), and check if bit h(v) is set ▪ Problem with single hash function: significant chance of false positive due to hash collision • 10% chance with 10n bits
Bloom Filters (Cont.) Key idea of Bloom filter:reduce false positives by use multiple hash functions h()for i=1..k For each element s in set S for each i compute hi(s)and set bit hi(s) To query an element v for each i compute hi(v),and check if bit hi(v)is set If bit hi(v)is set for every i then report v as present in set Else report v as absent With 10n bits,and k =7,false positive rate reduces to 1%instead of 10%with k=1 Database System Concepts-7th Edition 24.3 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 24.3 ©Silberschatz, Korth and Sudarshan th Edition Bloom Filters (Cont.) ▪ Key idea of Bloom filter: reduce false positives by use multiple hash functions hi () for i = 1..k • For each element s in set S for each i compute hi (s) and set bit hi (s) • To query an element v for each i compute hi (v), and check if bit hi (v) is set ▪ If bit hi (v) is set for every i then report v as present in set ▪ Else report v as absent • With 10n bits, and k = 7, false positive rate reduces to 1% instead of 10% with k = 1
Write Optimized Indices Performance of B+-trees can be poor for write-intensive workloads One 1/O per leaf,assuming all internal nodes are in memory With magnetic disks,100 inserts per second per disk With flash memory,one page overwrite per insert Two approaches to reducing cost of writes Log-structured merge tree ·Buffer tree Database System Concepts-7th Edition 24.4 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 24.4 ©Silberschatz, Korth and Sudarshan th Edition Write Optimized Indices ▪ Performance of ▪ B + -trees can be poor for write-intensive workloads • One I/O per leaf, assuming all internal nodes are in memory • With magnetic disks, < 100 inserts per second per disk • With flash memory, one page overwrite per insert ▪ Two approaches to reducing cost of writes • Log-structured merge tree • Buffer tree
Log Structured Merge (LSM)Tree Consider only inserts/queries for now Lo Records inserted first into in- Memory memory tree (Lo tree) When in-memory tree is full Disk records moved to disk(L1 tree) B+-tree constructed using bottom-up build by merging existing L1 tree with records from Lo tree When L1 tree exceeds some threshold,merge into L2 tree And so on for more levels 。 Size threshold for Li+1 tree is k times size threshold for Li tree Merge creates a new B+-tree using bottom-up build Database System Concepts-7th Edition 24.5 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 24.5 ©Silberschatz, Korth and Sudarshan th Edition Log Structured Merge (LSM) Tree ▪ Consider only inserts/queries for now ▪ Records inserted first into inmemory tree (L0 tree) ▪ When in-memory tree is full, records moved to disk (L1 tree) • B + -tree constructed using bottom-up build by merging existing L1 tree with records from L0 tree ▪ When L1 tree exceeds some threshold, merge into L2 tree • And so on for more levels • Size threshold for Li+1 tree is k times size threshold for Li tree • Merge creates a new B+ -tree using bottom-up build
LSM Tree (Cont.) Benefits of LSM approach Inserts are done using only sequential l/O operations Leaves are full,avoiding space wastage Reduced number of I/O operations per record inserted as compared to normal B*-tree(up to some size) If each leaf has m entries,m/kentries merged in using 1 IO Total l/O operations:k/m logk(M/M)where /total number of entries,and Mis the size of Lo tree. Drawback of LSM approach Queries have to search multiple trees Entire content of each level copied multiple times Database System Concepts-7th Edition 24.6 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 24.6 ©Silberschatz, Korth and Sudarshan th Edition LSM Tree (Cont.) ▪ Benefits of LSM approach • Inserts are done using only sequential I/O operations • Leaves are full, avoiding space wastage • Reduced number of I/O operations per record inserted as compared to normal B+ -tree (up to some size) ▪ If each leaf has m entries, m/k entries merged in using 1 IO ▪ Total I/O operations: k/m logk (I/M) where I = total number of entries, and M is the size of L0 tree. ▪ Drawback of LSM approach • Queries have to search multiple trees • Entire content of each level copied multiple times