Index Compression Web Search and Mining Lecture 6: Index Compression
Index Compression 1 Lecture 6: Index Compression Web Search and Mining
Index Compression Last lecture-index construction Sort-based indexing a Naive in-memory inversion Blocked Sort-Based Indexing Merge sort is effective for disk-based sorting( avoid seeks Single-Pass In-Memory Indexing No global dictionary Generate separate dictionary for each block Dont sort postings Accumulate postings in postings lists as they occur Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
Index Compression 2 Last lecture – index construction ▪ Sort-based indexing ▪ Naïve in-memory inversion ▪ Blocked Sort-Based Indexing ▪ Merge sort is effective for disk-based sorting (avoid seeks!) ▪ Single-Pass In-Memory Indexing ▪ No global dictionary ▪ Generate separate dictionary for each block ▪ Don’t sort postings ▪ Accumulate postings in postings lists as they occur ▪ Distributed indexing using MapReduce ▪ Dynamic indexing: Multiple indices, logarithmic merge
Index Compression This lecture BRUTUS 1241131451731174 CAeSAR 2|4561657132 CALPURNIA→23154101 Collection statistics in more detail (with RCv1) How big will the dictionary and postings be? a Dictionary compression Postings compression
Index Compression 3 This lecture ▪ Collection statistics in more detail (with RCV1) ▪ How big will the dictionary and postings be? ▪ Dictionary compression ▪ Postings compression
Index Compression Why compression in general) Use less disk space Saves a little money Keep more stuff in memory Increases speed Increase speed of data transfer from disk to memory [read compressed data decompress] is faster than [read uncompressed datal Premise: Decompression algorithms are fast True of the decompression algorithms we use
Index Compression 4 Why compression (in general)? ▪ Use less disk space ▪ Saves a little money ▪ Keep more stuff in memory ▪ Increases speed ▪ Increase speed of data transfer from disk to memory ▪ [read compressed data | decompress] is faster than [read uncompressed data] ▪ Premise: Decompression algorithms are fast ▪ True of the decompression algorithms we use
Index Compression Why compression for inverted indexes? Dictionary Make it small enough to keep in main memory Make it so small that you can keep some postings lists in main memory too Posting gs file(s) Reduce disk space needed Decrease time needed to read postings lists from disk Large search engines keep a significant part of the postings in memory. Compression lets you keep more in memory We will devise various IR-specific compression schemes
Index Compression 5 Why compression for inverted indexes? ▪ Dictionary ▪ Make it small enough to keep in main memory ▪ Make it so small that you can keep some postings lists in main memory too ▪ Postings file(s) ▪ Reduce disk space needed ▪ Decrease time needed to read postings lists from disk ▪ Large search engines keep a significant part of the postings in memory. ▪ Compression lets you keep more in memory ▪ We will devise various IR-specific compression schemes