How XMill Works: Three Ideas
Group the data values according to their types:
    <apache:entry> . . . </apache:entry>      →  gzip Structure
    202.23.23.16, 224.42.24.55, …             →  gzip Data1
    GET / HTTP/1.0, GET / HTTP/1.1, …         →  gzip Data2
gzip Structure + gzip Data1 + gzip Data2 = 1.33MB
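The grouping idea can be sketched as follows. This is a minimal illustration, not XMill's actual container format: the `<host>` and `<request>` element names, the newline-separated container layout, and all function names are assumptions made for the example, and `zlib` stands in for gzip.

```python
import zlib

def to_xml(records):
    # Hypothetical XML layout for the log entries; only <apache:entry>
    # itself appears on the slide.
    return "".join(
        f"<apache:entry><host>{h}</host><request>{r}</request></apache:entry>"
        for h, r in records
    )

def split_containers(records):
    # Structure container: the tag skeleton, one line per entry.
    structure = "<apache:entry><host/><request/></apache:entry>\n" * len(records)
    # Data containers: values of the same type are stored together.
    data1 = "\n".join(h for h, _ in records)
    data2 = "\n".join(r for _, r in records)
    return structure, data1, data2

def compress_grouped(records):
    # Each container is compressed on its own (zlib in place of gzip).
    return [zlib.compress(part.encode("utf-8")) for part in split_containers(records)]

def decompress_grouped(blobs):
    structure, data1, data2 = (zlib.decompress(b).decode("utf-8") for b in blobs)
    hosts = data1.split("\n") if data1 else []
    requests = data2.split("\n") if data2 else []
    # The structure container tells us how many entries to expect.
    if structure.count("\n") != len(hosts) or len(hosts) != len(requests):
        raise ValueError("containers are out of sync")
    return list(zip(hosts, requests))
```

Comparing `len(zlib.compress(to_xml(records).encode("utf-8")))` with `sum(len(b) for b in compress_grouped(records))` on a real log shows the effect of grouping; how large the gain is depends on the data.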
How XMill Works: Three Ideas
Apply semantic (specialized) compressors:
gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82MB
Examples:
• 8, 16, 32-bit integer encoding (signed/unsigned)
• differential compressing (e.g. 1999, 1995, 2001, 2000, 1995, ...)
• compress lists, records (e.g. 104.32.23.1 → 4 bytes)
Need user input to select the semantic compressor
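Three of the semantic compressors listed above can be sketched in a few lines. The function names are hypothetical and the byte layouts are assumptions chosen for illustration; XMill's own encodings may differ.

```python
import struct

def encode_ipv4(addr):
    # Record compressor: a dotted-quad string becomes 4 raw bytes.
    parts = [int(p) for p in addr.split(".")]
    if len(parts) != 4:
        raise ValueError(f"not a dotted-quad address: {addr!r}")
    return bytes(parts)  # raises ValueError if a part is outside 0..255

def decode_ipv4(raw):
    return ".".join(str(b) for b in raw)

def delta_encode(values):
    # Differential compressor: store each value as the difference
    # from its predecessor (the first value is kept as-is).
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack_int16(values):
    # Integer compressor: signed 16-bit little-endian, 2 bytes per value.
    return struct.pack(f"<{len(values)}h", *values)
```

For the years on the slide, `delta_encode([1999, 1995, 2001, 2000, 1995])` gives `[1999, -4, 6, -1, -5]`: small numbers that a general-purpose compressor handles better than the original values.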
Path Processor: structure container
▪ Replace data value with container number (negative integer)
▪ Replace end tag with 0
▪ Replace tags/attributes with positive integer

Original:
    <Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book>
Data values replaced with container numbers:
    <Book><Title lang=-1>-2</Title> <Author>-3</Author> <Author>-3</Author> </Book>
End tags replaced with 0:
    <Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0
Tags/attributes replaced with positive integers (Book = 1, Title = 2, @lang = 3, Author = 4):
    1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0

Less storage: 14 bytes!
Dictionary: one more entry for each new word
Repeated structure entries can be compressed effectively!
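The three replacement rules can be sketched as a small tokenizer. This is an illustrative reading of the slide, not XMill's path processor: it assumes one container per element or attribute path, numbers dictionary entries and containers in order of first appearance, and ignores mixed content (text after a child element). All names are hypothetical.

```python
import xml.etree.ElementTree as ET

def tokenize(xml_text):
    tags = {}        # tag or "@attribute" name -> positive integer
    containers = {}  # path -> container number (emitted as a negative token)
    data = {}        # container number -> list of data values
    tokens = []      # the structure container

    def tag_id(name):
        # One more dictionary entry for each new name.
        return tags.setdefault(name, len(tags) + 1)

    def store(path, value):
        cid = containers.setdefault(path, len(containers) + 1)
        data.setdefault(cid, []).append(value)
        tokens.append(-cid)

    def walk(elem, path):
        path = path + "/" + elem.tag
        tokens.append(tag_id(elem.tag))
        for name, value in elem.attrib.items():
            tokens.append(tag_id("@" + name))
            store(path + "/@" + name, value)
            tokens.append(0)  # end of attribute
        text = (elem.text or "").strip()
        if text:
            store(path, text)
        for child in elem:
            walk(child, path)
        tokens.append(0)      # end tag

    walk(ET.fromstring(xml_text), "")
    return tokens, tags, data

book = ('<Book><Title lang="English">Data Compression</Title>'
        '<Author>Gray</Author><Author>Reiter</Author></Book>')
tokens, tags, data = tokenize(book)
print(tokens)  # [1, 2, 3, -1, 0, -2, 0, 4, -3, 0, 4, -3, 0, 0]
```

The 14 tokens match the sequence on the slide, and both author names land in the same data container, which is what lets values of one type be compressed together.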
XML Compression
XMill Evaluation using XML datasets
[Figure: bar chart comparing gzip with XMill on the Weblog, SwissProt, Treebank, and DBLP datasets]
Queriable Compressors
• XQzip: queriable XML compressor (our work [EDBT04])
• Existing XML compressors (survey in [WWWJ05]):
  ▪ Unqueriable (e.g. XMill [SIGMOD00]): exploit data commonalities → better compression rate than gzip
  ▪ Queriable (e.g. XGrind [ICDE02], XPRESS [SIGMOD03], XQueC, XQzip [EDBT04], XCQ [KAISJ05]): compress data individually → inadequate compression rate and time
• Features of XQzip:
  ▪ Use the SIT to aid query evaluation
  ▪ Block-compression: allow data commonalities to be exploited and used as buffers to reduce decompression overhead
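The block-compression idea can be illustrated with a generic sketch. This is not XQzip's implementation: the `BlockStore` class, its parameters, the newline-separated block layout, and the LRU buffer policy are all assumptions, `zlib` stands in for the real compressor, and the SIT is not modelled. Values must not contain newlines.

```python
import zlib
from collections import OrderedDict

class BlockStore:
    """Values are compressed in fixed-size blocks; recently
    decompressed blocks are kept in a small LRU buffer."""

    def __init__(self, values, block_size=4, buffer_blocks=2):
        self.block_size = block_size
        self.buffer_blocks = buffer_blocks
        # Compressing several values together lets the compressor
        # exploit commonalities across them.
        self.blocks = [
            zlib.compress("\n".join(values[i:i + block_size]).encode("utf-8"))
            for i in range(0, len(values), block_size)
        ]
        self.buffer = OrderedDict()  # block number -> decompressed values
        self.decompressions = 0      # counts actual decompression work

    def _load(self, block_no):
        if block_no in self.buffer:
            self.buffer.move_to_end(block_no)  # mark as recently used
            return self.buffer[block_no]
        values = zlib.decompress(self.blocks[block_no]).decode("utf-8").split("\n")
        self.decompressions += 1
        self.buffer[block_no] = values
        if len(self.buffer) > self.buffer_blocks:
            self.buffer.popitem(last=False)    # evict least recently used
        return values

    def get(self, index):
        # Only the block holding the requested value is decompressed.
        block_no, offset = divmod(index, self.block_size)
        return self._load(block_no)[offset]
```

A query that touches neighbouring values pays for one decompression per block rather than one per value, and repeated access to a buffered block costs nothing extra.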