6 Issues in XML Compression
▪ Compression ratios, Compression time, Query Coverage, Memory Usage… (see my survey paper in WWWJ)
▪ Comparison of existing technologies
[Table: Comparison of existing technologies. Columns: Technologies; Compression Ratio (compared with Gzip); Compression Time (compared with Gzip); Memory Usage; Compression Scheme Used; Querying; parser used for compression (SAX or DOM). Legible row labels include XPRESS, XMLZip and DDT; the cell contents did not survive extraction.]
7 An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped 1.6 MB):
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB):
<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www.net.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>
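To get a feel for the inflation described on this slide, the following is a minimal Python sketch that wraps one pipe-delimited log line in the `apache:` elements shown above and reports raw and gzipped sizes. The mapping from field position to element name (`FIELDS`) and the helper names `xmlize` and `sizes` are assumptions made for illustration; the slide does not say how the original conversion was done, and the numbers it quotes come from the real 15.9 MB log, not from this sketch.

```python
import gzip
from xml.sax.saxutils import escape

# Assumed mapping from pipe-delimited field position to element name.
# The element names come from the slide; the positions are a guess
# based on the single sample line shown there.
FIELDS = {
    0: "host",
    1: "requestLine",
    2: "contentType",
    3: "statusCode",
    4: "date",
    6: "byteCount",
    9: "referer",
    10: "userAgent",
}


def xmlize(line: str) -> str:
    """Wrap one log line in <apache:...> elements (hypothetical helper)."""
    parts = line.rstrip("\n").split("|")
    body = "".join(
        f"<apache:{name}>{escape(parts[i])}</apache:{name}>"
        for i, name in FIELDS.items()
    )
    return f"<apache:entry>{body}</apache:entry>"


def sizes(lines):
    """Return raw and gzipped byte counts for the ASCII and XML forms."""
    raw = "\n".join(lines).encode("utf-8")
    xml = "\n".join(xmlize(l) for l in lines).encode("utf-8")
    return {
        "ascii": len(raw),
        "ascii_gz": len(gzip.compress(raw)),
        "xml": len(xml),
        "xml_gz": len(gzip.compress(xml)),
    }
```

Calling `sizes(open("access.log").read().splitlines())` on a log file of your own (the file name here is only a placeholder) gives the four figures to compare.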
8 XMill
▪ First specialized compressor for XML data
• SAX parser for parsing XML data
• Still using gzip as its underlying compressor
• Clever grouping of data into containers for compression
▪ Compress XML via three basic techniques
• Compress the structure separately from the data
• Group the data values according to their types
• Apply semantic (specialized) compressors
▪ Downloadable:
• www.cs.washington.edu/homes/suciu/XMILL
9 XMill Architecture:
[Figure 4: Architecture of the Compressor. Components shown: input file (XML); command line (container expressions); SAX parser; path processor; semantic compressors 1 … k; main memory holding a structure container and data containers 1 … k; output file (compressed XML).]
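The container and semantic-compressor stages in the figure can be illustrated with a small Python sketch. This is not XMill's code or its container-expression syntax: the `SEMANTIC` table, the `Containers` class, and the choice to pack IPv4 addresses into four bytes are all assumptions chosen to show the idea that values are routed by element name to a type-specific encoder, collected per container, and then handed to a general-purpose compressor (here `zlib`).

```python
import ipaddress
import zlib
from collections import defaultdict


def ip_compressor(value: str) -> bytes:
    """Encode a dotted-quad IPv4 address as 4 raw bytes."""
    return ipaddress.IPv4Address(value).packed


def text_compressor(value: str) -> bytes:
    """Fallback: store the text as UTF-8 with a NUL terminator."""
    return value.encode("utf-8") + b"\x00"


# Hypothetical routing table: element name -> semantic compressor.
SEMANTIC = {"apache:host": ip_compressor}


class Containers:
    """One growing byte buffer per element name (a 'data container')."""

    def __init__(self):
        self.data = defaultdict(bytearray)

    def add(self, path: str, value: str) -> None:
        encode = SEMANTIC.get(path, text_compressor)
        self.data[path] += encode(value)

    def flush(self) -> dict:
        """Compress each container independently."""
        return {p: zlib.compress(bytes(buf)) for p, buf in self.data.items()}
```

Because similar values end up next to each other in the same container, the back-end compressor sees more regular input than it would in the interleaved XML text.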
10 How XMill Works: Three Ideas
Compress the structure separately from the data:
Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
Data: 202.239.238.16 GET / HTTP/1.0 text/html 200 …
gzip(Structure) + gzip(Data) = 1.75 MB
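The first idea can be sketched with Python's standard SAX parser. The token scheme below (`<name` for a start tag, `/` for an end tag, `#` for "a data value goes here") is an invented stand-in, not XMill's actual structure encoding, and `Splitter` and `split_and_compress` are hypothetical names. Attributes are ignored to keep the sketch short.

```python
import xml.sax
import zlib


class Splitter(xml.sax.ContentHandler):
    """Collect tags into a structure stream and text into a data stream."""

    def __init__(self):
        super().__init__()
        self.structure = []  # tag tokens plus '#' placeholders
        self.data = []       # text values, in document order
        self._text = []      # buffer; SAX may deliver text in pieces

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text.clear()
        if text:
            self.structure.append("#")
            self.data.append(text)

    def startElement(self, name, attrs):
        self._flush_text()
        self.structure.append("<" + name)

    def endElement(self, name):
        self._flush_text()
        self.structure.append("/")

    def characters(self, content):
        self._text.append(content)


def split_and_compress(xml_text: str):
    """Return the handler plus separately compressed structure and data."""
    handler = Splitter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    structure = " ".join(handler.structure).encode("utf-8")
    data = "\n".join(handler.data).encode("utf-8")
    return handler, zlib.compress(structure), zlib.compress(data)
```

Adding the lengths of the two compressed streams and comparing the sum with `len(zlib.compress(xml_text.encode()))` reproduces, on a small scale, the kind of comparison the slide makes.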