Multi-part gzip file random access (in Java)

Posted by toluju on Stack Overflow
Published on 2009-08-04T01:32:41Z Indexed on 2010/06/03 17:44 UTC

This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.

I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heritrix ARC files. (In case you aren't familiar with multi-part gzip files, the gzip spec allows multiple gzip streams to be concatenated in a single gzip file. They do not share any dictionary information; it is simple binary appending.)

I'm thinking it should be possible to do this by seeking to a certain offset within the file, scanning forward for the gzip magic header bytes (i.e. 0x1f 0x8b, as per RFC 1952), and attempting to read a gzip stream starting at that position. The problem with this approach is that those same bytes can appear inside the actual data as well, so scanning for them can land on an invalid position to start reading a gzip stream from. Is there a better way to handle random access, given that the record offsets aren't known a priori?
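To make the idea concrete, here is a rough sketch of the scan-and-validate approach: look for the magic bytes plus the deflate method byte (0x08), reject candidates whose reserved flag bits are set, and then try to inflate a little from that position to weed out false positives. The class and method names (GzipMemberScanner, findNextMember, inflatesCleanly) and the PROBE_BYTES limit are made up for illustration, and it works on an in-memory byte array for brevity; a real version would presumably read from a RandomAccessFile instead. It only reduces the chance of a false match, since a bogus position could in principle still inflate cleanly for the probed length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipMemberScanner {

    // Hypothetical limit: how many decompressed bytes to read before trusting a candidate.
    static final int PROBE_BYTES = 4096;

    /** Returns the offset of the next plausible gzip member at or after 'from', or -1 if none. */
    static int findNextMember(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + 10 <= data.length; i++) {
            // RFC 1952 header: ID1 = 0x1f, ID2 = 0x8b, CM = 8 (deflate),
            // and the reserved FLG bits 5-7 must be zero.
            if ((data[i] & 0xff) == 0x1f && (data[i + 1] & 0xff) == 0x8b
                    && data[i + 2] == 8 && (data[i + 3] & 0xe0) == 0
                    && inflatesCleanly(data, i)) {
                return i;
            }
        }
        return -1;
    }

    /** Tries to decompress up to PROBE_BYTES from 'offset'; any I/O or format error rejects it. */
    static boolean inflatesCleanly(byte[] data, int offset) {
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            byte[] buf = new byte[PROBE_BYTES];
            int total = 0;
            int n;
            while (total < PROBE_BYTES && (n = in.read(buf, 0, PROBE_BYTES - total)) > 0) {
                total += n;
            }
            return true;
        } catch (IOException e) {
            // ZipException (bad header, bad deflate data, bad CRC) and EOFException land here.
            return false;
        }
    }

    /** Demo helper: compresses a string into a single standalone gzip member. */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = gzip("first record");
        byte[] second = gzip("second record");
        byte[] file = new byte[first.length + second.length];
        System.arraycopy(first, 0, file, 0, first.length);
        System.arraycopy(second, 0, file, first.length, second.length);

        // Start somewhere inside the first member and scan forward to the next one.
        int offset = findNextMember(file, 1);
        System.out.println("next member starts at " + offset
                + " (first member is " + first.length + " bytes long)");
    }
}
```

Probing only a few KB keeps the scan cheap, but the CRC in the gzip trailer is only checked if the whole member fits inside the probe. Reading each candidate member to its end would be a stronger check at the cost of more decompression work per candidate.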

© Stack Overflow or respective owner