Looking for a Unix tool/script that, given an input path, will compress every 100MB batch of uncompressed text files into a single gzip file

Posted by newToFlume on Super User
Published on 2012-07-03T21:24:27Z Indexed on 2012/07/04 3:18 UTC


I have a dump of thousands of small text files (1-5MB each), each containing lines of text. I need to "batch" them up so that each batch is of a fixed size, say 100MB, and then compress that batch.

Now that batch could be:

  1. A single file that is just a 'cat' of the contents of the individual text files, or
  2. Just the individual text files themselves
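The two batch layouts above could be sketched in Python roughly as follows. This is only an illustration, not a tested tool: the function names `write_concat_gz` and `write_tar_gz` are made up for this example, and it assumes the list of files for one batch has already been chosen.

```python
import gzip
import os
import shutil
import tarfile


def write_concat_gz(paths, out_path):
    """Option 1: one gzip stream holding the 'cat' of all input files."""
    with gzip.open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Stream in chunks so large inputs are never fully in memory.
                shutil.copyfileobj(src, out)


def write_tar_gz(paths, out_path):
    """Option 2: a gzip-compressed tar that keeps each file separate."""
    with tarfile.open(out_path, "w:gz") as tar:
        for path in paths:
            # Store only the base name (assumed unique within a batch).
            tar.add(path, arcname=os.path.basename(path))
```

One thing to watch with option 1: if an input file does not end with a newline, its last line will be joined to the first line of the next file, exactly as with a plain 'cat'.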

Caveats:

  1. Unix split -b will not work here, as I need to keep lines of text intact. Using the lines option (-l) is a bit complicated, as there is a large variance in the number of bytes in each line.
  2. The output files need not be exactly the requested size, as long as each is within 5% of it.
  3. The lines are critical and must not be lost: I need to confirm that the input made its way to the output without loss. What rolling checksum should I use for this (something like CRC32, but better/"stronger" in the face of collisions)?
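A rough Python sketch of how these caveats might be handled: group whole files (so no line is ever split) into batches close to the target size, and compare a digest of the inputs with a digest of the decompressed output. The names `plan_batches`, `sha256_of_files` and `sha256_of_gzip` are hypothetical, SHA-256 is just one possible choice of a stronger-than-CRC32 hash, and the verification assumes the batch was written as a single gzip of the concatenated files.

```python
import gzip
import hashlib


def plan_batches(sizes, target, tolerance=0.05):
    """Group (name, size) pairs, in order, into batches near `target` bytes.

    A batch is closed as soon as it reaches target * (1 - tolerance).
    If the next file would push it past target * (1 + tolerance), the
    batch is closed early instead. The last batch may be smaller.
    """
    low = target * (1 - tolerance)
    high = target * (1 + tolerance)
    batches, current, total = [], [], 0
    for name, size in sizes:
        if current and total + size > high:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += size
        if total >= low:
            batches.append(current)
            current, total = [], 0
    if current:
        batches.append(current)
    return batches


def _update_from(stream, digest, chunk_size=1 << 20):
    # Feed a binary stream into the digest one chunk at a time.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)


def sha256_of_files(paths):
    """Digest of the concatenated raw bytes of the input files."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as f:
            _update_from(f, digest)
    return digest.hexdigest()


def sha256_of_gzip(path):
    """Digest of the decompressed contents of a gzip file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as f:
        _update_from(f, digest)
    return digest.hexdigest()
```

With 1-5MB inputs and a 100MB target, the window from 95MB to 105MB is wider than any single file, so every batch except possibly the last should land inside the 5% tolerance. A batch would be accepted only if `sha256_of_files(batch)` equals `sha256_of_gzip(output)`.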

A script should do nicely, but this seems like a task someone has done before, and it would be nice to see some code (preferably Python or Ruby) that does at least something similar.

© Super User or respective owner
