Looking for Unix tool/script that, given an input path, will compress every batch of uncompressed 100MB text files into a single gzip file
- by newToFlume
I have a dump of thousands of small text files (1-5 MB each), each containing lines of text. I need to "batch" them up so that each batch is of a roughly fixed size - say 100 MB - and then compress each batch.
Now that batch could be:
A single file that is just a 'cat' of the contents of the individual text files, or
Just the individual text files themselves
Caveats:
Unix split -b will not work here, as I need to keep lines of text intact. Using the lines option (split -l) is complicated because there is a large variance in the number of bytes per line.
The batches need not be exactly the requested size, as long as each is within 5% of it.
The lines are critical and must not be lost: I need to confirm that the input made its way to the output without loss. What checksum (something like CRC32, but better/"stronger" in the face of collisions) would be suitable for this?
A script should do nicely, but this seems like a task someone has done before, and it would be nice to see some code (preferably Python or Ruby) that does at least something similar.
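For the "individual files per batch" variant, here is a minimal Python sketch (the function name, file layout, and batch naming are my own assumptions, not an existing tool). It packs whole files into batches of roughly the target uncompressed size, gzips each batch, and returns a SHA-256 digest over all input bytes in order, so you can later decompress the batches, re-hash, and confirm nothing was lost. Because files are never split, lines stay intact by construction.

```python
import gzip
import hashlib
import os

BATCH_SIZE = 100 * 1024 * 1024  # target uncompressed bytes per batch

def batch_and_compress(input_dir, output_dir, batch_size=BATCH_SIZE):
    """Concatenate whole files into ~batch_size gzip archives.

    Returns the SHA-256 of all input bytes (in sorted-filename order),
    so the caller can decompress the batches, re-hash, and confirm
    the input made it to the output without loss.
    """
    os.makedirs(output_dir, exist_ok=True)
    overall = hashlib.sha256()
    batch_num, current = 0, 0
    out = gzip.open(os.path.join(output_dir, f"batch-{batch_num:04d}.gz"), "wb")
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        # Start a new batch if adding this whole file would overshoot
        # the target; with 1-5 MB inputs the overshoot stays small.
        if current and current + size > batch_size:
            out.close()
            batch_num += 1
            current = 0
            out = gzip.open(
                os.path.join(output_dir, f"batch-{batch_num:04d}.gz"), "wb")
        with open(path, "rb") as f:
            data = f.read()  # 1-5 MB files are safe to read whole
        overall.update(data)
        out.write(data)
        current += size
    out.close()
    return overall.hexdigest()
```

To verify, decompress every batch in sorted order, feed the bytes through hashlib.sha256 again, and compare the hex digests. SHA-256 is far more collision-resistant than CRC32, which only addresses the checksum caveat above, not the batching itself.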