Looking for Unix tool/script that, given an input path, will compress every batch of uncompressed 100MB text files into a single gzip file
Posted by newToFlume on Super User
Published on 2012-07-03T21:24:27Z
I have a dump of thousands of small text files (1-5MB each), each containing lines of text. I need to "batch" them up so that each batch is of a fixed size, say 100MB, and then compress that batch.
Now that batch could be:
- A single file that is just a 'cat' of the contents of the individual text files, or
- Just the individual text files themselves
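To make the first option concrete, here is a minimal sketch of the kind of batching I have in mind, not a tested solution. The names plan_batches and compress_batches, the output naming scheme, and the assumption that every input file ends with a newline (so that concatenation cannot merge two lines) are all mine. Sizes are measured before compression.

```python
import gzip
import shutil
from pathlib import Path


def plan_batches(paths, target_bytes):
    """Group paths into consecutive batches whose total size stays <= target_bytes."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = path.stat().st_size
        # Close the current batch if this file would push it over the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def compress_batches(input_dir, output_dir, target_bytes=100 * 1024 * 1024):
    """Write each batch as one gzip file holding the concatenated inputs."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    outputs = []
    for index, batch in enumerate(plan_batches(files, target_bytes)):
        out_path = output_dir / f"batch_{index:05d}.txt.gz"
        with gzip.open(out_path, "wb") as out:
            for path in batch:
                # Whole files are copied, so no line is ever split across batches.
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs.append(out_path)
    return outputs
```

Because only whole files are added, a batch closes somewhere between (target minus the largest file) and the target. With 1-5MB files and a 100MB target that should keep every batch except the last within 5% of the requested size. A single file larger than the target would end up in a batch of its own.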
Caveats:
- Unix split -b will not work here, as I need to keep lines of text intact. Using the lines option (split -l) is a bit complicated, as there is a large variance in the number of bytes in each line.
- The files need not be strictly a fixed size, as long as each one is within 5% of the requested size.
- The lines are critical and must not be lost: I need to confirm that the input made its way to the output without loss. What rolling checksum (something like CRC32, but better/"stronger" in the face of collisions) would be suitable for this?
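For the last caveat, one way the check could look is sketched below. It uses SHA-256 from Python's hashlib, which is a cryptographic hash rather than a rolling checksum, so treat it as a stand-in for whatever stronger-than-CRC32 digest is recommended. The helper names digest_stream, read_chunks and verify are hypothetical, and the sketch assumes the batches are plain concatenations of the inputs in a known order.

```python
import gzip
import hashlib


def digest_stream(chunks):
    """Return (SHA-256 hex digest, newline count) over an iterable of byte chunks."""
    sha = hashlib.sha256()
    lines = 0
    for chunk in chunks:
        sha.update(chunk)
        lines += chunk.count(b"\n")
    return sha.hexdigest(), lines


def read_chunks(paths, opener=open, chunk_size=1 << 20):
    """Yield the contents of paths, in order, as byte chunks."""
    for path in paths:
        with opener(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                yield chunk


def verify(input_paths, gzip_paths):
    """True when the decompressed batches equal the concatenated inputs."""
    original = digest_stream(read_chunks(input_paths))
    restored = digest_stream(read_chunks(gzip_paths, opener=gzip.open))
    return original == restored
```

The digest is computed over the byte stream, so chunk boundaries do not matter, and the newline count gives a cheap second check that no lines went missing. Note that gzip already stores a CRC32 of the uncompressed data in its trailer, so this only adds value as an end-to-end comparison against the original inputs.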
A script should do nicely, but this seems like a task someone has done before, and it would be nice to see some code (preferably Python or Ruby) that does at least something similar.
© Super User or respective owner