Converting large files in python
- by Cenoc
I have a few files that are ~64GB in size that I think I would like to convert to hdf5 format. I was wondering what the best approach for doing so would be? Reading line-by-line seems to take more than 4 hours, so I was thinking of using multiprocessing in sequence, but was hoping for some direction on what would be the most efficient way without resorting to hadoop. Any help would be very much appreciated. (and thank you in advance)
EDIT:
Right now I'm just doing a for line in fd: approach. After that right now I just check to make sure I'm picking out the right sort of data, which is very short; I'm not writing anywhere, and it's taking around 4 hours to complete with that. I can't read blocks of data because the blocks in this weird file format I'm reading are not standard, it switches between three different sizes... and you can only tell which by reading the first few characters of the block.