I have a huge gzipped text file which I need to read, line by line. I go with the following:
for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
print i, line
At some point late in the file, the python output diverges from the file. This is because lines are getting broken due to weird special characters that python thinks are newlines. When I open the file in 'vim', they are correct, but the suspect characters are formatted weirdly. Is there something I can do to fix this?
I've tried other codecs including utf-16, latin-1. I've also tried with no codec.
I looked at the file using 'od'. Sure enough, there are \n characters where they shouldn't be. But, the "wrong" ones are prepended by a weird character. I think there's some encoding here with some characters being 2-bytes, but the trailing byte being a \n if not viewed properly.
If I replace:
gzip.open('file.gz')
With:
os.popen('zcat file.gz')
It works fine (and actually, quite faster). But, I'd like to know where I'm going wrong.