Efficient and accurate way to compact and compare Python lists?
Posted by daveslab on Stack Overflow, 2010-06-08
Hi folks,
I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:
import csv

# Collect a hash of each row in the old file.
hashes = []
for row in csv.reader(open('old.csv', 'rb')):
    hashes.append(hash(str(row)))

# Flag rows in the new file whose hash was never seen in the old file.
for row in csv.reader(open('new.csv', 'rb')):
    if hash(str(row)) not in hashes:
        print 'Not found'
But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, so I went with hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compact Python lists so that they can be compared for simple equality with other lists? I.e., a hashing scheme that actually works? Bonus points: why didn't the above method work?
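For anyone experimenting with the same idea, here is a minimal sketch of one variation (an illustration, not the asker's code): it stores the hashes in a set, so each membership test is O(1) instead of a linear scan over a list, and it hashes a tuple of the fields directly rather than detouring through str(). Note that equal hashes do not guarantee equal rows, so a real diff would still need to verify apparent matches against the source data.

import csv

# Build a set of row hashes from the old file; set membership tests are O(1).
old_hashes = set()
for row in csv.reader(open('old.csv', 'rb')):
    old_hashes.add(hash(tuple(row)))  # tuples are hashable, lists are not

# Report rows in the new file whose hash never appeared in the old file.
for row in csv.reader(open('new.csv', 'rb')):
    if hash(tuple(row)) not in old_hashes:
        print 'Not found:', row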