Python halts while iteratively processing my 1GB csv file

Posted by Dan on Stack Overflow
Published on 2010-01-06T01:39:33Z

I have two files:

  1. metadata.csv: contains an ID, followed by a vendor name, a filename, etc.
  2. hashes.csv: contains an ID, followed by a hash.

The ID is essentially a foreign key of sorts, relating file metadata to its hash.

I wrote this script to quickly extract all hashes associated with a particular vendor. It craps out before it finishes processing hashes.csv:

import csv

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if its one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

This script gets about 2000 entries through hashes.csv before it halts. What am I doing wrong? I thought I was processing it line by line.

P.S. The csv files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets: http://www.nsrl.nist.gov/Downloads.htm#converter

UPDATE: working solution below. Thanks everyone who commented!

entries = csv.reader(open(options.entries, "rb"))           
stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)
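For reference, here is one possible Python 3 sketch of the same approach. It uses a set rather than a dict for the ID lookup (only membership matters, so the dummy `1` values are unnecessary), and it streams hashes.csv row by row instead of building the full `matches` dict in memory. The column indices and the `options`-style arguments are assumptions carried over from the original script.

```python
import csv


def print_vendor_hashes(entries_path, hashes_path, vendor):
    # Build a set of IDs for the given vendor; set membership is O(1),
    # unlike the O(n) scan of the original stored_ids list.
    with open(entries_path, newline="") as f:
        stored_ids = {row[0] for row in csv.reader(f) if row[2] == vendor}

    # Stream the large file one row at a time; csv.reader handles
    # quoted fields, which a naive line.split(",") does not.
    with open(hashes_path, newline="") as f:
        for row in csv.reader(f):
            if row[0] in stored_ids:
                print("%s,%s" % (row[2], row[4]))
```

This also preserves the order of hashes.csv in the output, whereas iterating over a dict of matches does not guarantee any particular order on Python 2.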
