Searching for duplicate records within a text file where the duplicate is determined by only two fie
Posted
by plg
on Stack Overflow
See other posts from Stack Overflow
or by plg
Published on 2010-06-07T18:09:28Z
Indexed on
2010/06/07
18:12 UTC
Read the original article
Hit count: 201
python
First, Python Newbie; be patient/kind.
Next, once a month I receive a large text file (think 7 Million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is dupliacted, I compress all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straight forward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match.
So far, my code looks like this:
Compress Full Part to a Compressed Part
and Check for Duplicates on Supplier Code
and Compressed Part combination
import sys import re import time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
start=time.time()
try: file1 = open("C:\Accounting\May Accounting\May.txt", "r") except IOError: print >> sys.stderr, "Cannot Open Read File" sys.exit(1) try: file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a") except IOError: print >> sys.stderr, "Cannot Open Write File" sys.exit(1)
hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR" file2.write(hdrList+chr(10)) lines_seen=set() affirm="YES"
records = file1.readlines() for record in records: fields = record.split(chr(124)) if fields[0]=="CIGSupplier": continue #If incoming file has a header line, skip it file2.write(fields[0]+"|"), #Supplier Code file2.write(fields[1]+"|"), #Full_Part file2.write(fields[2]+"|"), #Part Status file2.write(fields[3]+"|"), #Alias Flag file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), #Acquisition Flag file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1]) if dupechk not in lines_seen: file2.write(chr(10)) lines_seen.add(dupechk) else: file2.write(affirm+chr(10))
print "it took", time.time() - start, "seconds."
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
file2.close() file1.close()
It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import the results into Access and do a self join to locate the duplicates. Loading/querying/exporting results in Access a file this size takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file.
Confusing enough?
Thanks.
© Stack Overflow or respective owner