How to search for duplicate values in a huge text file having around Half Million records
Posted by Shibu on Stack Overflow
Published on 2010-04-08T05:10:36Z
I have an input txt file with data in the form of records (each row is a record, much like a row in a DB table), and I need to find duplicate values. For example:
Rec1: ACCOUNT_NBR_1*NAME_1*VALUE_1
Rec2: ACCOUNT_NBR_2*NAME_2*VALUE_2
Rec3: ACCOUNT_NBR_1*NAME_3*VALUE_3
In the above set, Rec1 and Rec3 are considered duplicates because their account numbers are the same (ACCOUNT_NBR_1).
Note: The input file shown above is a delimited file (the delimiter being *), but the file can also be a fixed-length file, where each column starts and ends at specified positions.
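For illustration, a tiny helper along these lines could pull out the key column; the assumption that the account number is the first field, and the substring positions used for the fixed-length case, are mine and would need to match the real layout:

// Hypothetical helper: extract the account number from one record line.
// Assumes the account number is the first column; the substring positions
// for the fixed-length case are illustrative only.
static String accountNumber(String line, boolean fixedLength) {
    if (fixedLength) {
        return line.substring(0, 13).trim(); // example start/end positions
    }
    return line.split("\\*")[0];             // "*" is the delimiter
}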
I am currently doing this with the following logic:
Loop thru each ACCOUNT NUMBER
    Loop thru each line of the txt file and check whether this account number is repeated
    If repeated, record it in a hashtable
    End
End
I am using the 'Pattern' and 'BufferedReader' Java APIs to perform the above task.
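For reference, the logic above looks roughly like this (the file name, the choice of the first column as the account number, and the counting details are my assumptions; the inner loop rescans all records for every account number, which is where the time goes):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Rough sketch of the current approach described above.
public class DuplicateFinder {
    public static void main(String[] args) throws IOException {
        // Read every record's account number (first "*"-delimited column).
        List<String> accounts = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("records.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                accounts.add(line.split("\\*")[0]);
            }
        }

        Hashtable<String, Integer> duplicates = new Hashtable<>();
        for (String account : accounts) {           // outer loop: each account number
            int count = 0;
            for (String other : accounts) {         // inner loop: every record again
                if (other.equals(account)) {
                    count++;
                }
            }
            if (count > 1) {
                duplicates.put(account, count);     // record the repeated account number
            }
        }
        System.out.println("Duplicate account numbers: " + duplicates);
    }
}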
But since this takes a long time for around half a million records, I would like to know a better way of handling it.
Thanks, Shibu