How to search for duplicate values in a huge text file with around half a million records

Posted by Shibu on Stack Overflow
Published on 2010-04-08T05:10:36Z Indexed on 2010/04/08 5:13 UTC


I have an input text file containing records (each row is a record, much like a row in a DB table), and I need to find duplicate values. For example:

Rec1: ACCOUNT_NBR_1*NAME_1*VALUE_1
Rec2: ACCOUNT_NBR_2*NAME_2*VALUE_2
Rec3: ACCOUNT_NBR_1*NAME_3*VALUE_3

In the above set, Rec1 and Rec3 are considered duplicates because their account numbers are the same (ACCOUNT_NBR_1).

Note: The input file shown above is a delimited file (the delimiter is *), but the file can also be a fixed-length file, where each column starts and ends at a specified position.
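To make the two layouts concrete, here is a minimal sketch of pulling the account number out of a record in either format. The class and method names are hypothetical, and it assumes the account number is the first field of a delimited record; the column positions for the fixed-length case are whatever the file layout specifies.

```java
// Hypothetical helper, not part of any library.
class AccountKeys {

    // Delimited record: assumes the account number is the first '*'-separated field.
    // indexOf avoids regex handling ('*' is a regex metacharacter and would need quoting).
    static String fromDelimited(String line) {
        int end = line.indexOf('*');
        return end < 0 ? line : line.substring(0, end);
    }

    // Fixed-length record: the account number occupies columns [start, end), 0-based.
    // Short lines are tolerated rather than throwing.
    static String fromFixedWidth(String line, int start, int end) {
        if (line.length() <= start) {
            return "";
        }
        return line.substring(start, Math.min(end, line.length())).trim();
    }
}
```

For Rec1 above, fromDelimited yields ACCOUNT_NBR_1.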

I am currently doing this with the following logic:

Loop through each ACCOUNT NUMBER
  Loop through each line of the text file and check whether the account number is repeated.
  If it is repeated, record it in a hashtable.
  End
End

I am using the Java 'Pattern' and 'BufferedReader' APIs to perform the above task.

Since this takes a long time, I would like to know a better way of handling it.
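For comparison with the nested loop, one possible single-pass alternative is sketched below: read each line once, group line numbers by account number in a map, then keep only the accounts seen more than once. This is a sketch under assumptions, not a definitive answer: the class name is hypothetical, it assumes the delimited layout with the account number as the first field, UTF-8 input, and that all distinct account numbers fit in memory (plausible for about half a million records).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; a sketch of single-pass duplicate detection.
class DuplicateFinder {

    // Returns account number -> 1-based line numbers, only for accounts
    // that appear on more than one line. Each line is examined exactly once.
    static Map<String, List<Integer>> findDuplicates(Iterator<String> lines) {
        Map<String, List<Integer>> linesByAccount = new LinkedHashMap<>();
        int lineNumber = 0;
        while (lines.hasNext()) {
            String line = lines.next();
            lineNumber++;
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Assumption: the account number is the first '*'-delimited field.
            int end = line.indexOf('*');
            String account = end < 0 ? line : line.substring(0, end);
            linesByAccount.computeIfAbsent(account, k -> new ArrayList<>()).add(lineNumber);
        }
        // Drop accounts that occurred only once.
        linesByAccount.values().removeIf(found -> found.size() < 2);
        return linesByAccount;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DuplicateFinder <input-file>");
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            findDuplicates(reader.lines().iterator())
                .forEach((account, found) -> System.out.println(account + " -> lines " + found));
        }
    }
}
```

With the three example records as input, the map contains a single entry, ACCOUNT_NBR_1, with lines 1 and 3. The work grows linearly with the number of records instead of with records times account numbers. For a fixed-length file, only the line that extracts the account number would change.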

Thanks, Shibu

