How to search for duplicate values in a huge text file having around Half Million records

Posted by Shibu on Stack Overflow See other posts from Stack Overflow or by Shibu
Published on 2010-04-08T05:10:36Z Indexed on 2010/04/08 5:13 UTC
Read the original article Hit count: 246

Filed under:

java

|

text

|

file

|

processing

I have an input txt file which has data in the form of records (each row is a record and represents more or less like a DB table) and I need to find for duplicate values. For example:

Rec1: ACCOUNT_NBR_1*NAME_1*VALUE_1 Rec2: ACCOUNT_NBR_2*NAME_2*VALUE_2 Rec3: ACCOUNT_NBR_1*NAME_3*VALUE_3

In the above set, the Rec1 and Rec2 are considered to be duplicates as the ACCOUNT NUMBERS are same(ACCOUNT_NBR1).

Note: The input file shown above is a delimiter type file (the delimiter being *) however the file type can also be a fixed length file where each column starts and ends with a specified positions.

I am currently doing this with the following logic:

Loop thru each ACCOUNT NUMBER
  Loop thru each line of the txt file and record and check if this is repeated.
  If repeated record the same in a hashtable.
  End 
End

And I am using 'Pattern' & 'BufferedReader' java API's to perform the above task.

But since it is taking a long time, I would like to know a better way of handling it.

Thanks, Shibu

© Stack Overflow or respective owner

Related posts about java

Tomcat 6: Access Control Exception?

as seen on Server Fault - Search for 'Server Fault'
I'm trying to setup a tomcat6 server, and I'm trying to match another setup someone else established. However, my deployment (default Ubuntu install) uses a policy.d/ directory structure, and the established server just uses a catalina.policy file. I've tried setting every entry in policy.d to match… >>> More
Problem in creation MDB Queue connection at Jboss StartUp

as seen on Stack Overflow - Search for 'Stack Overflow'
I am not able to create a Queue connection in JBOSS4.2.3GA Version & Java1.5, as I am using MDB as per the below details. I am putting this MDB in a jar file(named utsJar.jar) and copied it in deploy folder of JBOSS, In the test env. this MDB works well but in another env. [ env settings and… >>> More
failing to establish connection between Postgres db and gwt

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I am using Postgres and gwt 2.0 for one of my applications. I am facing problem connecting to the database. When I try to connect it gives "ClassNotFoundException". Here is what I get when I try to connect to database: java.lang.ClassNotFoundException: org.postgresql.Driver at java.net… >>> More
failing to establish connection between postgre db and gwt

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, For i am using postgre and gwt 2.0 for one of my applications. I am facing problem connecting to the database. When i try to connect it gives "ClassNotFoundException". Here is what i get when i try to connect to database: java.lang.ClassNotFoundException: org.postgresql.Driver at java.net… >>> More
Migration and deployement problems JBoss 4.2.2.GA to JBoss 6.0.0.M2

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I'm trying to migrate an application running on JBoss 4.2.2.GA to JBoss 6.0.0.M2 I give you some log to explain my problem : boot.log : 2010-03-16 09:59:29,406 ERROR [org.jboss.system.server.profileservice.ProfileServiceBootstrap] (Thread-2) Failed to load profile: Summary of incomplete deployments… >>> More

Related posts about text

Error in running script [closed]

as seen on Programmers - Search for 'Programmers'
I'm trying to run heathusf_v1.1.0.tar.gz found here I installed tcsh to make build_heathusf work. But, when I run ./build_heathusf, I get the following (I'm running that on a Fedora Linux system from Terminal): $ ./build_heathusf Compiling programs to build a library of image processing functions… >>> More
Coloring even heighten columns

as seen on Stack Overflow - Search for 'Stack Overflow'
I try to set different a background colors for left and right columns and to maintain the same height. So I set a background color for outer wrapper ("container" div) so it will set a color to rightBar. But this didn't work. Online Demo I want it to work on all browsers. Markup: <!DOCTYPE… >>> More
HTML: How to create a DIV with only vertical scroll-bar to show long paragraphs on a webpage?

as seen on Stack Overflow - Search for 'Stack Overflow'
I want to show terms and condition note on my website. I dont want to use text field and also dont want to use my whole page. I just want to display my text in selected area and want to use only vertical scroll-bar to go down and read all text. Currently I am using this code: <div style="width:10;height:10;overflow:scroll"… >>> More
Qt Linking Error.

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I configure qt-x11 with following options ./configure -prefix /iTalk/qtx11 -prefix-install -bindir /iTalk/qtx11-install/bin -libdir /iTalk/qtx11-install/lib -docdir /iTalk/qtx11-install/doc -headerdir /iTalk/qtx11-install/include -datadir /iTalk/qtx11-install/data -examplesdir /iTalk/qtx11-install/examples… >>> More
XSLT Escape Character not working

as seen on Stack Overflow - Search for 'Stack Overflow'
I am trying to use escape charaters in my text output, as i would like too surround the output in emailData tags. I am using <xsl:text><emailData></xsl:text> In the XSLT to esnure that this works however because i am using a tool called Cast Iron for some reason it… >>> More