How to delete duplicate/aggregate rows faster in a file using Java (no DB)

Posted by S. Singh on Stack Overflow
Published on 2012-04-10T15:12:32Z

Filed under: java | Performance | collections | file-io

I have a 2 GB text file with 5 tab-delimited columns. A row counts as a duplicate only if 4 out of its 5 columns match those of another row.

Right now, I am deduplicating by first loading each column into a separate List, then iterating through the lists, deleting duplicate rows as they are encountered and aggregating.

The problem: it is taking more than 20 hours to process one file. I have 25 such files to process.

Can anyone please share their experience of how they would go about this kind of deduplication?

This deduplication will be throwaway code, so I am looking for a quick-and-dirty solution that gets the job done as soon as possible.

Here is my pseudocode (roughly):

for each row i:
    for each row j from i+1 to last_row:
        if (col1[i] matches col1[j]        // find duplicate
            && col2[i] matches col2[j]
            && col3[i] matches col3[j]
            && col4[i] matches col4[j])
        {
            col5List.set(i, col5[i] + col5[j]);  // aggregate into row i
            delete row j                          // drop the duplicate
        }

Duplicate example

Given A=(1,1,1,1,1), B=(1,1,1,1,2) and C=(2,1,1,1,1), A and B are duplicates. The output would be A=(1,1,1,1,1+2) and C=(2,1,1,1,1) [notice that B has been kicked out].
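
For illustration only, here is a minimal sketch of a single-pass alternative to the nested loop. It assumes the four matching columns are the first four, that column 5 is an integer and "aggregate" means summing it, and that the distinct keys fit in heap memory. None of this is confirmed above. The class name DedupeSketch and the method aggregate are hypothetical. Each line is split at its last tab, the first four columns serve as a hash-map key, and column 5 is added to that key's running total, so every row is touched once instead of being compared against all later rows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class DedupeSketch {

    // Reads tab-delimited lines and sums column 5 per distinct (col1..col4) key.
    // Assumption: column 5 is an integer and aggregation means addition.
    static Map<String, Long> aggregate(BufferedReader in) throws IOException {
        // LinkedHashMap keeps keys in first-seen order, so output order follows the input.
        Map<String, Long> totals = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            int cut = line.lastIndexOf('\t');
            if (cut < 0) {
                continue; // skip lines without a tab (blank or malformed)
            }
            String key = line.substring(0, cut);               // col1..col4, tabs included
            long value = Long.parseLong(line.substring(cut + 1).trim()); // col5
            totals.merge(key, value, Long::sum);                // add to the running total
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // The A, B, C rows from the duplicate example, used as in-memory sample input.
        String sample = "1\t1\t1\t1\t1\n" + "1\t1\t1\t1\t2\n" + "2\t1\t1\t1\t1\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            for (Map.Entry<String, Long> e : aggregate(in).entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

For a real file, the StringReader could be replaced with a reader over the file, for example one obtained from java.nio.file.Files.newBufferedReader. If the number of distinct keys is too large for the heap, this approach would not work as written, and sorting the file first and then merging adjacent rows may be a better fit.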
