How to delete duplicate/aggregate rows faster in a file using Java (no DB)
Posted
by
S. Singh
on Stack Overflow
See other posts from Stack Overflow
or by S. Singh
Published on 2012-04-10T15:12:32Z
Indexed on
2012/04/10
17:29 UTC
Read the original article
Hit count: 216
I have a 2GB big text file, it has 5 columns delimited by tab. A row will be called duplicate only if 4 out of 5 columns matches.
Right now, I am doing dduping by first loading each coloumn in separate List , then iterating through lists, deleting the duplicate rows as it encountered and aggregating.
The problem: it is taking more than 20 hours to process one file. I have 25 such files to process.
Can anyone please share their experience, how they would go about doing such dduping?
This dduping will be a throw away code. So, I was looking for some quick/dirty solution, to get job done as soon as possible.
Here is my pseudo code (roughly)
Iterate over the rows
i=current_row_no.
Iterate over the row no. i+1 to last_row
if(col1 matches //find duplicate
&& col2 matches
&& col3 matches
&& col4 matches)
{
col5List.set(i,get col5); //aggregate
}
Duplicate example
A and B will be duplicate A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1) and output would be A=(1,1,1,1,1+2) C=(2,1,1,1,1) [notice that B has been kicked out]
© Stack Overflow or respective owner