find the difference between two very large list

Posted by user157195 on Stack Overflow See other posts from Stack Overflow or by user157195
Published on 2011-02-06T06:34:57Z Indexed on 2011/02/06 7:26 UTC
Read the original article Hit count: 105

Filed under:
|

I have two large list(could be a hundred million items), the source of each list can be either from a database table or a flat file. both lists are of comparable sizes, both unsorted. I need to find the difference between them. so I have 3 scenarios:
1. List1 is a database table(assume each row simply have one item(key) that is a string), List2 is a large file.
2. Both lists are from 2 db tables.
3. both lists are from two files.


in case 2, I plan to use:

select a.item from MyTable a where a.item not in (select b.item form MyTable b)

this clearly is inefficient, is there a better way?

Another approach is:
I plan to sort each list, and then walk down both of them to find the diff. If the list is from a file, I have to read it into a db table first, then use db sorting to output the list. Is the run time complexity still O(nlogn) in db sorting?

either approach is a pain and seems would be very slow when the list involved has hundreds of millions of items. any suggestions?

© Stack Overflow or respective owner

Related posts about database

Related posts about sorting