find the difference between two very large list
- by user157195
I have two large list(could be a hundred million items), the source of each list can be either from a database table or a flat file. both lists are of comparable sizes, both unsorted. I need to find the difference between them. so I have 3 scenarios:
1. List1 is a database table(assume each row simply have one item(key) that is a string), List2 is a large file.
2. Both lists are from 2 db tables.
3. both lists are from two files.
in case 2, I plan to use:
select a.item from MyTable a where a.item not in (select b.item form MyTable b)
this clearly is inefficient, is there a better way?
Another approach is:
I plan to sort each list, and then walk down both of them to find the diff. If the list is from a file, I have to read it into a db table first, then use db sorting to output the list. Is the run time complexity still O(nlogn) in db sorting?
either approach is a pain and seems would be very slow when the list involved has hundreds of millions of items. any suggestions?