What is the fastest way to find duplicates in multiple BIG txt files?
Posted by user2950750 on Stack Overflow
Published on 2013-11-03T21:21:43Z
Indexed on 2013/11/03 21:54 UTC
I am really in deep water here and I need a lifeline.
I have 10 txt files. Each file has up to 100,000,000 lines of data. Each line is a single number of up to 9 digits that represents something else.
I need to (somehow) scan these 10 files and find the numbers that appear in all 10 files.
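Because the numbers have at most 9 digits, one possible approach is a bitmap: one bit per possible value, which is roughly 125 MB per file at the full range. The sketch below is only an illustration of that idea under some assumptions that are not stated above: one non-negative integer per line, and the hypothetical names `bitmap_from_file` and `common_numbers_bitmap`. It says nothing about whether 2 seconds is reachable; pure Python at this scale would certainly be slower than that.

```python
def bitmap_from_file(path, max_value):
    """Build a bitmap with bit n set for every number n found in the file."""
    bitmap = bytearray(max_value // 8 + 1)
    with open(path) as f:
        for line in f:  # the file is read one line at a time
            line = line.strip()
            if line:
                n = int(line)  # assumes 0 <= n <= max_value
                bitmap[n >> 3] |= 1 << (n & 7)
    return bitmap


def common_numbers_bitmap(paths, max_value=999_999_999):
    """Return the sorted numbers whose bit is set in the bitmap of every file."""
    common = None
    for path in paths:
        bitmap = bitmap_from_file(path, max_value)
        if common is None:
            common = bitmap
        else:
            # AND the two bitmaps by treating them as big integers.
            merged = int.from_bytes(common, "little") & int.from_bytes(bitmap, "little")
            common = bytearray(merged.to_bytes(len(bitmap), "little"))
    if common is None:
        return []
    return [
        i * 8 + b
        for i, byte in enumerate(common)
        if byte
        for b in range(8)
        if (byte >> b) & 1
    ]
```

Called as `common_numbers_bitmap(list_of_ten_paths)`, it returns the numbers present in every file. Memory use depends only on the value range, not on how many lines each file has.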
And here comes the tricky part. I have to do it in less than 2 seconds.
I am not a developer, so I need an explanation for dummies. I have done enough research to learn that hash tables and MapReduce might be something I can make use of. But can they really make it this fast, or do I need more advanced solutions?
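To make the hash table idea concrete, here is a minimal sketch using Python's built-in `set`, which is a hash table. The function name `common_numbers` is made up for illustration, and the sketch assumes one integer per line. It loads the first file into a set, then keeps only the numbers that also show up in each later file. It shows the technique only; with 100 million numbers per file a Python set needs several gigabytes of memory, so this is not a claim about meeting the 2 second target.

```python
def common_numbers(paths):
    """Return the set of numbers that appear in every file in paths."""
    common = None
    for path in paths:
        with open(path) as f:
            # Lazily convert each non-empty line to an integer.
            numbers = (int(line) for line in f if line.strip())
            if common is None:
                # First file: every number is a candidate.
                common = set(numbers)
            else:
                # Later files: keep only candidates seen in this file too.
                common = {n for n in numbers if n in common}
    return common if common is not None else set()
```

Each lookup `n in common` takes roughly constant time on average, so the total work grows with the total number of lines rather than with the product of the file sizes.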
I have also been thinking about cutting the files up into smaller files, so that 1 file with 100,000,000 lines becomes 100 files with 1,000,000 lines.
But I do not know what is best: 10 files with 100 million lines or 1000 files with 1 million lines?
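One variation on the splitting idea, which is not the same as cutting a file every 1,000,000 lines, is to split by value range instead. Every number in a given range then lands in the bucket with the same index for all 10 files, so bucket 0 of each file can be compared on its own, then bucket 1, and so on. The sketch below is hypothetical: the function name `partition_file`, the bucket file naming, and the assumption of one non-negative integer per line are all mine.

```python
import os


def partition_file(path, out_dir, num_buckets, max_value=999_999_999):
    """Split one file into num_buckets files by value range; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    width = max_value // num_buckets + 1  # size of each value range
    base = os.path.basename(path)
    bucket_paths = [
        os.path.join(out_dir, "%s.bucket%d" % (base, i)) for i in range(num_buckets)
    ]
    outputs = [open(p, "w") for p in bucket_paths]
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    # Numbers 0..width-1 go to bucket 0, the next range to bucket 1, etc.
                    outputs[int(line) // width].write(line + "\n")
    finally:
        for out in outputs:
            out.close()
    return bucket_paths
```

With 100 buckets over the 9-digit range, each bucket covers 10,000,000 possible values, which keeps the per-bucket comparison small. The splitting pass itself still has to read every line once, so it only pays off if the files are searched more than once.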
When I try to open the 100 million line file, it takes forever. So I think, maybe, it is just too big to be used. But I don't know if you can write code that will scan it without opening it.
Speed is the most important factor in this, and I need to know whether it can be done as fast as I need, or whether I have to store my data in another way, for example in a database like MySQL.
Thank you in advance to anybody that can give some good feedback.
© Stack Overflow or respective owner