What is the fastest way to find duplicates in multiple BIG txt files?
- by user2950750
I am really in deep water here and I need a lifeline.
I have 10 txt files. Each file has up to 100,000,000 lines of data. Each line is a single number (up to 9 digits) that represents something else.
I need to (somehow) scan these 10 files and find the numbers that appear in all 10 files.
And here comes the tricky part. I have to do it in less than 2 seconds.
I am not a developer, so I need an explanation for dummies. I have done enough research to learn that hash tables and map reduce might be something I can make use of. But can they really make it this fast, or do I need a more advanced solution?
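Based on my research, this is roughly what I think a hash-table (set) approach would look like in Python. It is only a sketch of what I have read, and the file names (numbers_1.txt ... numbers_10.txt) are made up, so please correct me if this is not how it is done:

```python
# Sketch of a set-based intersection, assuming the files are named
# numbers_1.txt ... numbers_10.txt and contain one number per line.
common = None

for i in range(1, 11):
    with open(f"numbers_{i}.txt") as f:
        # Keep the numbers of this file in a set (a hash table),
        # so membership checks are fast.
        current = set(line.strip() for line in f if line.strip())

    if common is None:
        common = current
    else:
        # Keep only the numbers that have appeared in every file so far.
        common &= current

print(len(common), "numbers appear in all 10 files")
```

Is something like this realistic for files of my size, or is it the wrong direction entirely?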
I have also been thinking about cutting the files up into smaller files, so that 1 file with 100,000,000 lines becomes 100 files with 1,000,000 lines each.
But I do not know which is better: 10 files with 100 million lines each, or 1,000 files with 1 million lines each?
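If splitting is the way to go, this is how I imagine it would work: each number is put into a bucket based on the number itself, so the same number always lands in the same bucket no matter which input file it came from, and then I only have to compare matching buckets. Again, just a sketch with made-up names (the bucket count of 100 is my guess):

```python
import os

NUM_BUCKETS = 100  # my guess at a reasonable number of smaller files

def split_into_buckets(input_path, output_dir):
    """Split one big file into NUM_BUCKETS smaller bucket files."""
    os.makedirs(output_dir, exist_ok=True)
    buckets = [open(os.path.join(output_dir, f"bucket_{b}.txt"), "w")
               for b in range(NUM_BUCKETS)]
    try:
        with open(input_path) as f:
            for line in f:
                number = line.strip()
                if number:
                    # Same number -> same bucket index, in every file.
                    b = int(number) % NUM_BUCKETS
                    buckets[b].write(number + "\n")
    finally:
        for bucket in buckets:
            bucket.close()

split_into_buckets("numbers_1.txt", "buckets_1")
```

Is this the kind of splitting that actually helps, or does it just move the work around?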
When I try to open the 100-million-line file in an editor, it takes forever, so I think it may simply be too big to work with that way. But I don't know whether you can write code that scans it without loading the whole thing at once.
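From what I can tell, a program can read a file one line at a time without ever holding the whole file in memory, something like this (again just a sketch, with a made-up file name):

```python
# Sketch: go through the file line by line without loading it all at once.
count = 0
with open("numbers_1.txt") as f:
    for line in f:   # reads one line at a time, not the whole file
        count += 1
print(count, "lines")
```

Is that how big files are normally handled, and can it possibly be fast enough for 100 million lines?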
Speed is the most important factor here, and I need to know whether it can be done as fast as I need, or whether I have to store my data in another way, for example in a database like MySQL or something.
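If a database really is the better way, I imagine it would look something like the sketch below. I used SQLite here only because it comes built into Python; I do not know whether MySQL would behave the same, and the table and file names are made up:

```python
import sqlite3

# Sketch: load all numbers into one table, tagged with which file
# they came from, then let SQL find the numbers present in all 10 files.
conn = sqlite3.connect("numbers.db")
conn.execute("CREATE TABLE IF NOT EXISTS numbers (file_id INTEGER, value INTEGER)")

for i in range(1, 11):
    with open(f"numbers_{i}.txt") as f:
        rows = ((i, int(line)) for line in f if line.strip())
        conn.executemany("INSERT INTO numbers (file_id, value) VALUES (?, ?)", rows)
conn.commit()

# A number is "in all 10 files" if it appears under 10 distinct file ids.
query = """
    SELECT value
    FROM numbers
    GROUP BY value
    HAVING COUNT(DISTINCT file_id) = 10
"""
for (value,) in conn.execute(query):
    print(value)
```

Would loading everything into a database like this be faster than working on the text files directly, or slower?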
Thank you in advance to anybody who can give some good feedback.