Datastructure choices for highspeed and memory efficient detection of duplicate of strings
- by Jonathan Holland
I have a interesting problem that could be solved in a number of ways:
I have a function that takes in a string.
If this function has never seen this string before, it needs to perform some processing.
If the function has seen the string before, it needs to skip processing.
After a specified amount of time, the function should accept duplicate strings.
This function may be called thousands of time per second, and the string data may be very large.
This is a highly abstracted explanation of the real application, just trying to get down to the core concept for the purpose of the question.
The function will need to store state in order to detect duplicates. It also will need to store an associated timestamp in order to expire duplicates.
It does NOT need to store the strings, a unique hash of the string would be fine, providing there is no false positives due to collisions (Use a perfect hash?), and the hash function was performant enough.
The naive implementation would be simply (in C#):
Dictionary<String,DateTime>
though in the interest of lowering memory footprint and potentially increasing performance I'm evaluating a custom data structures to handle this instead of a basic hashtable.
So, given these constraints, what would you use?
EDIT, some additional information that might change proposed implementations:
99% of the strings will not be duplicates.
Almost all of the duplicates will arrive back to back, or nearly sequentially.
In the real world, the function will be called from multiple worker threads, so state management will need to be synchronized.