Fastest way to perform subset test operation on a large collection of sets with same domain
- by niktech
Assume we have trillions of sets stored somewhere. The domain for each of these sets is the same. It is also finite and discrete. So each set may be stored as a bit field (eg: 0000100111...) of a relatively short length (eg: 1024). That is, bit X in the bitfield indicates whether item X (of 1024 possible items) is included in the given set or not.
Now, I want to devise a storage structure and an algorithm to efficiently answer the query: what sets in the data store have set Y as a subset. Set Y itself is not present in the data store and is specified at run time.
Now the simplest way to solve this would be to AND the bitfield for set Y with bit fields of every set in the data store one by one, picking the ones whose AND result matches Y's bitfield.
How can I speed this up? Is there a tree structure (index) or some smart algorithm that would allow me to perform this query without having to AND every stored set's bitfield?
Are there databases that already support such operations on large collections of sets?