Problem I'm trying to solve, apologies in advance for the length:
Given a large number of stored records, each with a unique (String) field S. I'd like to be able to find through an indexed query all records where Hash(S) % N == K for any arbitrary N, K (e.g. given a million strings, find all strings where HashCode(s) % 17 = 5. Is there some way of memoizing this so that we can quickly answer any question of this form without doing the % on every value?
The motivation for this is a system of N distributed nodes, where each record has to be assigned to at least one node. The nodes are numbered 0 - (K-1) , and each node has to load up all of the records that match it's number:
If we have 3 nodes
Node 0 loads all records where Hash % 3 ==0
Node 1 loads all records where Hash % 3 ==1
Node 2 loads all records where Hash % 3 ==2
adding a 4th node, obviously all the assignments have to be recomputed -
Node 0 loads all records where Hash % 4 ==0
...
etc
I'd like to easily find these records through an indexed query without having to compute the mod individually.
The best I've been able to come up with so far:
If we take the prime factors of N (p1 * p2 * ... )
if N % M == I then p % M == I % p for all of N's prime factors
e.g. 10 nodes :
N % 10 == 6 then
N % 2 = 0 == 6 %2
N % 5 = 1 == 6 %5
so storing an array of the "%" of N for the first "reasonable" number of primes for my data set should be helpful. For example in the above example we store the hash and the primes
HASH PRIMES (array of %2, %3, %5, %7, ... ])
16 [0 1 1 2 .. ]
so looking for N%10 == 6 is equivalent to looking for all values where array[1]==1 and array[2] == 1.
However, this breaks at the first prime larger than the highest number I'm storing in the factor table. Is there a better way?