How does the rsync algorithm correctly identify repeating blocks?
- by Kai
I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.
Consider this example, where A is the receiver and B is the sender.
A = abcde1234512345fghij
B = abcde12345fghij
As you can see, the only change is that one of the repeated 12345 runs has been removed.
Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the blocks on the sender's side with the weak checksum gives the following values list.
abcde|12345|fghij
abcde -> 495
12345 -> 255
fghij -> 520
values = [495, 255, 520]
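To make those numbers reproducible, here is a small Python sketch of how I computed them. I'm assuming the weak checksum is simply the sum of the byte values in each block (the real rsync rolling checksum is more elaborate, but these sums match the values above), and `weak_checksum` is just my own name for it:

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    # Simplified weak checksum: the sum of the byte values in the block.
    # This stands in for rsync's real rolling checksum for illustration.
    return sum(block)

B = b"abcde12345fghij"  # the sender's data

# Split the sender's data into fixed-size blocks and checksum each one.
values = [weak_checksum(B[i:i + BLOCK_SIZE])
          for i in range(0, len(B), BLOCK_SIZE)]

print(values)  # [495, 255, 520]
```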
Next, we check whether each block's hash in A appears in the values list. If a block matches, we can skip to the end of that block for the next check. If a block doesn't match, we've found a difference. I'll step through this process.
Hash the first block. Does this hash exist in the values list? abcde -> 495 (yes, so skip)
Hash the second block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
Hash the third block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
Hash the fourth block. Does this hash exist in the values list? fghij -> 520 (yes, so skip)
No more data, we're done.
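Here is that walk-through as a small Python sketch, implementing the comparison exactly as I described it (whole-block checksums, skip a full block on every match). This is my naive reading of the algorithm, not what rsync actually does, and `looks_identical` is my own illustrative name:

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    # Same simplified weak checksum as above: sum of byte values.
    return sum(block)

A = b"abcde1234512345fghij"   # receiver's data
values = [495, 255, 520]      # sender's block checksums from above

def looks_identical(data, values):
    """Walk the data block by block: skip a whole block on a checksum
    match, stop as soon as a block's checksum is not in the list."""
    i = 0
    while i < len(data):
        if weak_checksum(data[i:i + BLOCK_SIZE]) in values:
            i += BLOCK_SIZE   # matching block: skip to its end
        else:
            return False      # non-matching block: we found a difference
    return True

print(looks_identical(A, values))  # prints True, even though A != B
```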
Since every hash was found in the values list, we conclude that A and B are the same, which, in my humble opinion, isn't true.
It seems to me this will happen whenever more than one block shares the same hash. What am I missing?