reservoir sampling problem: correctness of proof
- by eSKay
This MSDN article proves the correctness of Reservoir Sampling algorithm as follows:
Base case is trivial. For the k+1st
case, the probability a given
element i with position <= k is in R
is s/k.
The probability i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.
So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps) = s/k * k/(k+1), which is s/(k+1).
So, when k+1 = n, any element is present with probability s/n.
about step 3:
What are the k+1 rounds mentioned?
What is chosen in k steps, and not removed in k steps?
Why are we only calculating this probability for elements that were already in R after the first s steps?