How to pick random (small) data samples using Map/Reduce?
- by Andrei Savu
I want to write a map/reduce job that selects a small number of random samples from a large dataset, based on a row-level condition. I also want to minimize the number of intermediate keys the job emits.
Pseudocode:
    for each row
        if row matches condition
            put the row.id in the bucket if the bucket is not already large enough
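For illustration, the pseudocode above could be sketched in plain Python roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it uses no real Hadoop API, the names `map_split`, `reduce_sample`, and `SAMPLE_KEY` are hypothetical, and each row is assumed to be a dict with an `id` field. Each mapper keeps a local bucket of at most `k` ids and emits them under one shared key, so the number of intermediate records is bounded by `k` per input split. Passing a random generator switches the bucket from "first `k` matches" to reservoir sampling (Algorithm R), a swapped-in technique that makes the per-split sample uniform.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

SAMPLE_KEY = "sample"  # hypothetical: one shared intermediate key for all mappers


def map_split(
    rows: Iterable[Dict],
    matches: Callable[[Dict], bool],
    k: int,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, object]]:
    """Mapper for one input split: emit at most k matching row ids.

    rng=None  -> keep the first k matches (a sequential sample).
    rng given -> reservoir sampling over the matches in this split.
    """
    bucket: List[object] = []
    seen = 0  # number of matching rows seen so far
    for row in rows:
        if not matches(row):
            continue
        seen += 1
        if len(bucket) < k:
            bucket.append(row["id"])
        elif rng is None:
            break  # bucket is full and no randomness was requested
        else:
            # Algorithm R: the new row replaces a random slot with probability k/seen.
            j = rng.randrange(seen)
            if j < k:
                bucket[j] = row["id"]
    for row_id in bucket:
        yield SAMPLE_KEY, row_id


def reduce_sample(values: Iterable[object], k: int) -> List[object]:
    """Reducer: keep the first k ids received from all mappers."""
    bucket: List[object] = []
    for value in values:
        if len(bucket) == k:
            break
        bucket.append(value)
    return bucket


if __name__ == "__main__":
    rows = [{"id": i, "value": i} for i in range(1000)]
    is_even = lambda row: row["value"] % 2 == 0
    emitted = list(map_split(rows, is_even, k=5, rng=random.Random()))
    print(reduce_sample((row_id for _, row_id in emitted), k=5))
```

Note that the reducer here simply truncates to `k` ids, so the final sample is uniform within a split but only approximately random across splits; an exactly uniform result over the whole dataset would need the reducer to weight each split by its number of matching rows.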
Have you done something like this? Is there a well-known algorithm for it?
A sample containing sequential rows is also good enough.
Thanks.