How to pick random (small) data samples using Map/Reduce?
Posted by Andrei Savu on Stack Overflow
Published on 2010-03-25T08:48:16Z
I want to write a map/reduce job that selects a number of random samples from a large dataset based on a row-level condition, while minimizing the number of intermediate keys.
Pseudocode:

    for each row:
        if row matches condition:
            put row.id in the bucket, if the bucket is not already full
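The pseudocode above is close to a single-pass reservoir sample. A minimal sketch, assuming rows arrive as an iterable of (id, row) pairs and `condition` is a predicate you supply (both names are placeholders, not anything from the question):

```python
import random

def reservoir_sample(rows, k, condition):
    """Keep a uniform random sample of up to k row ids whose rows
    match `condition`, in a single pass (reservoir sampling)."""
    bucket = []   # the reservoir of selected row ids
    seen = 0      # count of matching rows seen so far
    for row_id, row in rows:
        if not condition(row):
            continue
        seen += 1
        if len(bucket) < k:
            # bucket not yet full: always take the row id
            bucket.append(row_id)
        else:
            # bucket full: keep this id with probability k / seen,
            # evicting a uniformly chosen existing entry
            j = random.randrange(seen)
            if j < k:
                bucket[j] = row_id
    return bucket
```

In a map/reduce setting, one way to keep intermediate keys to a minimum is to let each mapper maintain its own reservoir and emit it (with its match count) under a single key, so a lone reducer merges the per-mapper reservoirs, weighting each by its count.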
Have you done something like this? Is there a well-known algorithm for it?
A sample containing sequential rows is also good enough.
Thanks.