Using Hadoop, are my reducers guaranteed to get all the records with the same key?
- by samg
I'm running a Hadoop job (using Hive, actually) that is supposed to deduplicate lines across a large number of text files. More specifically, the reduce step chooses the most recently timestamped record for each key.
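To make the reduce step concrete, here is a minimal sketch of the per-key logic in plain Java, with no Hadoop or Hive dependencies. The `Rec` class, its fields, and the `latestPerKey` method are hypothetical names used only for illustration; the real job is expressed in Hive. The sketch is only correct if every record for a given key is seen by the same call, which is exactly the guarantee I'm asking about.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LatestPerKey {
    // Hypothetical record shape: a key, a timestamp, and the raw text line.
    public static final class Rec {
        public final String key;
        public final long timestamp;
        public final String line;

        public Rec(String key, long timestamp, String line) {
            this.key = key;
            this.timestamp = timestamp;
            this.line = line;
        }
    }

    // Keep only the record with the largest timestamp for each key.
    // On a timestamp tie, the record seen first is kept.
    public static Map<String, Rec> latestPerKey(List<Rec> records) {
        Map<String, Rec> latest = new HashMap<>();
        for (Rec r : records) {
            latest.merge(r.key, r,
                (current, candidate) ->
                    candidate.timestamp > current.timestamp ? candidate : current);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
            new Rec("a", 1L, "a old"),
            new Rec("b", 5L, "b only"),
            new Rec("a", 3L, "a new"));
        for (Rec r : latestPerKey(input).values()) {
            System.out.println(r.key + " -> " + r.line);
        }
    }
}
```

If records for key `a` were split between two reducers, each would emit its own "latest" record and the output would contain duplicates for that key.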
Does Hadoop guarantee that every record with the same key, output by the map step, will go to a single reducer, even if there are many reducers running across a cluster?
I'm worried that the mapper output might be split after the shuffle happens, in the middle of a set of records with the same key.