Chaining multiple MapReduce jobs in Hadoop.
Posted by Niels Basjes on Stack Overflow
Published on 2010-03-23T11:55:14Z
In many real-life situations where you apply MapReduce, the final algorithm ends up consisting of several MapReduce steps, i.e. Map1, Reduce1, Map2, Reduce2, and so on. The output of each reduce step is then needed as the input for the next map step.
The intermediate data is something you generally do not want to keep once the pipeline has completed successfully. Also, because this intermediate data is usually some data structure (like a 'map' or a 'set'), you don't want to put too much effort into writing and reading these key-value pairs.
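To make the pattern concrete (this is not Hadoop code, just a minimal plain-Python sketch of the two-stage pipeline described above; all function and variable names are hypothetical): stage 1 produces intermediate key-value pairs that are parked in a temporary file, stage 2 consumes them, and the file is deleted once the pipeline succeeds.

```python
# Plain-Python illustration of chaining Map1/Reduce1 -> Map2/Reduce2,
# with the intermediate key-value data written to a temp file and
# cleaned up afterward. All names here are hypothetical.
import os
import tempfile
from collections import defaultdict


def map1(record):
    # Stage 1 map: emit (word, 1) for every word in a line of text.
    for word in record.split():
        yield word, 1


def reduce1(key, values):
    # Stage 1 reduce: sum the counts per word.
    yield key, sum(values)


def map2(record):
    # Stage 2 map: re-key each (word, count) pair by word length.
    key, value = record
    yield len(key), value


def reduce2(key, values):
    # Stage 2 reduce: total count per word length.
    yield key, sum(values)


def run_stage(mapper, reducer, records):
    # Simulate one MapReduce step: map, shuffle (group by key), reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


lines = ["big data is big", "map reduce"]
stage1 = run_stage(map1, reduce1, lines)

# Persist the intermediate data as tab-separated key-value lines, the way
# a real pipeline would park stage 1 output on disk for stage 2 to read.
fd, inter_path = tempfile.mkstemp(suffix=".kv")
with os.fdopen(fd, "w") as f:
    for key, value in stage1:
        f.write(f"{key}\t{value}\n")


def read_intermediate(path):
    # Read the intermediate file back as (word, count) records.
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            yield key, int(value)


stage2 = run_stage(map2, reduce2, read_intermediate(inter_path))

# Cleanup: the intermediate file is only scaffolding, so remove it once
# the pipeline has completed successfully.
os.remove(inter_path)
print(dict(stage2))  # word-length -> total count
```

In real Hadoop, each `run_stage` would be a `Job` whose output path feeds the next job's input path, and the cleanup step would delete that intermediate path; the question is what the recommended, idiomatic way to wire that up is.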
What is the recommended way of doing that in Hadoop?
Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?
Thanks.