Chaining multiple MapReduce jobs in Hadoop.

Posted by Niels Basjes on Stack Overflow
Published on 2010-03-23T11:55:14Z

In many real-life applications of MapReduce, the final algorithm ends up consisting of several MapReduce steps.

That is: Map1, Reduce1, Map2, Reduce2, etc.

So the output of each reduce is needed as the input for the next map.

The intermediate data is something you (in general) do not want to keep once the pipeline has completed successfully. Also, because this intermediate data is generally some data structure (like a 'map' or a 'set'), you don't want to put too much effort into writing and reading these key-value pairs.
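To make the scenario concrete, here is a minimal sketch of the data flow in plain Java. It is not Hadoop code and not necessarily the recommended approach; it only models two chained steps that hand data over through an intermediate file, which is removed once the chain has run. The names `ChainSketch`, `wordCount`, `frequencyHistogram`, the tab-separated record format, and the temporary file name are all made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a two-step chain; no Hadoop classes are used. */
final class ChainSketch {

    /** Step 1 (Map1 + Reduce1): count how often each word occurs. */
    static Map<String, Long> wordCount(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum); // reduce: sum the 1s per key
                }
            }
        }
        return counts;
    }

    /** Step 2 (Map2 + Reduce2): count how many words share each frequency. */
    static Map<Long, Long> frequencyHistogram(List<String> intermediateLines) {
        Map<Long, Long> histogram = new TreeMap<>();
        for (String record : intermediateLines) {
            String[] kv = record.split("\t"); // hypothetical format: word<TAB>count
            histogram.merge(Long.parseLong(kv[1]), 1L, Long::sum);
        }
        return histogram;
    }

    /** Runs both steps; the intermediate file is deleted whether or not they succeed. */
    static Map<Long, Long> run(List<String> input, Path intermediate) {
        try {
            List<String> records = new ArrayList<>();
            wordCount(input).forEach((word, n) -> records.add(word + "\t" + n));
            Files.write(intermediate, records); // output of Reduce1
            return frequencyHistogram(Files.readAllLines(intermediate)); // input of Map2
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                Files.deleteIfExists(intermediate); // cleanup of the intermediate data
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"),
                "chain-sketch-intermediate.tsv");
        System.out.println(run(Arrays.asList("a b a", "c b d"), tmp));
    }
}
```

In this sketch the cleanup sits in a `finally` block, so the intermediate file is also removed when a step fails; keeping it on failure for debugging would be an equally reasonable choice.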

What is the recommended way of doing that in Hadoop?

Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?

Thanks.
