How is intermediate data organized in MapReduce?

Posted by Pedro Cattori on Programmers
Published on 2014-02-06T04:12:16Z Indexed on 2014/06/07 3:45 UTC

From what I understand, each mapper outputs an intermediate file. The intermediate data (data contained in each intermediate file) is then sorted by key.

Then, a reducer is assigned a key by the master. The reducer reads from the intermediate file containing the key and then calls reduce using the data it has read.

But in detail, how is the intermediate data organized? Can the data corresponding to a single key be spread across multiple intermediate files? What happens when there is too much data for one key to fit in a single file?

In short, how do intermediate partitions differ from intermediate files and how are these differences dealt with in the implementation?
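
To make the partition/file distinction concrete, here is a minimal sketch of one common layout, assuming hash partitioning as described in the original MapReduce paper: each mapper writes one intermediate file per partition, and reducer r reads file r from every mapper. All names (`NUM_REDUCERS`, `partition`, `run_mapper`, `run_reducer`, `word_count`) are hypothetical, in-memory lists stand in for on-disk files, and a real implementation would merge sorted runs from disk rather than group everything in memory.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # R: number of partitions (hypothetical value)


def partition(key, num_reducers=NUM_REDUCERS):
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapper(records, num_reducers=NUM_REDUCERS):
    """Return one mapper's intermediate 'files': one sorted list per partition."""
    files = [[] for _ in range(num_reducers)]
    for key, value in records:
        files[partition(key, num_reducers)].append((key, value))
    for f in files:
        f.sort(key=lambda kv: kv[0])  # each file is sorted by key
    return files


def run_reducer(r, all_mapper_files, reduce_fn):
    """Reducer r reads file r from every mapper, groups by key, then reduces."""
    grouped = defaultdict(list)
    for mapper_files in all_mapper_files:
        for key, value in mapper_files[r]:
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def word_count(splits):
    # One mapper per input split; each emits (word, 1) pairs.
    mapper_outputs = [run_mapper([(w, 1) for w in s.split()]) for s in splits]
    result = {}
    for r in range(NUM_REDUCERS):
        result.update(run_reducer(r, mapper_outputs, lambda k, vs: sum(vs)))
    return result
```

In this sketch a partition is a logical set of keys, while a file is one mapper's share of one partition. The values for a single key can therefore sit in several files (one per mapper that emitted the key), but always under the same partition index, so exactly one reducer sees all of them.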

© Programmers or respective owner
