How is intermediate data organized in MapReduce?
- by Pedro Cattori
From what I understand, each mapper outputs an intermediate file. The intermediate data (data contained in each intermediate file) is then sorted by key.
Then, a reducer is assigned a key by the master. The reducer reads from the intermediate file containing the key and then calls reduce using the data it has read.
But in detail, how is the intermediate data organized? Can the data corresponding to a single key be spread across multiple intermediate files? What happens when there is too much data for one key to fit in a single file?
In short, how do intermediate partitions differ from intermediate files and how are these differences dealt with in the implementation?
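To make the question concrete, here is a minimal sketch of the partitioning scheme as described in the original MapReduce paper, which differs a little from my description above: each map task splits its output into R partitions using hash(key) mod R, and reduce task r reads partition r from every map task, then sorts and groups by key. All names here (partition_of, run_map_task, run_reduce_task, word_count) are hypothetical, in-memory lists stand in for intermediate files, and CRC32 is just an assumed stand-in for the hash function.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_reducers):
    # Deterministic hash, so every map task sends a given key
    # to the same partition number.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(records, map_fn, num_reducers):
    # One bucket per reduce partition; each bucket stands in for
    # one intermediate file written by this map task.
    buckets = [[] for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            buckets[partition_of(key, num_reducers)].append((key, value))
    return buckets


def run_reduce_task(r, all_map_outputs, reduce_fn):
    # Gather partition r from every map task, then sort by key
    # so that equal keys are adjacent before grouping.
    pairs = []
    for buckets in all_map_outputs:
        pairs.extend(buckets[r])
    pairs.sort(key=lambda kv: kv[0])

    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    return sum(values)


def word_count(splits, num_reducers=3):
    # Each element of splits is the input of one map task.
    map_outputs = [run_map_task(s, word_count_map, num_reducers) for s in splits]
    result = {}
    for r in range(num_reducers):
        result.update(run_reduce_task(r, map_outputs, word_count_reduce))
    return result
```

In this sketch the values for one key can sit in several intermediate files (one per map task), but always under the same partition number, so a single reduce task ends up seeing all of them.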