Disco/MapReduce: Using results of previous iteration as input to new iteration

Posted by muckabout on Stack Overflow
Published on 2010-04-02T11:36:16Z Indexed on 2010/04/02 11:53 UTC

I am currently implementing PageRank on Disco. Because it is an iterative algorithm, the results of one iteration are used as input to the next.

I have a large file which represents all the links, with each row representing a page and the values in the row representing the pages to which it links.

For Disco, I break this file into N chunks, then run MapReduce for one round. As a result, I get a set of (page, rank) tuples.
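To make the round concrete, here is a minimal plain-Python sketch of what one iteration computes. It is not Disco code and does not use the Disco API. The line format (page ID followed by whitespace-separated outlinks), the damping factor of 0.85, and all function names are assumptions for illustration. Ranks are passed in as an in-memory dict, and pages with no outlinks simply contribute nothing.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; not stated in the post


def parse_line(line):
    """Split 'page out1 out2 ...' into (page, [out1, out2, ...])."""
    page, *outlinks = line.split()
    return page, outlinks


def map_page(page, outlinks, rank):
    """Emit an equal share of the page's rank to each page it links to."""
    yield page, 0.0  # keeps pages without inlinks in the output
    for target in outlinks:
        yield target, rank / len(outlinks)


def reduce_ranks(pairs, num_pages):
    """Sum the shares per page and apply the damped PageRank update."""
    totals = defaultdict(float)
    for page, share in pairs:
        totals[page] += share
    base = (1 - DAMPING) / num_pages
    return {page: base + DAMPING * total for page, total in totals.items()}


def run_round(chunks, ranks, num_pages):
    """Map every line of every chunk, then reduce all emitted pairs."""
    pairs = []
    for chunk in chunks:
        for line in chunk:
            page, outlinks = parse_line(line)
            pairs.extend(map_page(page, outlinks, ranks[page]))
    return reduce_ranks(pairs, num_pages)


lines = ["A B C", "B C", "C A"]
chunks = [lines[:2], lines[2:]]        # N = 2 chunks
ranks = {page: 1 / 3 for page in "ABC"}  # uniform initial ranks
new_ranks = run_round(chunks, ranks, num_pages=3)
```

The result `new_ranks` is the set of (page, rank) pairs from one round, held here as a dict.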

I'd like to feed these ranks into the next iteration. However, my mapper now needs two inputs: the graph file and the page ranks.

  1. I would like to "zip" together the graph file and the page ranks, such that each line represents a page, its rank, and its outlinks.
  2. Since the graph file is split into N chunks, I need to split the PageRank vector into N parallel chunks, and zip each region of the PageRank vector with the matching graph chunk.
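The two steps above can be sketched in plain Python as follows. This is only an illustration of the "zip, then split" idea, not Disco code. It assumes the ranks fit in memory as a dict, that graph lines look like `page out1 out2 ...`, and that pages missing from the rank dict get a caller-supplied default. The function names are hypothetical.

```python
def zip_graph_with_ranks(graph_lines, ranks, default_rank):
    """Yield 'page rank out1 out2 ...' lines, one per page."""
    for line in graph_lines:
        page, *outlinks = line.split()
        rank = ranks.get(page, default_rank)
        yield " ".join([page, repr(rank), *outlinks])


def split_into_chunks(lines, n):
    """Split lines into n contiguous chunks of near-equal size."""
    lines = list(lines)
    size, extra = divmod(len(lines), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional line each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


graph = ["A B C", "B C", "C A"]
ranks = {"A": 0.5, "B": 0.25}  # "C" is missing, so it gets the default
zipped = list(zip_graph_with_ranks(graph, ranks, default_rank=0.25))
chunks = split_into_chunks(zipped, 2)
```

Zipping before splitting means each chunk is self-contained, so the rank vector never has to be split and aligned with the graph chunks separately. The cost is that the full graph is rewritten on every iteration.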

This all seems more complicated than necessary. Since this is a pretty straightforward operation for the quintessential MapReduce algorithm, I suspect I'm missing something about Disco that could really simplify the approach.

Any thoughts?
