I'm currently implementing PageRank on Disco. As an iterative algorithm, the results of one iteration are used as input to the next.
I have a large file which represents all the links, with each row representing a page and the values in the row representing the pages to which it links.
For Disco, I break this file into N chunks, then run MapReduce for one round. As a result, I get a set of (page, rank) tuples.
I'd like to feed this rank to the next iteration. However, now my mapper needs two inputs: the graph file, and the pageranks.
I would like to "zip" together the graph file and the page ranks, such that each line represents a page, its rank, and its out-links.
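To make the "zip" concrete, here is a minimal sketch in plain Python (no Disco-specific API) of joining each graph line with its rank by page id. The file format shown (a page id followed by the ids it links to, whitespace-separated) is my assumption for illustration:

```python
def zip_graph_with_ranks(graph_lines, ranks):
    """Yield (page, rank, outlinks) tuples, given graph lines
    and a {page: rank} dict from the previous iteration."""
    for line in graph_lines:
        page, *outlinks = line.split()
        # Pages with no rank yet (e.g. first iteration) default to 0.0
        yield page, ranks.get(page, 0.0), outlinks

graph = ["A B C", "B C", "C A"]
ranks = {"A": 0.4, "B": 0.3, "C": 0.3}
zipped = list(zip_graph_with_ranks(graph, ranks))
# e.g. ("A", 0.4, ["B", "C"]) for the first line
```

The sticking point is that this requires the full `ranks` dict (or the right slice of it) to be available alongside each graph chunk.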
Since the graph file is separated into N chunks, I need to split the pagerank vector into N parallel chunks and zip each region of the pagerank vector to its corresponding graph chunk.
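The splitting step I have in mind looks roughly like this sketch, where I assume each graph chunk `i` covers a known set of page ids (`page_sets[i]`); these names are hypothetical, and hashing page ids to chunks would work the same way:

```python
def partition_ranks(rank_tuples, page_sets):
    """Split (page, rank) tuples into one dict per graph chunk,
    so each rank lands in the chunk holding that page's row."""
    chunks = [dict() for _ in page_sets]
    # Invert page_sets into a page -> chunk-index lookup
    lookup = {page: i for i, pages in enumerate(page_sets) for page in pages}
    for page, rank in rank_tuples:
        chunks[lookup[page]][page] = rank
    return chunks

page_sets = [{"A", "B"}, {"C", "D"}]
rank_tuples = [("A", 0.4), ("C", 0.35), ("B", 0.25)]
rank_chunks = partition_ranks(rank_tuples, page_sets)
# chunk 0 holds ranks for A and B; chunk 1 holds the rank for C
```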
This all seems more complicated than necessary. Since PageRank is the quintessential MapReduce algorithm, I suspect I'm missing something about Disco that could really simplify the approach.
Any thoughts?