How do you execute a Unix shell command (an awk script, a pipe, etc.) on a cluster in parallel (step 1) and collect the results back to a central node (step 2)?
Hadoop seems like huge overkill with its 600k LOC, and its performance is terrible (it takes minutes just to initialize a job).
I don't need shared memory or anything like MPI/OpenMP, since I have nothing to synchronize or share; I don't need a distributed VM or anything that complex either.
Google's Sawzall seems to work only with Google's proprietary MapReduce API.
Some distributed-shell packages I found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close to the native OS as possible, maybe using Unix RPC calls.
I liked rsync's simplicity, but it seems to update remote nodes sequentially, and as far as I know you can't use it to execute scripts.
Switching to Plan 9 or some other network-oriented OS looks like overkill as well.
In short: I'm looking for a simple, distributed way to run awk scripts or similar, as close to the data as possible, with minimal initialization overhead, in a nothing-shared, nothing-synchronized fashion.
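Conceptually, something as small as the sketch below is all I'm after: step 1 fans an awk script out over plain ssh (one background job per node, nothing shared), step 2 gathers the per-node output and reduces it locally. This is an untested sketch assuming passwordless ssh; the host names, the awk program, and the data path are placeholders for whatever the real job uses.

```shell
#!/bin/sh
# scatter_gather HOSTS AWK_SCRIPT DATA_PATH
#   HOSTS       - whitespace-separated node list (placeholder names)
#   AWK_SCRIPT  - awk program to run on each node's local data
#   DATA_PATH   - path to the input file on each node
scatter_gather() {
    hosts=$1
    script=$2
    data=$3
    outdir=$(mktemp -d)

    # Step 1: launch every node in parallel; '&' backgrounds each ssh,
    # so the only per-job overhead is an ssh connection setup.
    i=0
    for h in $hosts; do
        ssh "$h" "awk '$script' '$data'" > "$outdir/part.$i" &
        i=$((i + 1))
    done
    wait    # block until every node has finished

    # Step 2: the partial outputs are now on this node; emit them so a
    # local reducer (another awk, sort, etc.) can finish the job.
    cat "$outdir"/part.*
    rm -r "$outdir"
}

# Usage (hypothetical cluster): sum column 1 on every node, then
# reduce the per-node partial sums here:
#   scatter_gather "node1 node2 node3" '{ s += $1 } END { print s }' /data/input.log \
#       | awk '{ total += $1 } END { print total }'
```

Is there an existing tool that does essentially this, but robustly (failed nodes, retries, proper quoting)?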
Thanks
Alex