Running an awk command on a cluster

Posted by alex on Stack Overflow
Published on 2010-04-16T16:28:12Z Indexed on 2010/04/16 16:33 UTC

How do you execute a Unix shell command (an awk script, a pipe, etc.) on a cluster in parallel (step 1) and collect the results back on a central node (step 2)?

  • Hadoop seems like huge overkill with its 600k LOC, and its performance is terrible (it takes minutes just to initialize a job)
  • I don't need shared memory or anything like MPI/OpenMP, since I don't need to synchronize or share anything; I also don't need a distributed VM or anything that complex
  • Google's Sawzall seems to work only with Google's proprietary MapReduce API
  • some distributed shell packages I found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close as possible to the native OS, maybe using Unix RPC calls
  • I liked rsync's simplicity, but it seems to update remote nodes sequentially, and you can't use it for executing scripts as far as I know
  • switching to Plan 9 or some other network-oriented OS looks like more overkill

I'm looking for a simple, distributed way to run awk scripts or similar, as close as possible to the data and with minimal initialization overhead, in a shared-nothing, nothing-synchronized fashion.

Thanks, Alex
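
For illustration only, the two steps in the question (run in parallel, then collect) could be sketched with nothing but ssh and the shell's own job control. This is not from the original post: the function name `run_everywhere`, the `REMOTE_SHELL` variable, the node names and the `/data/part.log` path are all made up for the example, and it assumes passwordless ssh to every node.

```shell
#!/bin/sh
# Sketch only: run one command on every node in parallel over ssh,
# then print all the per-node outputs on the central node.
# Node names, paths and the REMOTE_SHELL variable are assumptions for illustration.

# How to reach a node. Defaults to non-interactive ssh; can be overridden
# with any command that accepts "<node> <command>".
REMOTE_SHELL=${REMOTE_SHELL:-"ssh -o BatchMode=yes"}

run_everywhere() {
    cmd=$1; shift                      # $1 = command string, rest = node names
    outdir=$(mktemp -d) || return 1    # scratch dir for per-node output

    # Step 1: start one background job per node so they all run at once.
    for node in "$@"; do
        $REMOTE_SHELL "$node" "$cmd" < /dev/null \
            > "$outdir/$node.out" 2> "$outdir/$node.err" &
    done
    wait                               # block until every node has finished

    # Step 2: collect the partial results on the central node.
    cat "$outdir"/*.out
    rm -rf "$outdir"
}

# Example (hypothetical nodes and file): sum column 3 on each node,
# then add up the per-node sums locally.
#   run_everywhere "awk '{ n += \$3 } END { print n }' /data/part.log" \
#       node1 node2 node3 | awk '{ total += $1 } END { print total }'
```

Each node only reads its own local data and nothing is shared or synchronized, so the startup cost is roughly one ssh connection per node. A real version would probably also want to check exit statuses and look at the `.err` files before deleting them.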

© Stack Overflow or respective owner
