Running Awk command on a cluster
Posted by alex on Stack Overflow
Published on 2010-04-16T16:28:12Z
Indexed on
2010/04/16
16:33 UTC
How do you execute a Unix shell command (an awk script, a pipe, etc.) on a cluster in parallel (step 1) and collect the results back on a central node (step 2)?
- Hadoop seems to be huge overkill with its 600k LOC, and its performance is terrible (it takes minutes just to initialize the job)
- I don't need shared memory or something like MPI/OpenMP, as I don't need to synchronize or share anything; I don't need a distributed VM or anything as complex
- Google's Sawzall seems to work only with Google's proprietary MapReduce API
- some distributed shell packages I found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close as possible to the native OS, maybe using Unix RPC calls
- I liked rsync's simplicity, but it seems to update remote nodes sequentially, and you can't use it for executing scripts as far as I know
- switching to Plan 9 or some other network-oriented OS looks like more overkill
I'm looking for a simple, distributed way to run awk scripts or similar, as close as possible to the data and with minimal initialization overhead, in a nothing-shared, nothing-synchronized fashion.
Thanks Alex
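
To make the two steps concrete, here is a minimal sketch of what such a job could look like with nothing but ssh and background jobs. It is an illustration under assumptions, not a tested cluster tool: it assumes passwordless ssh to every node, that each node already holds its own data file at the same path, and that the awk program contains no single quotes. The names run_on_nodes and REMOTE_SH, and the node names and paths in the usage comment, are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: fan an awk program out to several nodes, then merge the output.

# Command used to reach a node (assumed: plain ssh with key-based login).
# Overridable so the sketch can be tried without a real cluster.
REMOTE_SH=${REMOTE_SH:-ssh}

# run_on_nodes AWK_PROGRAM REMOTE_FILE NODE...
# Step 1: start one background job per node; each writes to its own temp file.
# Step 2: wait for every job, then print the per-node results, ordered by
#         node name, on stdout of the central node.
run_on_nodes() {
    local prog=$1 file=$2
    shift 2
    local tmp node
    tmp=$(mktemp -d) || return 1
    for node in "$@"; do
        # The program and path are single-quoted for the remote shell, so the
        # awk program must not contain single quotes itself.
        # stdin comes from /dev/null so background jobs do not eat the terminal.
        $REMOTE_SH "$node" "awk '$prog' '$file'" < /dev/null > "$tmp/$node.out" &
    done
    wait
    cat "$tmp"/*.out
    rm -rf "$tmp"
}

# Hypothetical usage: sum column 1 on each node, then add up the partial sums.
#   run_on_nodes '{ s += $1 } END { print s }' /data/part.txt node1 node2 node3 |
#       awk '{ total += $1 } END { print total }'
```

All nodes work at the same time because each remote command runs as a background job; the only coordination is the final wait. Nothing is shared between nodes, and the only startup cost is one ssh connection per node.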
© Stack Overflow or respective owner