Running Awk command on a cluster

Posted by alex on Stack Overflow
Published on 2010-04-16T16:28:12Z Indexed on 2010/04/16 16:33 UTC

Filed under: parallel-processing | distributed | rpc | mapreduce | filesystems

How do you execute a Unix shell command (an awk script, a pipe, etc.) on a cluster in parallel (step 1) and collect the results back on a central node (step 2)?

Hadoop seems like huge overkill with its 600k LOC, and its performance is terrible (it takes minutes just to initialize a job).
I don't need shared memory or something like MPI/OpenMP, since I don't need to synchronize or share anything; I don't need a distributed VM or anything that complex.
Google's Sawzall seems to work only with Google's proprietary MapReduce API.
Some distributed shell packages I found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close as possible to the native OS, maybe using Unix RPC calls.
I liked rsync's simplicity, but it seems to update remote nodes sequentially, and you can't use it for executing scripts as far as I know.
Switching to Plan 9 or some other network-oriented OS looks like more overkill.

I'm looking for a simple, distributed way to run awk scripts or similar, as close as possible to the data, with minimal initialization overhead, in a shared-nothing, nothing-synchronized fashion.

Thanks Alex
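
To make the two steps concrete, here is a minimal sketch of one way it could be done. It is not a tested cluster tool. It assumes every node is reachable over ssh with key-based login and already has awk and the data locally. The names run_everywhere, ssh_argv and local_argv, the host names, and the /data/part.txt path are all made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_argv(host, command):
    """Build the argv that runs `command` on `host` through the ssh client."""
    # BatchMode makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def local_argv(host, command):
    """Dry-run variant: ignore `host` and run the command in a local shell."""
    return ["sh", "-c", command]


def run_everywhere(hosts, command, build_argv=ssh_argv, max_workers=16):
    """Step 1: run `command` once per host, in parallel.
    Step 2: return (host, exit_code, stdout) tuples in the order of `hosts`."""

    def run_one(host):
        proc = subprocess.run(
            build_argv(host, command), capture_output=True, text=True
        )
        return host, proc.returncode, proc.stdout

    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, hosts))


if __name__ == "__main__":
    # Hypothetical host names and data path; replace with real ones.
    hosts = ["node01", "node02", "node03"]
    awk_job = "awk '{ sum += $1 } END { print sum }' /data/part.txt"
    for host, code, out in run_everywhere(hosts, awk_job):
        print(host, code, out.strip())
```

Nothing is shared or synchronized between nodes: each one runs the same command against its own local file, and the central node only reads back stdout. Passing build_argv=local_argv lets you try the fan-out and collection logic on a single machine before pointing it at real hosts.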

© Stack Overflow or respective owner
