Bash Parallelization of CPU-intensive processes

Posted by ehsanul on Server Fault
Published on 2011-01-14T22:30:24Z Indexed on 2011/01/14 22:55 UTC

Filed under: pipe, parallel, stdin

tee copies its stdin to every file it is given, and pee (from moreutils) does the same for pipes: each program sends every line of its stdin to every file or pipe specified.

However, I am looking for a way to "load balance" stdin across several pipes, so that one line goes to the first pipe, the next line to the second, and so on. It would also be nice if the stdout of the pipes were collected into a single stream.

The use case is simple parallelization of CPU-intensive processes that work line by line. I was running sed on a 14GB file, and it could have finished much faster with multiple sed processes. The command was like this:

pv infile | sed 's/something//' > outfile
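One workaround that needs no special tooling is to cut the file into line-aligned chunks, run one sed per chunk in the background, and concatenate the results in order. The sketch below assumes the input is a regular file (not a stream) and that there is enough temporary disk space for a second copy; parallel_sed is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Sketch only: parallel_sed is a hypothetical helper, not an existing tool.
# Usage: parallel_sed SED_EXPRESSION INFILE [JOBS]
parallel_sed() {
  local expr=$1 infile=$2 jobs=${3:-4}
  local tmp total per chunk
  [[ -s $infile ]] || return 0          # nothing to do for an empty file
  tmp=$(mktemp -d) || return 1
  total=$(wc -l < "$infile")            # extra pass over the file to count lines
  per=$(( (total + jobs - 1) / jobs ))  # lines per chunk, rounded up
  (( per > 0 )) || per=1
  split -l "$per" "$infile" "$tmp/chunk."
  for chunk in "$tmp"/chunk.*; do
    sed "$expr" "$chunk" > "$chunk.out" &   # one background sed per chunk
  done
  wait                                  # block until every sed has finished
  cat "$tmp"/chunk.*.out                # suffixes sort in the original order
  rm -rf "$tmp"
}
```

It would be called as parallel_sed 's/something//' infile 4 > outfile. The price is an extra read of the input for the line count and a temporary copy of the data, so this only pays off when sed itself is the bottleneck rather than the disk.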

To parallelize this, the ideal would be for GNU parallel to support it like so (the --demux-stdin option is made up):

pv infile | parallel -u -j4 --demux-stdin "sed 's/something//'" > outfile

However, there's no option like this, and parallel always uses its stdin as arguments for the command it invokes, like xargs. So I tried the following, but it is hopelessly slow, and it's clear why (it spawns a new sed process for every single line):

pv infile | parallel -u -j4 "echo {} | sed 's/something//'" > outfile
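For comparison, current GNU parallel releases do document a --pipe mode, which hands blocks of stdin to each job's stdin instead of turning lines into arguments. That is close to the made-up --demux-stdin idea. The sketch below assumes such a release is installed; demux_sed is a hypothetical wrapper name and the 10M block size is an arbitrary illustrative value.

```shell
#!/usr/bin/env bash
# Sketch only: demux_sed is a hypothetical wrapper name.
# Assumes a GNU parallel release that provides --pipe.
demux_sed() {
  # --pipe  : split stdin into blocks of whole lines, one block per job's stdin
  # --block : approximate block size (10M is an arbitrary choice)
  # -j4     : run up to four jobs at once
  # -k      : emit job output in the same order as the input
  parallel --pipe --block 10M -j4 -k "sed 's/something//'"
}
```

Usage would then look like pv infile | demux_sed > outfile. Because each sed receives a large block rather than one line, the per-process startup cost is paid once per block instead of once per line.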

I just wanted to know if there's any other way to do this (short of coding it up myself). If there were a "load-balancing" tee (let's call it lee), I could do this:

pv infile | lee >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile)

Not pretty, so I'd definitely prefer something like the made-up parallel version, but this would work too.
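A rough approximation of lee can be put together with awk, which can print to shell pipelines. This is a minimal sketch, not a real tool: lee here is a hypothetical shell function, and it takes a worker count plus one command string rather than a list of process substitutions.

```shell
#!/usr/bin/env bash
# Sketch only: lee is the hypothetical name from the post, not a real program.
# Usage: lee N COMMAND
# Starts N copies of COMMAND and deals stdin lines to them in turn.
lee() {
  local n=$1
  shift
  # The command is passed through the environment so that awk does not
  # interpret backslashes in it.
  LEE_CMD="$*" awk -v n="$n" '
    BEGIN {
      # A trailing shell comment makes every command string unique, so awk
      # opens one pipe per worker instead of reusing a single pipe.
      for (i = 0; i < n; i++) worker[i] = ENVIRON["LEE_CMD"] " # worker " i
    }
    { print | worker[NR % n] }   # round-robin: line NR goes to worker NR mod n
    END { for (i = 0; i < n; i++) close(worker[i]) }   # wait for the workers
  '
}
```

With this, the example becomes pv infile | lee 4 "sed 's/something//'" > outfile. Output order across workers is not preserved, and workers that block-buffer their output can interleave partial lines on a shared stream, so line-buffered output (GNU sed has -u) or one output file per worker is the safer choice.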
