Improve efficiency when using parallel to read from compressed stream

Posted by Yoga on Server Fault
Published on 2014-05-28T14:42:08Z Indexed on 2014/05/28 15:31 UTC

This is another question extending the previous one [1].

I have a compressed file and stream it into a Python program, e.g.

bzcat data.bz2 | parallel --no-notice -j16 --pipe python parse.py > result.txt

parse.py reads from stdin continuously and prints to stdout.
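The actual parse.py is not shown in the question; a minimal sketch of a script with that shape (the per-line transform here is a hypothetical stand-in) would be:

```python
import sys


def parse_line(line):
    # Hypothetical per-record transform; the real parse.py logic is not
    # given in the question, so upper-casing stands in for it here.
    return line.strip().upper()


def main():
    # Stream stdin line by line and write results to stdout, so the
    # script composes with `bzcat ... | parallel --pipe`.
    for line in sys.stdin:
        print(parse_line(line))


if __name__ == "__main__":
    main()
```
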

My EC2 instance has 16 cores, but the top command shows a load average of only 3 to 4.

From ps, I am seeing a lot of entries like:

sh -c 'dd bs=1 count=1 of=/tmp/7D_YxccfY7.chr 2>/dev/null';       

I know I can improve performance using `-a in.txt`, but in my case I am streaming from bz2 (I cannot extract it first since I don't have enough disk space).

How can I improve efficiency in my case?

[1] Gnu parallel not utilizing all the CPU

© Server Fault or respective owner
