Improve efficiency when using parallel to read from compressed stream

Posted by Yoga on Server Fault
Published on 2014-05-28T14:42:08Z Indexed on 2014/05/28 15:31 UTC

This is a follow-up to my previous question [1].

I have a compressed file that I stream into a Python program, e.g.

bzcat data.bz2 | parallel --no-notice -j16 --pipe python parse.py > result.txt

parse.py reads from stdin continuously and prints to stdout.
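For illustration only, here is a minimal sketch of what a parse.py of this shape might look like. The real script is not shown in this post, so parse_line is a hypothetical placeholder that just upper-cases each line; the only point is the line-by-line stdin-to-stdout loop that parallel --pipe feeds.

```python
import sys


def parse_line(line: str) -> str:
    # Hypothetical placeholder for the real parsing logic (an assumption,
    # not the actual parse.py): strip the newline and upper-case the text.
    return line.rstrip("\n").upper()


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # Iterate over the input lazily, one line at a time, so the whole
    # stream never has to fit in memory.
    for line in stdin:
        stdout.write(parse_line(line) + "\n")


if __name__ == "__main__":
    main()
```

With --pipe, each parallel job would run one copy of a script like this on its own block of the input.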

My EC2 instance has 16 cores, but top shows a load average of only 3 to 4.

In the ps output I see a lot of processes like this:

sh -c 'dd bs=1 count=1 of=/tmp/7D_YxccfY7.chr 2>/dev/null';       

I know I can improve performance by using -a in.txt, but in my case I am streaming from a bz2 file (I cannot extract it because I don't have enough disk space).

How can I improve the efficiency in my case?

[1] Gnu parallel not utilizing all the CPU

© Server Fault or respective owner
