Improve efficiency when using parallel to read from compressed stream
Posted by Yoga on Server Fault
Published on 2014-05-28T14:42:08Z
parallel
This question follows on from a previous one [1].
I have a compressed file that I stream into a Python program, e.g.
bzcat data.bz2 | parallel --no-notice -j16 --pipe python parse.py > result.txt
parse.py reads from stdin continuously and prints to stdout.
My EC2 instance has 16 cores, but the top command shows a load average of only 3 to 4.
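For context, a script of this kind might look like the minimal sketch below. The real parse.py is not shown, so parse_line and its uppercase transformation are hypothetical placeholders; only the line-by-line stdin-to-stdout shape reflects what is described here.

```python
import sys


def parse_line(line):
    # Hypothetical per-record transformation; the actual parsing
    # logic of parse.py is not given in the question.
    return line.rstrip("\n").upper()


def run(instream, outstream):
    # Iterate lazily over the stream so records are processed as they
    # arrive instead of reading the whole input into memory first.
    for line in instream:
        outstream.write(parse_line(line) + "\n")


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

With --pipe, each parallel job would receive a chunk of the decompressed stream on its stdin, so a filter written this way needs no changes to run under it.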
In the ps output, I see a lot of processes like:
sh -c 'dd bs=1 count=1 of=/tmp/7D_YxccfY7.chr 2>/dev/null';
I know I can use -a in.txt to improve performance, but in my case I am streaming from bz2 (I cannot extract it since I don't have enough disk space).
How can I improve the efficiency in my case?
© Server Fault or respective owner