Parallel processing slower than sequential?
- by zebediah49
EDIT: For anyone who stumbles upon this in the future:
Imagemagick uses a MP library. It's faster to use available cores if they're around, but if you have parallel jobs, it's unhelpful.
Do one of the following:
do your jobs serially (with Imagemagick in parallel mode)
set MAGICK_THREAD_LIMIT=1 for your invocation of the imagemagick binary in question.
By making Imagemagick use only one thread, it slows down by 20-30% in my test cases, but meant I could run one job per core without issues, for a significant net increase in performance.
Original question:
While converting some images using ImageMagick, I noticed a somewhat strange effect. Using xargs was significantly slower than a standard for loop. Since xargs limited to a single process should act like a for loop, I tested that, and found it to be about the same.
Thus, we have this demonstration.
Quad core (AMD Athalon X4, 2.6GHz)
Working entirely on a tempfs (16g ram total; no swap)
No other major loads
Results:
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level
real 0m3.784s
user 0m2.240s
sys 0m0.230s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level
real 0m9.097s
user 0m28.020s
sys 0m0.910s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level
real 0m9.844s
user 0m33.200s
sys 0m1.270s
Can anyone think of a reason why running two instances of this program takes more than twice as long in real time, and more than ten times as long in processor time to complete the same task? After that initial hit, more processes do not seem to have as significant of an effect.
I thought it might have to do with disk seeking, so I did that test entirely in ram. Could it have something to do with how Convert works, and having more than one copy at once means it cannot use processor cache as efficiently or something?
EDIT: When done with 1000x 769KB files, performance is as expected. Interesting.
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level
real 3m37.679s
user 5m6.980s
sys 0m6.340s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level
real 3m37.152s
user 5m6.140s
sys 0m6.530s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level
real 2m7.578s
user 5m35.410s
sys 0m6.050s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 4 convert -auto-level
real 1m36.959s
user 5m48.900s
sys 0m6.350s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level
real 1m36.392s
user 5m54.840s
sys 0m5.650s