Bash script 'while read' loop causes 'broken pipe' error when run with GNU Parallel
- by Joe White
According to the GNU Parallel mailing list, this is not a GNU Parallel-specific problem, and they suggested that I post it here.
The error I'm getting is "broken pipe", but I should first explain the context and what triggers it. It happens whenever I run a bash script containing a 'while read' loop under GNU Parallel.
I have a basic bash script like this:
#!/bin/bash
# linkcheck.sh
# Read one domain per line from stdin and look each one up.
while IFS= read -r domain
do
    host "$domain"
done
Assume that I want to pipe in a large list (say, 250 MB):
cat urllist | ./linkcheck.sh
Running the host command on 250 MB worth of URLs is rather slow. To speed things up, I want to break the input into chunks before piping it and then run multiple jobs in parallel. GNU Parallel is capable of doing this:
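To make the chunking idea concrete, here is a minimal sketch in plain bash rather than GNU Parallel: it splits a small list into fixed-size chunks with split and feeds each chunk to its own background copy of a stdin-reading loop. The check_domain function is a hypothetical stand-in for host so that the sketch runs offline, and the sample domains and the chunk size of 2 lines are arbitrary assumptions.

```bash
#!/bin/bash
# Sketch only: plain-bash chunking, not GNU Parallel itself.

# Hypothetical stand-in for `host "$domain"`.
check_domain() {
    printf 'checked %s\n' "$1"
}

# Same shape as linkcheck.sh: consume domains from stdin, one per line.
linkcheck() {
    while IFS= read -r domain; do
        check_domain "$domain"
    done
}

workdir=$(mktemp -d)
printf '%s\n' a.example b.example c.example d.example e.example > "$workdir/urllist"

# Cut the list into chunks of 2 lines each: chunk.aa, chunk.ab, chunk.ac
split -l 2 "$workdir/urllist" "$workdir/chunk."

# Start one background job per chunk, each with its chunk on stdin.
for chunk in "$workdir"/chunk.*; do
    linkcheck < "$chunk" > "$chunk.out" &
done
wait  # block until every background job has finished

# Collect the per-chunk results in chunk order.
results=$(cat "$workdir"/chunk.*.out)
printf '%s\n' "$results"
rm -rf "$workdir"
```

Each job here gets its data on stdin, which is how the while read loop expects to be fed.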
cat urllist | parallel --pipe -j0 parallel ./linkcheck.sh {}
As I understand it, {} is substituted by the contents of urllist line by line. Assume that my system's default setup is capable of running roughly 500 jobs per instance of parallel. To get around this limitation, we can parallelize Parallel itself:
cat urllist | parallel -j10 --pipe parallel -j0 ./linkcheck.sh {}
This will run roughly 5000 jobs. Sadly, it also causes a "broken pipe" error (bash FAQ). Yet the script starts to work if I remove the while read loop and take the input directly from the argument that {} supplies, e.g.:
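For reference, "broken pipe" in general means that a process wrote to a pipe after the reading end had been closed. Below is a minimal sketch of that mechanism, independent of GNU Parallel and of my script: yes writes forever, head exits after one line, and the writer is then terminated by SIGPIPE (or gets a write error if SIGPIPE is being ignored). The domain name is a placeholder. Whether this is what happens inside the nested parallel call is exactly what I am unsure about.

```bash
#!/bin/bash
# The writer (yes) never stops on its own; the reader (head) exits after one line.
yes domain.example | head -n 1

# PIPESTATUS holds the exit status of every command in the last pipeline.
statuses=("${PIPESTATUS[@]}")
writer_status=${statuses[0]}
reader_status=${statuses[1]}

# A writer killed by SIGPIPE usually reports 141 (128 + signal number 13).
echo "writer exit status: $writer_status"
echo "reader exit status: $reader_status"
```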
#!/bin/bash
# linkchecker.sh
# Look up the single domain passed as the first argument.
domain="$1"
host "$domain"
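One difference between the two versions that may be relevant: the first reads domains from standard input and ignores its arguments, while the second reads its first argument and ignores standard input. The sketch below shows that contrast with both scripts written as functions. Here check_domain is a hypothetical stand-in for host, and /dev/null stands for "nothing on stdin". I have not confirmed what GNU Parallel actually connects to each job's stdin in my command.

```bash
#!/bin/bash
check_domain() { printf 'checked %s\n' "$1"; }  # hypothetical stand-in for host

# Like linkcheck.sh: reads stdin, never looks at "$1".
from_stdin() {
    while IFS= read -r domain; do
        check_domain "$domain"
    done
}

# Like linkchecker.sh: reads "$1", never touches stdin.
from_arg() {
    check_domain "$1"
}

# The domain is passed as an argument and there is nothing on stdin.
out_stdin=$(from_stdin a.example < /dev/null)
out_arg=$(from_arg a.example < /dev/null)

echo "stdin version printed: '$out_stdin'"
echo "arg version printed:   '$out_arg'"
```

The stdin version prints nothing in this setup, because its loop never sees the argument.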
Why does it not work with a while read loop? Is it safe to just ignore SIGPIPE to suppress the "broken pipe" message, or will that have side effects such as data corruption?
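To make the second question concrete, this is the kind of thing I mean by turning SIGPIPE off: a sketch using trap '' PIPE inside a subshell. As far as I understand it, ignoring the signal does not make the write succeed. The writer gets a write error (EPIPE) instead of being killed, so its exit status depends on the program (GNU yes is assumed here, and the domain name is a placeholder).

```bash
#!/bin/bash
# Run in a subshell so the ignored signal does not leak into the rest of the script.
writer_status=$(
    trap '' PIPE                          # ignore SIGPIPE; child processes inherit this
    yes domain.example 2>/dev/null | head -n 1 >/dev/null
    echo "${PIPESTATUS[0]}"               # exit status of the writer (yes)
)

# With SIGPIPE ignored the writer is not killed by the signal (so not 141),
# but its write still fails and it exits with an error status of its own.
echo "writer exit status with SIGPIPE ignored: $writer_status"
```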
Thanks for reading.