I have a parallelized automation script which needs to call many other scripts, some of which hang because they (incorrectly) wait for standard input. That's not a big deal because I catch those with alarm. The trick is to shut down those hung grandchild processes when the child shuts down. I thought various incantations of SIGCHLD, waiting, and process groups could do the trick, but they all block and the grandchildren aren't reaped.
My solution, which works, just doesn't seem like it is the right solution. I'm not especially interested in the Windows solution just yet, but I'll eventually need that too. Mine only works for Unix, which is fine for now.
I wrote a small script that takes the number of simultaneous parallel children to run and the total number of forks:
$ fork_bomb <parallel jobs> <number of forks>
$ fork_bomb 8 500
This will probably hit the per-user process limit within a couple of minutes. Many solutions I've found just tell you to increase the per-user process limit, but I need this to run about 300,000 times, so that isn't going to work. Similarly, suggestions to re-exec and so on to clear the process table aren't what I need. I'd like to actually fix the problem instead of slapping duct tape over it.
I crawl the process table looking for the child processes and shut down the hung processes individually in the SIGALRM handler, which needs to die because the rest of real code has no hope of success after that. The kludgey crawl through the process table doesn't bother me from a performance perspective, but I wouldn't mind not doing it:
use Parallel::ForkManager;
use Proc::ProcessTable;
my $pm = Parallel::ForkManager->new( $ARGV[0] );
my $alarm_sub = sub {
kill 9,
map { $_->{pid} }
grep { $_->{ppid} == $$ }
@{ Proc::ProcessTable->new->table };
die "Alarm rang for $$!\n";
};
foreach ( 0 .. $ARGV[1] )
{
print ".";
print "\n" unless $count++ % 50;
my $pid = $pm->start and next;
local $SIG{ALRM} = $alarm_sub;
eval {
alarm( 2 );
system "$^X -le '<STDIN>'"; # this will hang
alarm( 0 );
};
$pm->finish;
}
If you want to run out of processes, take out the kill.
I thought that setting a process group would work so I could kill everything together, but that blocks:
my $alarm_sub = sub {
kill 9, -$$; # blocks here
die "Alarm rang for $$!\n";
};
foreach ( 0 .. $ARGV[1] )
{
print ".";
print "\n" unless $count++ % 50;
my $pid = $pm->start and next;
setpgrp(0, 0);
local $SIG{ALRM} = $alarm_sub;
eval {
alarm( 2 );
system "$^X -le '<STDIN>'"; # this will hang
alarm( 0 );
};
$pm->finish;
}
The same thing with POSIX's setsid didn't work either, and I think that actually broke things in a different way since I'm not really daemonizing this.
Curiously, Parallel::ForkManager's run_on_finish happens too late for the same clean-up code: the grandchildren are apparently already disassociated from the child processes at that point.