Parallel prologue and epilogue in Grid Engine
Posted by ajdecon on Server Fault
Published on 2011-03-02T18:43:48Z
Indexed on 2011/03/06 0:12 UTC
We have a cluster being used to run MPI jobs for a customer. Previously this cluster used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some other features). Unfortunately, we are having trouble duplicating some of our maintenance scripts in the Grid Engine environment.
In Torque, we have a prologue.parallel script which is used to carry out an automated health-check on the node. If this script returns a fail condition, Torque will helpfully offline the node and re-queue the job to use a different group of nodes.
In Grid Engine, however, the queue "prolog" runs only on the head node of the job. We can manually run our prologue script from the startmpi.sh initialization script for the mpi parallel environment, but I can't figure out how to detect a fail condition and carry out the same "mark offline and requeue" procedure.
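To make the goal concrete, here is a rough sketch of the behaviour we are after, written as something the PE start procedure could call. Several things in it are assumptions rather than known-good configuration: the health-check path is hypothetical, the check is reached over ssh (which may not suit every site), and `qmod -d` needs to be run by a user with operator rights on an admin host. Exit code 99 is documented in sge_conf(5) as "reschedule the job" for the queue prolog; whether the PE start script is treated the same way is something I have not confirmed.

```python
#!/usr/bin/env python3
"""Sketch: health-check every node of a parallel job, disable bad ones, requeue."""
import os
import subprocess
import sys

# Hypothetical location of the site health-check script on each node.
HEALTH_CHECK = "/opt/site/bin/node_health_check"
# Exit code that asks Grid Engine to reschedule the job (per sge_conf(5) for
# the prolog; assumed, not verified, for the PE start procedure).
RESCHEDULE = 99


def parse_pe_hostfile(text):
    """Return (host, queue_instance) pairs from the contents of $PE_HOSTFILE.

    Each line has the form: hostname slots queue processor-range
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, queue = fields[0], fields[2]
        if "@" not in queue:
            queue = queue + "@" + host  # build the queue instance name
        entries.append((host, queue))
    return entries


def run_command(cmd):
    """Run a command and return its exit status."""
    return subprocess.call(cmd)


def check_nodes(entries, run=run_command):
    """Health-check each host; disable failing queue instances.

    Returns 0 if every node passed, RESCHEDULE otherwise.
    """
    failed = [(host, queue) for host, queue in entries
              if run(["ssh", host, HEALTH_CHECK]) != 0]
    for _host, queue in failed:
        # Disable the queue instance so the rescheduled job avoids this node.
        run(["qmod", "-d", queue])
    return RESCHEDULE if failed else 0


if __name__ == "__main__" and "PE_HOSTFILE" in os.environ:
    with open(os.environ["PE_HOSTFILE"]) as handle:
        sys.exit(check_nodes(parse_pe_hostfile(handle.read())))
```

The `run` parameter exists only so the logic can be exercised without a real cluster; in production the default `run_command` would be used.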
Any suggestions?
© Server Fault or respective owner