Parallel prologue and epilogue in Grid Engine

Posted by ajdecon on Server Fault
Published on 2011-03-02T18:43:48Z Indexed on 2011/03/06 0:12 UTC

Filed under: clustering | gridengine | torque

We have a cluster that runs MPI jobs for a customer. It previously used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some of its other features). Unfortunately, we are having trouble reproducing some of our maintenance scripts in the Grid Engine environment.

In Torque, we have a prologue.parallel script that carries out an automated health check on the node. If this script returns a failure, Torque helpfully offlines the node and requeues the job on a different group of nodes.

In Grid Engine, however, the queue "prolog" runs only on the head node of the job. We can run our prologue script manually from startmpi.sh, the initialization script for the mpi parallel environment, but I can't figure out how to detect a failure there and carry out the same "mark offline and requeue" procedure.
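
For illustration, the manual approach could look roughly like the sketch below. It is an assumption-laden outline, not a working replacement for the Torque behaviour. It reads the host list that Grid Engine writes to the file named by `$PE_HOSTFILE` (host name in the first column), runs a health check on each host, and reports which hosts failed. The helper names, the use of ssh, and the path `/usr/local/sbin/node_healthcheck` are all hypothetical. How Grid Engine reacts to a non-zero exit from the parallel environment start script should be checked against the Grid Engine man pages, and this sketch does not offline or requeue anything itself.

```python
import os
import subprocess
import sys


def parse_pe_hostfile(text):
    """Return the unique host names (first column) from $PE_HOSTFILE content."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines; a host can be listed more than once, keep it once.
        if fields and fields[0] not in hosts:
            hosts.append(fields[0])
    return hosts


def ssh_check(host, command="/usr/local/sbin/node_healthcheck"):
    """Hypothetical check: run a site health-check script on the host via ssh."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def failed_hosts(hostfile_text, check=ssh_check):
    """Return the hosts for which the check function reports a failure."""
    return [h for h in parse_pe_hostfile(hostfile_text) if not check(h)]


def main():
    """Check every host of the current job; return 1 if any host fails."""
    with open(os.environ["PE_HOSTFILE"]) as fh:
        bad = failed_hosts(fh.read())
    if bad:
        print("health check failed on: " + " ".join(bad), file=sys.stderr)
        return 1
    return 0
```

A wrapper called from startmpi.sh could end with `sys.exit(main())`. The list of failed hosts is the piece an administrator would still need to act on, which is exactly the "mark offline and requeue" step that remains open here.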

Any suggestions?

