Simultaneous process mysteriously ending
Posted by Matt on Server Fault
Published on 2014-06-02T15:34:59Z
Indexed on 2014/06/03 3:30 UTC
I'm trying to run a large air quality model, written in FORTRAN, set up with bash scripts, and run in a work queue (Slurm).
The first part of the modeling is to run an "entry" model; this runs with MPI in the work queue, but only on one process.
At one point in the logs, there's a mysterious FORTRAN STOP, and then later the model fails because something wasn't set up properly.
This FORTRAN STOP isn't from the main process, which continues running.
This is a huge model, but as far as I know there should not be any other processes running at the same time.
It consistently fails at the exact same spot. (I can move it by adding debug output, but the debug output is in the main process.)
How can I determine what this process is?
I've tried adding a call to strace -feprocess $SHELL in the run script, but I'm new to this, so if it has offered any info, I haven't been able to use it yet. There is no trace output around the FORTRAN STOP.
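One way to make the strace output easier to use might be to write it to a file with timestamps, then list which PIDs started which programs and how they exited. The sketch below assumes the run script can be launched under strace directly. The names run_traced, summarize_trace, and trace.log are hypothetical, and the parsing assumes the "PID TIME syscall(...)" line layout that strace -f -tt -o normally writes. strace -f only follows descendants of the traced command, so a process started by a separate daemon would not appear in the log.

```shell
#!/bin/bash
# Sketch only: run_traced, summarize_trace and the log path are illustrative names.

# Run a command under strace, following child processes (-f), with wall-clock
# timestamps (-tt), tracing only process-management syscalls, logging to a file.
# Usage: run_traced LOGFILE COMMAND [ARGS...]
run_traced() {
    local log="$1"; shift
    strace -f -tt -e trace=process -o "$log" "$@"
}

# Print one line per execve and per exit_group call found in the log:
#   PID TIME exec PROGRAM     or     PID TIME exit STATUS
# Failed execve attempts (for example during a PATH search) are listed too,
# and a program path containing spaces is cut at the first space.
summarize_trace() {
    local log="$1"
    awk '
        $3 ~ /^execve\(/ {
            split($3, a, "\"")          # program path is the first quoted string
            print $1, $2, "exec", a[2]
        }
        $3 ~ /^exit_group\(/ {
            status = $3
            sub(/^exit_group\(/, "", status)
            sub(/\).*/, "", status)
            print $1, $2, "exit", status
        }
    ' "$log"
}
```

With these defined, something like run_traced trace.log ./run_model.sh (where run_model.sh stands in for the actual run script) followed by summarize_trace trace.log would give a short timeline. The timestamps can then be compared with the point where FORTRAN STOP appears in the model log to see which PID exited at about that moment.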
The whole process occurs so fast that I can't seem to observe it by using ps.
Is there a way I can somehow monitor all the processes being initiated from the time the work queue starts? Or some other way I can figure out what is failing?
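For what it's worth, a background loop that samples the process table at short intervals is one possible way to record processes that come and go too quickly to catch with ps by hand. This is a minimal sketch, assuming a Linux /proc filesystem and GNU date and sleep (for the %N format and fractional seconds). watch_procs and its arguments are hypothetical names. A process that lives for less than one polling interval can still be missed, and a reused PID is logged only the first time it is seen.

```shell
#!/bin/bash
# Sketch only: watch_procs and its arguments are illustrative names.

# Poll /proc and log each PID the first time it is seen, together with its
# parent PID and command line.
# Usage: watch_procs LOGFILE [INTERVAL_SECONDS] [ITERATIONS]
# ITERATIONS=0 (the default) means loop until the function is killed.
watch_procs() {
    local log="$1" interval="${2:-0.05}" iterations="${3:-0}"
    local seen=" " count=0 dir pid ppid cmd
    while [ "$iterations" -eq 0 ] || [ "$count" -lt "$iterations" ]; do
        for dir in /proc/[0-9]*; do
            pid="${dir#/proc/}"
            # Skip PIDs that were already logged.
            case "$seen" in *" $pid "*) continue ;; esac
            # The process may vanish between the glob and these reads;
            # if its status file is gone, skip it.
            ppid=$(awk '/^PPid:/ {print $2}' "$dir/status" 2>/dev/null) || continue
            cmd=$(cat "$dir/cmdline" 2>/dev/null | tr '\0' ' ')
            seen="$seen$pid "
            printf '%s pid=%s ppid=%s cmd=%s\n' \
                "$(date +%T.%N)" "$pid" "$ppid" "$cmd" >> "$log"
        done
        count=$((count + 1))
        sleep "$interval"
    done
}
```

Started near the top of the job script as watch_procs procs.log 0.05 & and killed at the end, this would leave a timestamped list of every process that appeared on the node during the job. The first pass logs everything that was already running, so the interesting entries are the later ones. The ppid field shows which process launched each entry.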
This is running on CentOS 6.4, with Slurm, compiled with PGI 13.
© Server Fault or respective owner