Simultaneous process mysteriously ending

Posted by Matt on Server Fault
Published on 2014-06-02T15:34:59Z

I'm trying to run a large air quality model, written in FORTRAN, set up with bash scripts, and run in a work queue (Slurm).

The first part of the modeling is to run an "entry" model, which runs with MPI in the work queue but on only one process.

At one point in the logs, there's a mysterious FORTRAN STOP, and then later the model fails because something wasn't set up properly.

This FORTRAN STOP isn't from the main process, which continues running.

This is a huge model, but as far as I know there should not be any other processes running at the same time.

It consistently fails at exactly the same spot. (I can move the spot by adding debug output, but that debug output is in the main process.)

How can I determine what this process is?

I've tried adding a call to

strace -feprocess $SHELL

in the run script, but I'm new to this, so if it has offered any info, I haven't been able to use it yet. There is no trace output around the FORTRAN STOP.
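
For reference, here is a minimal sketch of how the trace could wrap the model launch itself instead of a new shell. `strace -feprocess $SHELL` traces a freshly started shell, which may be why nothing shows up around the FORTRAN STOP. The helper name `trace_run`, the log file name, and the `mpirun -np 1 ./entry_model` launch line are placeholders, not taken from the actual run script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the model's scripts): run a command
# under strace and log the process-management syscalls (fork/clone,
# execve, exit_group, ...) of the command and every child it spawns.
#   -f                follow child processes
#   -tt               prefix each line with a microsecond timestamp
#   -e trace=process  only trace process-management syscalls
#   -o FILE           write the trace to FILE instead of stderr
trace_run() {
    local log="$1"
    shift
    if command -v strace >/dev/null 2>&1; then
        strace -f -tt -e trace=process -o "$log" "$@"
    else
        # Fall back to an untraced run so the job still proceeds.
        echo "strace not found; running untraced" >&2
        "$@"
    fi
}

# Example use inside the run script (placeholder launch line):
# trace_run "strace.${SLURM_JOB_ID:-manual}.log" mpirun -np 1 ./entry_model
```

With `-f` and `-o`, each line of the log starts with the PID that made the call, so the `execve` entries should show which program each child process is running, and the `exit_group` entries should show which PID exits and with what status.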

The whole process occurs so fast that I can't seem to observe it by using ps.

Is there a way I can somehow monitor all the processes being initiated from the time the work queue starts? Or some other way I can figure out what is failing?

This is running on CentOS 6.4, with Slurm, compiled with PGI 13.

© Server Fault or respective owner
