Unable to run OpenMPI across more than two machines
- by rcollyer
When running the first example from the Boost.MPI tutorial, I cannot get it to run across more than two machines. Specifically, this runs fine:
mpirun -hostfile hostnames -np 4 boost1
where each line in hostnames has the form <node_name> slots=2 max_slots=2. But when I increase the number of processes to 5, which requires a third machine, it just hangs. I also reduced slots and max_slots to 1, with the same result: it hangs as soon as more than two machines are involved. On the nodes, this shows up in the process list:
<user> Ss orted --daemonize -mca ess env -mca orte_ess_jobid 388497408 \
-mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -hnp-uri \
388497408.0;tcp://<node_ip>:48823
Additionally, when I kill it, I get this message:
node2- daemon did not report back when launched
node3- daemon did not report back when launched
The cluster is set up with the MPI and Boost libraries accessible on an NFS-mounted drive. Am I running into a deadlock with NFS, or is something else going on?
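
For reference, here is a minimal sketch of the hostfile layout described above, and of why -np 5 is the first case that touches a third machine. The names node1 to node3 are placeholders (node2 and node3 come from the error message, node1 is assumed), and the arithmetic assumes processes fill each node's slots before moving to the next node.

```shell
# Write a three-node hostfile in the format described above.
# node1..node3 are placeholder names; substitute the real hostnames.
cat > hostnames <<'EOF'
node1 slots=2 max_slots=2
node2 slots=2 max_slots=2
node3 slots=2 max_slots=2
EOF

# Assuming slots on one node are filled before the next node is used,
# a job with np processes touches ceil(np / slots_per_node) machines.
slots_per_node=2
for np in 4 5; do
    nodes_needed=$(( (np + slots_per_node - 1) / slots_per_node ))
    echo "-np $np needs $nodes_needed node(s)"
done
```

With slots=2, four processes fit on two machines and the fifth spills onto a third, which matches the point at which the hang appears.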