Description
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Intel OPA build of OMPI v3.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Internal CI testing on a back-to-back OPA fabric.
Please describe the system on which you are running
- Operating system/version: RHEL 7.5, 7.6, SLES 12 SP3, SLES 12 SP4
- Computer hardware: Xeon Skylake or better
- Network type: OPA
Details of the problem
One of our testers recently reported that the following command line:
/usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hds1fna5102.hd.intel.com,hds1fna5103.hd.intel.com,hds1fna5104.hd.intel.com,hds1fna5101.hd.intel.com --allow-run-as-root --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 -np 200 --map-by node --oversubscribe /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 200 -iter 150 -iter_policy off
is hanging "unless they remove --mca pml ob1". Of course, looking at this command line, I'm pretty sure that if they do that they stop using OPA altogether.
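For context on the scale of the job, here is a rough sketch of how the 200 ranks would land on the four hosts under `--map-by node`. It assumes plain round-robin placement in the order the hosts appear in `-H` and ignores slot counts and any other mapper details; the helper `map_by_node` is purely illustrative and not part of Open MPI.

```python
from collections import Counter

# Host list copied from the -H argument of the command line above.
hosts = [
    "hds1fna5102.hd.intel.com",
    "hds1fna5103.hd.intel.com",
    "hds1fna5104.hd.intel.com",
    "hds1fna5101.hd.intel.com",
]
num_ranks = 200  # from -np 200


def map_by_node(hosts, num_ranks):
    """Hypothetical helper: assume rank i is placed on hosts[i % len(hosts)]."""
    return {rank: hosts[rank % len(hosts)] for rank in range(num_ranks)}


placement = map_by_node(hosts, num_ranks)
ranks_per_host = Counter(placement.values())

# Print how many ranks each host ends up with under this assumption.
for host in hosts:
    print(f"{host}: {ranks_per_host[host]} ranks")
```

Under that assumption each node carries 50 ranks, so the Sendrecv traffic between neighbouring ranks crosses the TCP BTL on ib0 for nearly every pair.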
Since I'm still trying to learn the OMPI internals, I'm unsure how to approach this, so a few questions:
- Any chance this is a known issue?
- Assuming it's not, any suggestions on how to debug a silent hang? (I mean, besides staring at the screen and hoping for inspiration to strike...)