Need advice on debugging an issue with 3.1.3 and -mca pml ob1 #6833

Closed
@mwheinz

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Intel OPA build of OMPI v3.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Internal CI testing on a back-to-back OPA fabric.

Please describe the system on which you are running

  • Operating system/version: RHEL 7.5, 7.6, SLES 12 SP3, SLES 12 SP4
  • Computer hardware: Xeon Skylake or better
  • Network type: OPA

Details of the problem

One of our testers recently reported that the following command line:

/usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hds1fna5102.hd.intel.com,hds1fna5103.hd.intel.com,hds1fna5104.hd.intel.com,hds1fna5101.hd.intel.com --allow-run-as-root --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 -np 200 --map-by node --oversubscribe /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 200 -iter 150 -iter_policy off

hangs "unless they remove --mca pml ob1". Of course, looking at this command line, I'm pretty sure that if they do that, they stop using OPA altogether.
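
One way to check that suspicion would be to turn up component-selection verbosity and compare the output with and without --mca pml ob1. This is only a sketch: the two verbosity parameters below are not part of the tester's original command, and the level 10 is an arbitrary choice.

```shell
# Assumption: these verbosity settings are NOT in the original report.
# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables,
# so exporting them before mpirun is equivalent to adding "--mca <name> <value>".
export OMPI_MCA_pml_base_verbose=10   # log PML component selection
export OMPI_MCA_btl_base_verbose=10   # log BTL component selection

# Then re-run the same mpirun command twice, once with and once without
# "--mca pml ob1", and compare which PML each run reports as selected.
```

The selection messages should show whether the run without --mca pml ob1 still ends up on ob1 with the TCP BTL or picks a different PML entirely.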

I'm still learning the OMPI internals and am unsure how to approach this, so a few questions:

  1. Any chance this is a known issue?
  2. Assuming it's not, any suggestions on how to debug a silent hang? (I mean, besides staring at the screen and hoping for inspiration to strike...)
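
For the second question, the only generic starting point I know of is to grab stack traces from the stuck ranks on each node and see where they are all waiting. A minimal sketch, assuming gdb is installed and the user is allowed to attach to the processes; the helper names gdb_bt_cmd and dump_stacks and the DRY_RUN switch are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: collect backtraces from hung processes on the local node.
# gdb_bt_cmd, dump_stacks and DRY_RUN are hypothetical names, not Open MPI tooling.

# Print the gdb command that dumps every thread's backtrace for one PID.
gdb_bt_cmd() {
    printf 'gdb -p %s -batch -ex "thread apply all bt"\n' "$1"
}

# For every local process whose name exactly matches $1, either print the
# gdb command (DRY_RUN=1) or run it and save the output to a per-PID file.
dump_stacks() {
    local name="$1" pid
    for pid in $(pgrep -x "$name"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            gdb_bt_cmd "$pid"
        else
            gdb -p "$pid" -batch -ex "thread apply all bt" \
                > "bt.$(hostname).$pid.txt" 2>&1
        fi
    done
}
```

Running dump_stacks IMB-MPI1 on each of the four hosts while the job is hung would leave one file per rank, which can then be compared to see whether every rank is sitting in the same call.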
