[openmpi-4.1.2] No components were able to be opened in the pml framework #9838

Open
@Honggang-LI


What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2 (openmpi-4.1.2-1.el8)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

+ ./configure --prefix=/usr/lib64/openmpi --mandir=/usr/share/man/openmpi-x86_64 --includedir=/usr/include/openmpi-x86_64 --sysconfdir=/etc/openmpi-x86_64 --disable-silent-rules --enable-builtin-atomics --enable-mpi-cxx --enable-mpi-java --enable-mpi1-compatibility --with-sge --with-valgrind --enable-memchecker --with-hwloc=/usr --with-libevent=external --with-pmix=external

Please describe the system on which you are running

  • Operating system/version: rhel-8.6
  • Computer hardware: x64
  • Network type: RDMA (IB/IWARP/ROCE/OPA)

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

shell$ + [21-12-16 09:20:42] timeout 3m /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include mlx4_0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' -mca btl_openib_allow_ib 1 --mca mtl_openib_verbose 100 --mca btl_base_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx4_ib0 --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 -hostfile /root/hfile_one_core -np 2 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1 PingPong
[rdma-virt-00:79988] mca: base: components_register: registering framework btl components
[rdma-virt-00:79988] mca: base: components_register: found loaded component ofi
[rdma-virt-00:79988] mca: base: components_register: component ofi register function successful
[rdma-virt-00:79988] mca: base: components_register: found loaded component self
[rdma-virt-00:79988] mca: base: components_register: component self register function successful
[rdma-virt-00:79988] mca: base: components_register: found loaded component sm
[rdma-virt-00:79988] mca: base: components_register: found loaded component tcp
[rdma-virt-00:79988] mca: base: components_register: component tcp register function successful
[rdma-virt-00:79988] mca: base: components_register: found loaded component usnic
[rdma-virt-00:79988] mca: base: components_register: component usnic register function successful
[rdma-virt-00:79988] mca: base: components_register: found loaded component vader
[rdma-virt-00:79988] mca: base: components_register: component vader register function successful
[rdma-virt-00:79988] mca: base: components_open: opening btl components
[rdma-virt-00:79988] mca: base: components_open: found loaded component ofi
[rdma-virt-00:79988] mca: base: components_open: component ofi open function successful
[rdma-virt-00:79988] mca: base: components_open: found loaded component self
[rdma-virt-00:79988] mca: base: components_open: component self open function successful
[rdma-virt-00:79988] mca: base: components_open: found loaded component tcp
[rdma-virt-00:79988] mca: base: components_open: component tcp open function successful
[rdma-virt-00:79988] mca: base: components_open: found loaded component usnic
[rdma-virt-00:79988] mca: base: components_open: component usnic open function successful
[rdma-virt-00:79988] mca: base: components_open: found loaded component vader
[rdma-virt-00:79988] mca: base: components_open: component vader open function successful
[rdma-virt-00:79988] select: initializing btl component ofi
[rdma-virt-01:80287] mca: base: components_register: registering framework btl components
[rdma-virt-01:80287] mca: base: components_register: found loaded component ofi
[rdma-virt-01:80287] mca: base: components_register: component ofi register function successful
[rdma-virt-01:80287] mca: base: components_register: found loaded component self
[rdma-virt-01:80287] mca: base: components_register: component self register function successful
[rdma-virt-01:80287] mca: base: components_register: found loaded component sm
[rdma-virt-01:80287] mca: base: components_register: found loaded component tcp
[rdma-virt-01:80287] mca: base: components_register: component tcp register function successful
[rdma-virt-01:80287] mca: base: components_register: found loaded component usnic
[rdma-virt-01:80287] mca: base: components_register: component usnic register function successful
[rdma-virt-01:80287] mca: base: components_register: found loaded component vader
[rdma-virt-01:80287] mca: base: components_register: component vader register function successful
[rdma-virt-01:80287] mca: base: components_open: opening btl components
[rdma-virt-01:80287] mca: base: components_open: found loaded component ofi
[rdma-virt-01:80287] mca: base: components_open: component ofi open function successful
[rdma-virt-01:80287] mca: base: components_open: found loaded component self
[rdma-virt-01:80287] mca: base: components_open: component self open function successful
[rdma-virt-01:80287] mca: base: components_open: found loaded component tcp
[rdma-virt-01:80287] mca: base: components_open: component tcp open function successful
[rdma-virt-01:80287] mca: base: components_open: found loaded component usnic
[rdma-virt-01:80287] mca: base: components_open: component usnic open function successful
[rdma-virt-01:80287] mca: base: components_open: found loaded component vader
[rdma-virt-01:80287] mca: base: components_open: component vader open function successful
[rdma-virt-01:80287] select: initializing btl component ofi
[rdma-virt-00:79988] select: init of component ofi returned success
[rdma-virt-00:79988] select: initializing btl component self
[rdma-virt-00:79988] select: init of component self returned success
[rdma-virt-00:79988] select: initializing btl component tcp
[rdma-virt-00:79988] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[rdma-virt-00:79988] btl: tcp: Found match: 127.0.0.1 (lo)
[rdma-virt-00:79988] btl:tcp: Attempting to bind to AF_INET port 1024
[rdma-virt-00:79988] btl:tcp: Successfully bound to AF_INET port 1024
[rdma-virt-00:79988] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[rdma-virt-00:79988] btl:tcp: examining interface mlx4_roce
[rdma-virt-00:79988] btl:tcp: using ipv6 interface mlx4_roce
[rdma-virt-00:79988] btl:tcp: examining interface mlx4_ib0
[rdma-virt-00:79988] btl:tcp: using ipv6 interface mlx4_ib0
[rdma-virt-00:79988] btl:tcp: examining interface mlx4_ib0.8004
[rdma-virt-00:79988] btl:tcp: using ipv6 interface mlx4_ib0.8004
[rdma-virt-00:79988] btl:tcp: examining interface mlx4_ib0.8002
[rdma-virt-00:79988] btl:tcp: using ipv6 interface mlx4_ib0.8002
[rdma-virt-00:79988] btl:tcp: examining interface mlx4_roce.43
[rdma-virt-00:79988] btl:tcp: using ipv6 interface mlx4_roce.43
[rdma-virt-00:79988] btl:tcp: examining interface mlx4_roce.45
[rdma-virt-00:79988] btl:tcp: using ipv6 interface mlx4_roce.45
[rdma-virt-00:79988] btl:tcp: examining interface lab-bridge0
[rdma-virt-00:79988] btl:tcp: using ipv6 interface lab-bridge0
[rdma-virt-00:79988] select: init of component tcp returned success
[rdma-virt-00:79988] select: initializing btl component usnic
[rdma-virt-00:79988] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[rdma-virt-00:79988] select: init of component usnic returned failure
[rdma-virt-00:79988] mca: base: close: component usnic closed
[rdma-virt-00:79988] mca: base: close: unloading component usnic
[rdma-virt-00:79988] select: initializing btl component vader
[rdma-virt-00:79988] select: init of component vader returned failure
[rdma-virt-00:79988] mca: base: close: component vader closed
[rdma-virt-00:79988] mca: base: close: unloading component vader
[rdma-virt-01:80287] select: init of component ofi returned success
[rdma-virt-01:80287] select: initializing btl component self
[rdma-virt-01:80287] select: init of component self returned success
[rdma-virt-01:80287] select: initializing btl component tcp
[rdma-virt-01:80287] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[rdma-virt-01:80287] btl: tcp: Found match: 127.0.0.1 (lo)
[rdma-virt-01:80287] btl:tcp: Attempting to bind to AF_INET port 1024
[rdma-virt-01:80287] btl:tcp: Successfully bound to AF_INET port 1024
[rdma-virt-01:80287] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_roce
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_roce
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_ib0
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_ib0
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_ib1
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_ib1
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_ib0.8004
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_ib0.8004
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_ib0.8002
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_ib0.8002
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_ib1.8003
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_ib1.8003
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_ib1.8005
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_ib1.8005
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_roce.43
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_roce.43
[rdma-virt-01:80287] btl:tcp: examining interface mlx4_roce.45
[rdma-virt-01:80287] btl:tcp: using ipv6 interface mlx4_roce.45
[rdma-virt-01:80287] btl:tcp: examining interface lab-bridge0
[rdma-virt-01:80287] btl:tcp: using ipv6 interface lab-bridge0
[rdma-virt-01:80287] select: init of component tcp returned success
[rdma-virt-01:80287] select: initializing btl component usnic
[rdma-virt-01:80287] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[rdma-virt-01:80287] select: init of component usnic returned failure
[rdma-virt-01:80287] mca: base: close: component usnic closed
[rdma-virt-01:80287] mca: base: close: unloading component usnic
[rdma-virt-01:80287] select: initializing btl component vader
[rdma-virt-01:80287] select: init of component vader returned failure
[rdma-virt-01:80287] mca: base: close: component vader closed
[rdma-virt-01:80287] mca: base: close: unloading component vader
[rdma-virt-00:79988] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-virt-01:80287] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-virt-00:79988] common_ucx.c:304 posix/memory: did not match transport list
[rdma-virt-00:79988] common_ucx.c:304 sysv/memory: did not match transport list
[rdma-virt-00:79988] common_ucx.c:304 self/memory0: did not match transport list
[rdma-virt-00:79988] common_ucx.c:304 tcp/mlx4_ib0: did not match transport list
[rdma-virt-00:79988] common_ucx.c:311 support level is none
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      rdma-virt-00
  Framework: pml
--------------------------------------------------------------------------
[rdma-virt-00:79988] PML ucx cannot be selected
[rdma-virt-01:80287] common_ucx.c:304 posix/memory: did not match transport list
[rdma-virt-01:80287] common_ucx.c:304 sysv/memory: did not match transport list
[rdma-virt-01:80287] common_ucx.c:304 self/memory0: did not match transport list
[rdma-virt-01:80287] common_ucx.c:304 tcp/mlx4_ib0: did not match transport list
[rdma-virt-01:80287] common_ucx.c:311 support level is none
[rdma-virt-01:80282] 1 more process has sent help message help-mca-base.txt / find-available:none found
[rdma-virt-01:80282] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
v4.1.1 also has this issue. We reverted commit c36d7459b6331c4da825cad5a64326e7c1a272aa in our v4.1.1 build to avoid it.
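
For anyone triaging similar output: in the log above, every resource UCX reports on each host (posix, sysv, self, tcp) is logged as "did not match transport list", after which the support level is "none" and the ucx PML cannot be selected. The snippet below is a minimal sketch for extracting those rejected transport/device pairs from a saved verbose log. It assumes the message format is exactly as it appears in this report; the function name `rejected_transports` and the regular expression are illustrative only and not part of Open MPI or UCX.

```python
import re
from collections import defaultdict

# Matches lines of the form seen in this report, e.g.
#   [rdma-virt-00:79988] common_ucx.c:304 tcp/mlx4_ib0: did not match transport list
# The pattern is an assumption based on this log only.
REJECTED = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+common_ucx\.c:\d+\s+"
    r"(?P<tl>[^/\s]+)/(?P<dev>[^:\s]+): did not match transport list"
)


def rejected_transports(log_text):
    """Return {host: [(transport, device), ...]} for each rejected UCX resource."""
    result = defaultdict(list)
    for line in log_text.splitlines():
        match = REJECTED.match(line.strip())
        if match:
            result[match.group("host")].append(
                (match.group("tl"), match.group("dev"))
            )
    return dict(result)


# Sample lines copied from the log in this report.
sample = """\
[rdma-virt-00:79988] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-virt-00:79988] common_ucx.c:304 posix/memory: did not match transport list
[rdma-virt-00:79988] common_ucx.c:304 sysv/memory: did not match transport list
[rdma-virt-00:79988] common_ucx.c:304 self/memory0: did not match transport list
[rdma-virt-00:79988] common_ucx.c:304 tcp/mlx4_ib0: did not match transport list
[rdma-virt-00:79988] common_ucx.c:311 support level is none
"""

if __name__ == "__main__":
    for host, pairs in rejected_transports(sample).items():
        print(host, pairs)
```

Run against the full log, this lists the same four transport/device pairs for both rdma-virt-00 and rdma-virt-01, which makes it easy to see that only shared-memory, self, and tcp resources were offered with `UCX_NET_DEVICES=mlx4_ib0`.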
