Description
I tested the execution of a simple inter-node job between two nodes over our InfiniBand network with updates 5, 6 and 7 of Intel MPI v2019, and I found very different results for each release. All tests were carried out with iccifort/2020.1.217 as the base of the toolchain.
Characteristics of the testing system
- CPU: 2x Intel(R) Xeon(R) Gold 6126
- Adapter: Mellanox Technologies MT27700 Family [ConnectX-4]
- Operating system: CentOS 7.7
- Related system libraries: UCX v1.5.1, OFED v4.7-3.2.9
- ICC: v2020.1 (from Easybuild)
- Resource manager: Torque
Steps to reproduce:
- Start a job on two nodes
- Load impi
- Compile and run the bundled test program:
$ mpicc ${EBROOTIMPI}/test/test.c -o test
$ mpirun ./test
Intel MPI v2019 update 5: works out of the box
$ module load impi/2019.5.281-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.7.2a
libfabric: 1.7.2a
libfabric api: 1.7
$ fi_info | grep provider
provider: verbs;ofi_rxm
provider: verbs;ofi_rxd
provider: verbs
provider: verbs
provider: verbs
$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
Intel MPI v2019 update 6: does NOT work out of the box, but can be fixed
$ module load impi/2019.6.166-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.9.0a1
libfabric: 1.9.0a1-impi
libfabric api: 1.8
$ fi_info | grep provider
provider: mlx
provider: mlx;ofi_rxm
$ mpirun ./test
[1585832682.960816] [node357:302190:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy, mm/sysv - Destination is unreachable, mm/posix - Destination is unreachable, cma/cma - no am bcopy
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
- Solution 1: use the verbs or tcp libfabric providers instead of mlx
$ module load impi/2019.6.166-iccifort-2020.1.217
$ FI_PROVIDER=verbs,tcp mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
- Solution 2: use a more up-to-date UCX. Intel claims that at least v1.4 is required for mlx, but for us it only works with UCX v1.7 (available in Easybuild).
$ module load impi/2019.6.166-iccifort-2020.1.217
$ module load UCX/1.7.0-GCCcore-9.3.0
$ ucx_info
# UCT version=1.7.0 revision
# configured with: --prefix=/user/brussel/101/vsc10122/.local/easybuild/software/UCX/1.7.0-GCCcore-9.3.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc
$ FI_PROVIDER=mlx mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
- Solution 3: use an external libfabric v1.9.1. Upstream libfabric dropped mlx with version 1.9.0.
$ module load impi/2019.6.166-iccifort-2020.1.217
$ module load libfabric/1.9.1-GCCcore-9.3.0
$ export FI_PROVIDER_PATH=
$ fi_info --version
fi_info: 1.9.1
libfabric: 1.9.1
libfabric api: 1.9
$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
Intel MPI v2019 update 7: does NOT work at all
$ module load impi/2019.7.217-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.10.0a1
libfabric: 1.10.0a1-impi
libfabric api: 1.9
$ fi_info | grep provider
provider: verbs;ofi_rxm
[...]
provider: tcp;ofi_rxm
[...]
provider: verbs
[...]
provider: tcp
[...]
provider: sockets
[...]
$ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun ./test
[[email protected]] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 40969 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[[email protected]] Launch arguments: /usr/bin/ssh -q -x node356.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 40969 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:[email protected]] Warning - oversubscription detected: 1 processes will be placed on 0 cores
[proxy:0:[email protected]] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:[email protected]] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:[email protected]] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:[email protected]] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:[email protected]] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:[email protected]] PMI response: cmd=appnum appnum=0
[proxy:0:[email protected]] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:[email protected]] PMI response: cmd=my_kvsname kvsname=kvs_309778_0
[proxy:0:[email protected]] pmi cmd from fd 4: cmd=get kvsname=kvs_309778_0 key=PMI_process_mapping
[proxy:0:[email protected]] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:[email protected]] pmi cmd from fd 4: cmd=barrier_in
(the execution does not abort; it simply hangs at this point)
The system log of the node shows the following entry:
traps: hydra_pmi_proxy[549] trap divide error ip:4436ed sp:7ffed012ef50 error:0 in hydra_pmi_proxy[400000+ab000]
This error with IMPI v2019.7 happens well before libfabric is initialized, so it does not depend on the provider or on the UCX version. It happens every time.
Update
- Solution with Torque: MPI inter-node issues with Intel MPI v2019 on Mellanox IB #10314 (comment)