Description
As already mentioned on Slack, we have issues with MPI executables built against iimpi-2020a starting multi-node. Within a node I am not aware of any issues.
The problem seems to be associated with the UCX/1.8.0 dependency. Executables built with iimpi/2020.00, which uses Intel MPI 2019.6 without a UCX dependency, work multi-node. Also, if I "massage" the easyconfig impi-2019.7.217-iccifort-2020.1.217.eb and comment out the line
('UCX', '1.8.0'),
in the dependencies list, basic hello world codes and the HPL for intel/2020a will run. The performance, however, is 10% poorer than that of an HPL built with intel/2017b. Using the HPL from PR #10864, the performance is within 1% of the intel/2017b one.
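For reference, this is roughly what the change looks like in the easyconfig; a minimal sketch only, assuming UCX is the only entry affected (any other entries in the shipped file stay untouched):
dependencies = [
    # ('UCX', '1.8.0'),  # disabled: with UCX 1.8.0, multi-node startup fails for us
]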
A few details on our cluster: the system uses Intel Xeon E5-2650 v3 (Haswell) CPUs and 4x FDR InfiniBand. We are running CentOS 7, currently 7.6 or 7.8, with Linux kernel 3.10 and the InfiniBand stack from CentOS. Slurm is set up with cgroups for process control and accounting
(TaskPlugin=task/cgroup, ProctrackType=proctrack/cgroup). The Slurm version is quite old, 17.02.
To get Intel MPI started, I add (in an editor)
setenv("I_MPI_PMI_LIBRARY", "/lib64/libpmi.so")
to the impi modules (we have versions as far back as iimpi/7.3.5, predating iimpi/2016b). I tested multiple times, but libpmi2.so does not work for us. Of the methods to start an Intel MPI job described in the Slurm guide, only srun works for us; we never got Hydra or MPD to work. I also tested setting 'I_MPI_HYDRA_TOPOLIB': 'ipl', which does not help either.
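For completeness, the same setting could also be carried in the impi easyconfig itself via modextravars, so the generated module ships the setenv line instead of it being added by hand afterwards; this is just a sketch of that alternative, not what we currently do:
modextravars = {
    'I_MPI_PMI_LIBRARY': '/lib64/libpmi.so',  # libpmi2.so does not work for us
}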
When running I load:
ml iccifort/2020.1.217 impi/2019.7.217
The modules are built with unmodified easyconfigs from EB 4.2.1. When compiling and running a simple MPI hello world code, I get the following on stdout:
[1593610029.017424] [au220:9811 :0] select.c:433 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1593610029.017913] [au219:19723:0] select.c:433 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
and this in stderr:
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
slurmstepd: error: *** STEP 4574138.0 ON au219 CANCELLED AT 2020-07-01T15:27:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
srun: error: au219: task 0: Killed
srun: error: au220: task 1: Exited with exit code 143
Ok, that went long. Any suggestions would be highly appreciated.