Description
As already mentioned on Slack, we have issues with MPI executables built against iimpi-2020a starting multi-node. Within a node I am not aware of any issues.
The problem seems to be associated with the UCX/1.8.0 dependency. Executables built with iimpi/2020.00, which uses Intel MPI 2019.6 without a UCX dependency, work multi-node. Also, if I "massage" the easyconfig impi-2019.7.217-iccifort-2020.1.217.eb and comment out the line
('UCX', '1.8.0'),
in the dependencies list, basic hello world codes and the HPL for intel/2020a will run. The performance, however, is 10% poorer than that of an HPL built with intel/2017b. Using the HPL from PR #10864, the performance is within 1% of the intel/2017b one.
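For reference, this is roughly what the change looks like in the easyconfig; a minimal sketch only, assuming UCX is the only entry affected (any other entries in the shipped file stay untouched):
dependencies = [
    # ('UCX', '1.8.0'),  # disabled: with UCX 1.8.0, multi-node startup fails for us
]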
A few details on our cluster: the system uses Intel Xeon E5-2650 v3 (Haswell) CPUs and 4x FDR InfiniBand. We are running CentOS 7, currently 7.6 or 7.8, with Linux kernel 3.10 and the InfiniBand stack from CentOS. Slurm is set up with cgroups for process control and accounting
(TaskPlugin=task/cgroup, ProctrackType=proctrack/cgroup). The Slurm version is quite old, 17.02.
To get Intel MPI started, I add (in an editor)
setenv("I_MPI_PMI_LIBRARY", "/lib64/libpmi.so")
to the impi modules (we have versions as far back as iimpi/7.3.5, predating iimpi/2016b). I tested multiple times, but libpmi2.so does not work for us. Of the methods to start an Intel MPI job described in the Slurm guide, only srun works for us; we never got Hydra or MPD to work. I also tested setting 'I_MPI_HYDRA_TOPOLIB': 'ipl', which does not help either.
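For completeness, the same setting could also be carried in the impi easyconfig itself via modextravars, so the generated module ships the setenv line instead of it being added by hand afterwards; this is just a sketch of that alternative, not what we currently do:
modextravars = {
    'I_MPI_PMI_LIBRARY': '/lib64/libpmi.so',  # libpmi2.so does not work for us
}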
When running I load:
ml iccifort/2020.1.217 impi/2019.7.217
The modules are built with unmodified easyconfigs from EB 4.2.1. When compiling and running a simple MPI hello world code, I get the following on stdout:
[1593610029.017424] [au220:9811 :0] select.c:433 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1593610029.017913] [au219:19723:0] select.c:433 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
and this in stderr:
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
slurmstepd: error: *** STEP 4574138.0 ON au219 CANCELLED AT 2020-07-01T15:27:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
srun: error: au219: task 0: Killed
srun: error: au220: task 1: Exited with exit code 143
Ok, that went long. Any suggestions would be highly appreciated.