
Problem with multinode running with iimpi-2020a #10899

@jhein32

Description


As mentioned on Slack already, we have issues with MPI executables built against iimpi-2020a when starting multinode runs. Within a node I am not aware of any issues.

The problem seems to be associated with the UCX/1.8.0 dependency. Executables utilising iimpi/2020.00, which utilises Intel's MPI 19.6 without a UCX dependency, work multinode. Also, if I "massage" the easyconfig impi-2019.7.217-iccifort-2020.1.217.eb and comment out the line

 ('UCX', '1.8.0'),

in the dependencies list, basic hello world codes and the HPL for intel/2020a will run. However, the performance is about 10% poorer than an HPL built with intel/2017b. Using the HPL from PR #10864, the performance is within 1% of the intel/2017b result.

A few details on our cluster: the system uses Intel Xeon E5-2650 v3 (Haswell) CPUs and 4x FDR InfiniBand. We are running CentOS 7, currently 7.6 or 7.8, with the 3.10 Linux kernel and the InfiniBand stack shipped with CentOS. Slurm is set up with cgroups for process control and accounting
(TaskPlugin=task/cgroup, ProctrackType=proctrack/cgroup). The Slurm installation is quite old, version 17.02.

To get Intel MPI jobs started I add (in an editor)

setenv("I_MPI_PMI_LIBRARY", "/lib64/libpmi.so")

to the impi modules (we have versions going as far back as iimpi/7.3.5, predating iimpi/2016b). I tested multiple times, but libpmi2.so does not work for us. Of the methods to start Intel MPI jobs described in the Slurm guide, only srun works for us; we never got Hydra or MPD to work. I also tested setting 'I_MPI_HYDRA_TOPOLIB': 'ipl', but it does not help at all.

When running I load:

ml iccifort/2020.1.217 impi/2019.7.217

The modules are built with unmodified easyconfigs from EasyBuild 4.2.1. When compiling and running a simple MPI hello world code (a sketch of the reproducer is included after the error output below), I get the following in stdout:

[1593610029.017424] [au220:9811 :0]         select.c:433  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1593610029.017913] [au219:19723:0]         select.c:433  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy

and this in stderr:

Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
slurmstepd: error: *** STEP 4574138.0 ON au219 CANCELLED AT 2020-07-01T15:27:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
srun: error: au219: task 0: Killed
srun: error: au220: task 1: Exited with exit code 143
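
For reference, the hello world is essentially the textbook example sketched below. This is a minimal sketch, not the exact source I used; the mpiicc compile command and the srun options in the comment are assumptions rather than my actual job script.

    /* hello.c - minimal MPI reproducer; MPI_Init is where the job aborts
     * multinode with the OFI "get address vector map failed" error.
     * Built, for example, with:    mpiicc hello.c -o hello
     * Started with something like: srun -N 2 -n 2 ./hello
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                 /* fails here when run across nodes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("Hello from rank %d of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }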

Ok, that went long. Any suggestions would be highly appreciated.
