MPI jobs fail with intel toolchains after upgrade of EL8 Linux from 8.5 to 8.6 #15651

Description

@OleHolmNielsen

I'm testing the upgrade of our compute nodes from AlmaLinux 8.5 to 8.6 (AlmaLinux is a RHEL 8 clone, similar to Rocky Linux).

We have found that all MPI codes built with either of the Intel toolchains intel/2020b and intel/2021b fail after the 8.5 to 8.6 upgrade. The codes also fail on login nodes, so the Slurm queue system is not involved.
The FOSS toolchains foss/2020b and foss/2021b, however, work perfectly on EL 8.6.

My simple test uses the attached trivial MPI Hello World code (a sketch follows below), running on a single node:

$ module load intel/2021b
$ mpicc mpi_hello_world.c
$ mpirun ./a.out
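
The attached source file isn't reproduced here; any minimal MPI Hello World along these lines (a sketch of the standard textbook version) is what the test exercises:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Initialize the MPI runtime. */
    MPI_Init(&argc, &argv);

    /* Query the size of the world communicator and this process's rank. */
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Get the name of the host this rank is running on. */
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}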

Now the mpirun command hangs indefinitely (it keeps running for many minutes without producing any output), and ps shows these processes:

/bin/sh /home/modules/software/impi/2021.4.0-intel-compilers-2021.4.0/mpi/2021.4.0/bin/mpirun ./a.out
mpiexec.hydra ./a.out

The mpiexec.hydra process doesn't respond to SIGTERM (signal 15) and I have to kill it with SIGKILL (signal 9). I've tried to enable debugging output with

export I_MPI_HYDRA_DEBUG=1
export I_MPI_DEBUG=5

but neither variable produces any output.
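
As a generic next step (not specific to Intel MPI), attaching strace or gdb to the hung process can at least show which system call it is blocked in. A sketch, assuming strace and gdb are installed and <PID> is the process ID of mpiexec.hydra as found with ps:

$ strace -f -p <PID>                                           # trace system calls of the hung process and its children
$ gdb -p <PID> -ex 'thread apply all bt' -ex detach -ex quit   # dump a backtrace of every thread, then detach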

Question: has anyone tried EL 8.6 Linux with the Intel toolchain and mpiexec.hydra? Can you suggest how I might debug this issue?

OS information:

$ cat /etc/redhat-release
AlmaLinux release 8.6 (Sky Tiger)
$ uname -r
4.18.0-372.9.1.el8.x86_64
