Open MPI 2.1.0: MPI_Finalize hangs because cuIpcCloseMemHandle fails #3244

Closed
@Evgueni-Petrov-aka-espetrov

Description

Hi Open MPI,

Thank you very much for fixing #3042!

We would like to upgrade from version 2.0.2 to 2.1.0, which contains the fix, but with 2.1.0 our application hangs in MPI_Finalize.
From our point of view, this is a regression in version 2.1.0 relative to version 2.0.2.

First, MPI_Finalize warns that cuIpcCloseMemHandle failed with return value 4 (CUDA_ERROR_DEINITIALIZED), and then it prints the following messages in a loop:

[hostname:87484] Sleep on 87484
[hostname:87483] Sleep on 87483
[hostname:87478] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
[hostname:87478] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed

...
GDB shows the following stack for a hanging process:

#0  0x00007f12df23393d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f12df2337d4 in __sleep (seconds=0)
    at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x00007f12d59f63bd in cuda_closememhandle ()
   from /home/espetrov/sandbox/install_mpi/lib/libmca_common_cuda.so.20
#3  0x00007f12d55e93c9 in mca_rcache_rgpusm_finalize ()
   from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_rcache_rgpusm.so
#4  0x00007f12deca5b92 in mca_rcache_base_module_destroy ()
   from /home/espetrov/sandbox/install_mpi/lib/libopen-pal.so.20
#5  0x00007f12d435e57a in mca_btl_smcuda_del_procs ()
   from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_btl_smcuda.so
#6  0x00007f12d51e1042 in mca_bml_r2_del_procs ()
   from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_bml_r2.so
#7  0x00007f12dfaa2918 in ompi_mpi_finalize ()
   from /home/espetrov/sandbox/install_mpi/lib/libmpi.so.20

I am not sure, but my guess is that MPI_Finalize tries to close a remote memory handle after the remote MPI process has already unloaded libcuda.so.

Perhaps getting CUDA_ERROR_DEINITIALIZED from cuIpcCloseMemHandle at this point is acceptable and could be tolerated?
Our CUDA version is 7.5, CUDA driver version is 361.93.02.
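
To illustrate what I mean by tolerating the error, here is a minimal sketch. It is not the actual Open MPI code: the `sketch_*` names, the callback signature, and the `fake_close_*` functions are all hypothetical, and the enum is a stand-in for CUresult so that the example compiles without cuda.h. Only the numeric values (0 for CUDA_SUCCESS, 4 for CUDA_ERROR_DEINITIALIZED) are taken from the CUDA driver API. The idea is simply to treat "driver already deinitialized" as "nothing left to close" instead of as a failure.

```c
#include <stdio.h>

/* Stand-ins for the two CUresult values discussed in this report, so the
 * sketch compiles without cuda.h. The numeric values match the driver API:
 * CUDA_SUCCESS is 0 and CUDA_ERROR_DEINITIALIZED is 4. */
typedef enum {
    SKETCH_CUDA_SUCCESS = 0,
    SKETCH_CUDA_ERROR_DEINITIALIZED = 4
} sketch_curesult_t;

/* Hypothetical signature for whatever closes the IPC mapping; in real code
 * this would be a thin wrapper around cuIpcCloseMemHandle(). */
typedef sketch_curesult_t (*sketch_close_fn_t)(void *mapped_ptr);

/* Hypothetical helper: returns 0 when the handle was closed, or when the
 * driver is already deinitialized (assumed here to mean there is nothing
 * left to close). Returns -1 on any other error. */
static int sketch_close_memhandle(sketch_close_fn_t close_fn, void *mapped_ptr)
{
    sketch_curesult_t rc = close_fn(mapped_ptr);

    if (rc == SKETCH_CUDA_SUCCESS) {
        return 0;
    }
    if (rc == SKETCH_CUDA_ERROR_DEINITIALIZED) {
        /* Assumption: during finalize this is benign, so only log it. */
        fprintf(stderr, "close skipped: driver already deinitialized\n");
        return 0;
    }
    fprintf(stderr, "close failed with error %d\n", (int)rc);
    return -1;
}

/* Fake close functions for exercising the helper without a GPU. */
static sketch_curesult_t fake_close_ok(void *p)
{
    (void)p;
    return SKETCH_CUDA_SUCCESS;
}

static sketch_curesult_t fake_close_deinit(void *p)
{
    (void)p;
    return SKETCH_CUDA_ERROR_DEINITIALIZED;
}

static sketch_curesult_t fake_close_other(void *p)
{
    (void)p;
    return (sketch_curesult_t)1; /* any other nonzero error code */
}
```

With these stand-ins, `sketch_close_memhandle(fake_close_deinit, NULL)` returns 0 just like the success case, while `sketch_close_memhandle(fake_close_other, NULL)` returns -1. Whether ignoring the error is actually safe inside Open MPI is exactly the question I am asking above.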

Evgueni.