Hi Open MPI,
Thank you very much for fixing #3042!
We want to switch from version 2.0.2 to 2.1.0, which contains the fix, but when we do, our application starts hanging in MPI_Finalize.
From our point of view, this behavior is a regression in version 2.1.0 with respect to version 2.0.2.
First, MPI_Finalize warns that cuIpcCloseMemHandle failed with return value 4 (CUDA_ERROR_DEINITIALIZED), and then it prints the following messages in a loop:
[hostname:87484] Sleep on 87484
[hostname:87483] Sleep on 87483
[hostname:87478] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
[hostname:87478] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
...
gdb shows the following stack trace:
#0 0x00007f12df23393d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f12df2337d4 in __sleep (seconds=0)
at ../sysdeps/unix/sysv/linux/sleep.c:137
#2 0x00007f12d59f63bd in cuda_closememhandle ()
from /home/espetrov/sandbox/install_mpi/lib/libmca_common_cuda.so.20
#3 0x00007f12d55e93c9 in mca_rcache_rgpusm_finalize ()
from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_rcache_rgpusm.so
#4 0x00007f12deca5b92 in mca_rcache_base_module_destroy ()
from /home/espetrov/sandbox/install_mpi/lib/libopen-pal.so.20
#5 0x00007f12d435e57a in mca_btl_smcuda_del_procs ()
from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_btl_smcuda.so
#6 0x00007f12d51e1042 in mca_bml_r2_del_procs ()
from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_bml_r2.so
#7 0x00007f12dfaa2918 in ompi_mpi_finalize ()
from /home/espetrov/sandbox/install_mpi/lib/libmpi.so.20
I am not sure, but I would say that MPI_Finalize tries to close a remote memory handle after the remote MPI process has already unloaded libcuda.so.
Perhaps getting CUDA_ERROR_DEINITIALIZED back from cuIpcCloseMemHandle is OK in this case?
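For what it is worth, here is a minimal sketch of the kind of tolerant cleanup I would expect during teardown. This is not the actual Open MPI code, and the wrapper name close_ipc_mem_handle_tolerant is made up for illustration; it just shows CUDA_ERROR_DEINITIALIZED being treated as benign instead of warning and sleeping.

```c
/* Hypothetical helper, not the actual Open MPI code: close an IPC-mapped
 * device pointer during teardown and treat CUDA_ERROR_DEINITIALIZED as
 * benign, on the assumption that the CUDA driver may already have been
 * shut down by the time the rcache is destroyed in MPI_Finalize.
 * Compile against the driver API with -lcuda. */
#include <cuda.h>
#include <stdio.h>

static int close_ipc_mem_handle_tolerant(CUdeviceptr dptr)
{
    CUresult res = cuIpcCloseMemHandle(dptr);

    if (CUDA_SUCCESS == res || CUDA_ERROR_DEINITIALIZED == res) {
        /* Return value 4 (CUDA_ERROR_DEINITIALIZED) just means the driver
         * is already deinitialized; there is nothing left to unmap, so do
         * not warn or retry. */
        return 0;
    }

    /* Any other error is still reported. */
    const char *errstr = NULL;
    cuGetErrorString(res, &errstr);
    fprintf(stderr, "cuIpcCloseMemHandle failed: %d (%s)\n",
            (int) res, errstr ? errstr : "unknown error");
    return -1;
}
```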
Our CUDA version is 7.5, and the CUDA driver version is 361.93.02.
Evgueni.