Issue
Since 2021a, the support for CUDA-aware MPI communication has changed: rather than using a fosscuda toolchain, we now use a non-CUDA-aware MPI together with a CUDA-aware UCX (UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1). However, I'm seeing various (but not all) OSU tests fail with that setup with a segfault:
==== backtrace (tid:2888500) ====
0 0x000000000002a160 ucs_debug_print_backtrace() /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000016065c __memmove_avx_unaligned_erms() :0
3 0x0000000000052aeb non_overlap_copy_content_same_ddt() opal_datatype_copy.c:0
4 0x00000000000874b9 ompi_datatype_sndrcv() ???:0
5 0x00000000000e2d84 ompi_coll_base_alltoall_intra_pairwise() ???:0
6 0x000000000000646c ompi_coll_tuned_alltoall_intra_dec_fixed() ???:0
7 0x0000000000089a4f MPI_Alltoall() ???:0
8 0x0000000000402d40 main() ???:0
9 0x0000000000023493 __libc_start_main() ???:0
10 0x000000000040318e _start() ???:0
=================================
[gcn30:2888500] *** Process received signal ***
[gcn30:2888500] Signal: Segmentation fault (11)
[gcn30:2888500] Signal code: (-6)
[gcn30:2888500] Failing at address: 0xb155002c1334
[gcn30:2888500] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x1464e51a8b20]
[gcn30:2888500] [ 1] /lib64/libc.so.6(+0x16065c)[0x1464e4f3165c]
[gcn30:2888500] [ 2] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libopen-pal.so.40(+0x52aeb)[0x1464e485daeb]
[gcn30:2888500] [ 3] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_datatype_sndrcv+0x949)[0x1464e72454b9]
[gcn30:2888500] [ 4] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0x174)[0x1464e72a0d84]
[gcn30:2888500] [ 5] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x7c)[0x1464d494746c]
[gcn30:2888500] [ 6] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Alltoall+0x15f)[0x1464e7247a4f]
[gcn30:2888500] [ 7] osu_alltoall[0x402d40]
[gcn30:2888500] [ 8] /lib64/libc.so.6(__libc_start_main+0xf3)[0x1464e4df4493]
[gcn30:2888500] [ 9] osu_alltoall[0x40318e]
[gcn30:2888500] *** End of error message ***
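As a sanity check on the setup itself, here is a minimal sketch (using the module names from our 2021a stack, adjust as needed) of how to confirm where the CUDA awareness is supposed to come from in this configuration:
module load OpenMPI/4.1.1-GCC-10.3.0 UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1
# OpenMPI itself is built without CUDA in this setup, so this should report 'false'
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# The CUDA awareness comes from UCX; its CUDA transports should show up here
ucx_info -d | grep -i cuda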
Working/failing tests
I haven't run all OSU benchmarks, but a few that run without issues are:
mpirun -np 2 osu_latency -d cuda D D
mpirun -np 2 osu_bw -d cuda D D
mpirun -np 2 osu_bcast -d cuda D D
A few that produce the segfault are:
mpirun -np 2 osu_gather -d cuda D D
mpirun -np 2 osu_alltoall -d cuda D D
Note that for the osu_latency and osu_bw tests I get results that correspond to what can be expected from GPUDirect RDMA, so they really do seem to function properly.
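To narrow down where the failing collectives go wrong, a sketch of how one could rerun a failing case with the UCX PML forced and UCX logging turned up (the log level is just an example, and I haven't captured the extra output here yet):
mpirun -np 2 --mca pml ucx -x UCX_LOG_LEVEL=debug osu_alltoall -d cuda D D
Forcing --mca pml ucx at least makes sure the run aborts instead of silently falling back to another PML.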
Summary of discussion on EasyBuild Slack
This Open MPI thread seems to suggest that UCX has limited support for GPU operations (see here) and might only work for point-to-point communication, but not for collectives (see here). The relevant parts:
OK I would like to drag in @jsquyres for confirming this, because my impression from Jeff was that for this case Open MPI could still be CUDA-aware. @bureddy Are you saying collectives won't work but point-to-point could? Interesting...
Yes. it might change in the future when UCX handle all datatypes pack/unpack and collectives
That thread is from a while ago though, so it is possible that things have changed, but it is unclear in which version (if anything changed at all).
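One way to probe that claim on our side (a sketch, with an example transport list that I haven't verified on this system) would be to pin UCX to an explicit set of transports and check whether the passing pt2pt tests still show GPUDirect-class numbers:
# Example transport list; adjust to what ucx_info -d reports on the node
mpirun -np 2 --mca pml ucx -x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy osu_latency -d cuda D D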
If UCX indeed does not fully support operations on GPU buffers, then we might have to build OpenMPI with GPU support again (currently we build OpenMPI with UCX support, and then build UCX with CUDA support), i.e. build the smcuda BTL again. That would be a bit odd, since OpenMPI seemed to want to move away from that (and towards using UCX). Plus, it reintroduces the original issue that we prefer not to have a fosscuda toolchain, because that causes duplication of a lot of modules (see the original discussion here). As @Micket suggested on the chat: if it proves to be needed, we could try to build the smcuda BTL separately, as an add-on, like we do for UCX-CUDA.
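If it comes to that (i.e. an OpenMPI build that is itself configured with --with-cuda), selecting the smcuda path for a quick test would look roughly like the sketch below; ob1, self, vader and smcuda are the standard OpenMPI 4.1 component names, but I haven't tried this combination here:
# Only meaningful with an OpenMPI that was itself built with CUDA support
mpirun -np 2 --mca pml ob1 --mca btl self,vader,smcuda osu_alltoall -d cuda D D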
Open questions:
- Can others confirm that the tests that fail for me also fail for them?
- Can we get confirmation from experts (e.g. on the OpenMPI / UCX issue tracker) that UCX indeed still has this limited support?