Issue
Since 2021a, the support for CUDA-aware MPI communication has changed: rather than using a fosscuda toolchain, we now use a non-CUDA-aware MPI together with a CUDA-aware UCX (UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1). However, I'm seeing various (but not all) OSU tests fail with that setup with a segfault:
==== backtrace (tid:2888500) ====
0 0x000000000002a160 ucs_debug_print_backtrace() /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000016065c __memmove_avx_unaligned_erms() :0
3 0x0000000000052aeb non_overlap_copy_content_same_ddt() opal_datatype_copy.c:0
4 0x00000000000874b9 ompi_datatype_sndrcv() ???:0
5 0x00000000000e2d84 ompi_coll_base_alltoall_intra_pairwise() ???:0
6 0x000000000000646c ompi_coll_tuned_alltoall_intra_dec_fixed() ???:0
7 0x0000000000089a4f MPI_Alltoall() ???:0
8 0x0000000000402d40 main() ???:0
9 0x0000000000023493 __libc_start_main() ???:0
10 0x000000000040318e _start() ???:0
=================================
[gcn30:2888500] *** Process received signal ***
[gcn30:2888500] Signal: Segmentation fault (11)
[gcn30:2888500] Signal code: (-6)
[gcn30:2888500] Failing at address: 0xb155002c1334
[gcn30:2888500] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x1464e51a8b20]
[gcn30:2888500] [ 1] /lib64/libc.so.6(+0x16065c)[0x1464e4f3165c]
[gcn30:2888500] [ 2] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libopen-pal.so.40(+0x52aeb)[0x1464e485daeb]
[gcn30:2888500] [ 3] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_datatype_sndrcv+0x949)[0x1464e72454b9]
[gcn30:2888500] [ 4] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0x174)[0x1464e72a0d84]
[gcn30:2888500] [ 5] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x7c)[0x1464d494746c]
[gcn30:2888500] [ 6] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Alltoall+0x15f)[0x1464e7247a4f]
[gcn30:2888500] [ 7] osu_alltoall[0x402d40]
[gcn30:2888500] [ 8] /lib64/libc.so.6(__libc_start_main+0xf3)[0x1464e4df4493]
[gcn30:2888500] [ 9] osu_alltoall[0x40318e]
[gcn30:2888500] *** End of error message ***
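As a sanity check on the setup itself, here is a minimal sketch (using the module names from our 2021a stack, adjust as needed) of how to confirm where the CUDA awareness is supposed to come from in this configuration:
module load OpenMPI/4.1.1-GCC-10.3.0 UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1
# OpenMPI itself is built without CUDA in this setup, so this should report 'false'
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# The CUDA awareness comes from UCX; its CUDA transports should show up here
ucx_info -d | grep -i cuda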
Working/failing tests
I haven't run all OSU benchmarks, but a few that run without issues are:
mpirun -np 2 osu_latency -d cuda D D
mpirun -np 2 osu_bw -d cuda D D
mpirun -np 2 osu_bcast -d cuda D D
A few that produce the segfault are:
mpirun -np 2 osu_gather -d cuda D D
mpirun -np 2 osu_alltoall -d cuda D D
Note that for the osu_latency and osu_bw tests I get results that correspond to what can be expected from GPUDirect RDMA, so they really do seem to function properly.
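To narrow down where the failing collectives go wrong, a sketch of how one could rerun a failing case with the UCX PML forced and UCX logging turned up (the log level is just an example, and I haven't captured the extra output here yet):
mpirun -np 2 --mca pml ucx -x UCX_LOG_LEVEL=debug osu_alltoall -d cuda D D
Forcing --mca pml ucx at least makes sure the run aborts instead of silently falling back to another PML.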
Summary of discussion on EasyBuild Slack
This Open MPI thread seems to suggest that UCX has limited support for GPU operations (see here) and might only work for point-to-point communication, but not for collectives (see here). The relevant parts:
OK I would like to drag in @jsquyres for confirming this, because my impression from Jeff was that for this case Open MPI could still be CUDA-aware. @bureddy Are you saying collectives won't work but point-to-point could? Interesting...
Yes. it might change in the future when UCX handle all datatypes pack/unpack and collectives
That thread is from a while ago though, so it is possible that things have changed, but it is unclear in which version (if anything changed at all).
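One way to probe that claim on our side (a sketch, with an example transport list that I haven't verified on this system) would be to pin UCX to an explicit set of transports and check whether the passing pt2pt tests still show GPUDirect-class numbers:
# Example transport list; adjust to what ucx_info -d reports on the node
mpirun -np 2 --mca pml ucx -x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy osu_latency -d cuda D D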
If UCX indeed does not fully support operations on GPU buffers, then we might have to build OpenMPI with GPU support again (currently we build OpenMPI with UCX support, and then build UCX with CUDA support), i.e. build the smcuda BTL again. That would be a bit odd, since OpenMPI seemed to want to move away from that (and towards using UCX). Plus, it reintroduces the original issue that we prefer not to have a fosscuda toolchain, because that causes duplication of a lot of modules (see the original discussion here). As @Micket suggested on the chat: if it proves to be needed, we could try to build the smcuda BTL separately, as an add-on, like we do for UCX-CUDA.
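If it comes to that (i.e. an OpenMPI build that is itself configured with --with-cuda), selecting the smcuda path for a quick test would look roughly like the sketch below; ob1, self, vader and smcuda are the standard OpenMPI 4.1 component names, but I haven't tried this combination here:
# Only meaningful with an OpenMPI that was itself built with CUDA support
mpirun -np 2 --mca pml ob1 --mca btl self,vader,smcuda osu_alltoall -d cuda D D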
Open questions:
- Can others confirm that the tests that fail for me also fail for them?
- Can we get confirmation from experts (e.g. on the OpenMPI / UCX issue tracker) that UCX indeed still has this limited support?