Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI main branch (0ba5d47)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed with CUDA support (11.0.3 - the default on Summit) and UCX 1.14.0.
Loaded modules:
$ module list
Currently Loaded Modules:
1) lsf-tools/2.0 3) darshan-runtime/3.4.0-lite 5) DefApps 7) nsight-compute/2021.2.1 9) cuda/11.0.3
2) hsi/5.0.2.p5 4) xalt/1.2.1 6) gcc/9.3.0 8) nsight-systems/2021.3.1.54 10) gdrcopy/2.3
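For reference, Open MPI itself was configured roughly along these lines (a sketch only; the install prefix and CUDA path are placeholders, not the exact command):
$ ./configure --prefix=/path/to/ompi-install \
      --with-cuda=/path/to/cuda-11.0.3 \
      --with-ucx=/ccs/home/jschuchart/opt-summit/ucx-1.14.0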
UCX config summary:
configure: =========================================================
configure: UCX build configuration:
configure: Build prefix: /ccs/home/jschuchart/opt-summit/ucx-1.14.0
configure: Configuration dir: ${prefix}/etc/ucx
configure: Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure: C compiler: gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure: C++ compiler: g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure: Multi-thread: disabled
configure: NUMA support: enabled
configure: MPI tests: disabled
configure: VFS support: no
configure: Devel headers: no
configure: io_demo CUDA support: no
configure: Bindings: < >
configure: UCS modules: < >
configure: UCT modules: < cuda ib rdmacm cma knem >
configure: CUDA modules: < gdrcopy >
configure: ROCM modules: < >
configure: IB modules: < >
configure: UCM modules: < cuda >
configure: Perf modules: < cuda >
configure: =========================================================
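The UCX summary above corresponds to a configure invocation roughly like the following (a sketch; the CUDA and gdrcopy paths are placeholders, and --with-cuda/--with-gdrcopy are the standard UCX configure options):
$ ./configure --prefix=/ccs/home/jschuchart/opt-summit/ucx-1.14.0 \
      --with-cuda=/path/to/cuda-11.0.3 \
      --with-gdrcopy=/path/to/gdrcopy-2.3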
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
7d25bd021b57d4e3cea40d23bd3662180c269827 3rd-party/openpmix (v1.1.3-3825-g7d25bd02)
4725d89abe53c52343eeb49c90986c4d407d6392 3rd-party/prrte (psrvr-v2.0.0rc1-4609-g4725d89abe)
237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)
Please describe the system on which you are running
- Operating system/version: Linux (OLCF Summit)
- Computer hardware: IBM POWER9 nodes with NVIDIA V100 GPUs (Summit)
- Network type: Mellanox EDR InfiniBand
Details of the problem
I am seeing high latencies in the osu_latency benchmark (OSU benchmarks 7.0.1) when using pml/ucx:
$ UCX_IB_GPU_DIRECT_RDMA=yes OMPI_MCA_pml=ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
accelerator device: cuda
# OSU MPI-CUDA Latency Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 2.73
1 437.98
2 441.63
4 433.29
8 438.72
16 433.87
32 439.61
64 437.90
128 432.96
256 443.43
512 431.55
1024 439.35
2048 436.77
4096 440.73
8192 441.46
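For comparison, the host-memory baseline mentioned in my question at the end uses the same invocation, just with host buffers (command only; numbers omitted here):
$ OMPI_MCA_pml=ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -m $((2*4096)) H H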
If I disable UCX_IB_GPU_DIRECT_RDMA, the latency is slightly lower but still too high:
$ UCX_IB_GPU_DIRECT_RDMA=no OMPI_MCA_coll=^han,cuda OMPI_MCA_pml=ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
accelerator device: cuda
# OSU MPI-CUDA Latency Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 2.70
1 370.96
2 362.72
4 369.19
8 363.51
16 371.36
32 362.82
64 368.86
128 363.66
256 369.75
512 365.19
1024 369.61
2048 365.51
4096 372.03
8192 368.54
With pml/ob1 and btl/uct, the latency is reasonable, but I get a segfault for messages above 4 KB:
$ UCX_IB_GPU_DIRECT_RDMA=yes OMPI_MCA_coll=^han,cuda OMPI_MCA_pml=^ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
accelerator device: cuda
# OSU MPI-CUDA Latency Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 2.23
1 29.48
2 29.44
4 29.46
8 29.42
16 29.43
32 29.50
64 29.54
128 29.87
256 29.92
512 30.32
1024 30.63
2048 31.47
4096 32.78
[e28n13:249326:0:249326] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x200053200000)
==== backtrace (tid: 249326) ====
0 0x00000000000ada48 __memcpy_power7() :0
1 0x00000000000297e4 uct_rc_verbs_ep_am_bcopy() /ccs/home/jschuchart/src/ucx/ucx-1.14.0/build/src/uct/ib/../../../../src/uct/ib/rc/verbs/rc_verbs_ep.c:317
2 0x00000000000e2fd8 uct_ep_am_bcopy() /ccs/home/jschuchart/opt-summit/ucx-1.14.0/include/uct/api/uct.h:3020
3 0x00000000000e2fd8 mca_btl_uct_send_frag() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/mca/btl/uct/../../../../../opal/mca/btl/uct/btl_uct_am.c:212
4 0x00000000000d9d30 mca_btl_uct_component_progress_pending() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/mca/btl/uct/../../../../../opal/mca/btl/uct/btl_uct_component.c:600
5 0x00000000000d9fc4 mca_btl_uct_component_progress() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/mca/btl/uct/../../../../../opal/mca/btl/uct/btl_uct_component.c:649
6 0x000000000003289c opal_progress() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/../../opal/runtime/opal_progress.c:224
7 0x0000000000383410 ompi_request_wait_completion() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/ompi/mca/pml/ob1/../../../../../ompi/request/request.h:492
8 0x0000000000385de4 mca_pml_ob1_send() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/ompi/mca/pml/ob1/../../../../../ompi/mca/pml/ob1/pml_ob1_isend.c:327
9 0x0000000000189f98 PMPI_Send() /ccs/home/jschuchart/src/openmpi/ompi-clean/build/ompi/mpi/c/../../../../ompi/mpi/c/send.c:93
10 0x00000000100032d8 main() /autofs/nccs-svm1_home1/jschuchart/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/../../../../c/mpi/pt2pt/osu_latency.c:168
11 0x0000000000024078 .annobin_libc_start.c() libc-start.c:0
12 0x0000000000024264 __libc_start_main() ???:0
=================================
ERROR: One or more process (first noticed rank 0) terminated with signal 11
I repeated the experiment with CUDA 11.5.2 (the newest available module on Summit). The latencies are similar, but with pml/ucx I also get this UCX error:
[1679092339.660667] [d35n17:308900:0] cuda_ipc_iface.c:135 UCX ERROR nvmlInit_v2() failed: Driver/library version mismatch
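For completeness, one way I can verify that UCX actually detected the CUDA transports (cuda_copy, cuda_ipc, gdr_copy) on a compute node is to query the same UCX install with ucx_info (diagnostic only; output not included here):
$ ~/opt-summit/ucx-1.14.0/bin/ucx_info -d | grep -i -e cuda -e gdr
$ ~/opt-summit/ucx-1.14.0/bin/ucx_info -v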
What am I doing wrong here? Is there anything I should set to get latencies closer to those I get with host memory?