Description
Hi all,
I am contacting you because of some OSC issues on GPUs.
First of all, I have noticed that the osc/pt2pt component has been removed since version 5.0. As a consequence, if I do not compile with UCX, which is optional, one-sided communication no longer works at all, at least on GPUs.
(Note that I have tried to compile Open MPI 5.0 with the latest UCX, and a simple MPI_Put from GPU to GPU deadlocks.)
I found that Open MPI 4.0.4 with the latest UCX compiles and runs.
(Note that, for small sizes, say fewer than 1e6 doubles, we need to set UCX_ZCOPY_THRESH=1, as in the mpirun commands below.)
The context is the following:
I have a set of GPUs that exchange data mainly through MPI_Put, with MPI_Win_fence for synchronization.
When using pt2pt, I get the expected bandwidth per rank. However, when using UCX, the bandwidth of the same pattern drops.
I have written a reproducer so that you can try it.
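To make the pattern concrete, here is a minimal sketch of what the reproducer exercises: each rank exposes a CUDA device buffer through an MPI window and pushes a block of doubles to every other rank with MPI_Put inside an MPI_Win_fence epoch. The buffer sizes, the all-to-all target loop, and the timing below are illustrative assumptions, not the attached .cu file.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t nval = 1000000;   /* doubles sent to each peer (assumed size) */

    /* Device buffers: one send block, one receive slot per rank. */
    double *send_buf, *recv_buf;
    cudaMalloc((void **)&send_buf, nval * sizeof(double));
    cudaMalloc((void **)&recv_buf, (size_t)nranks * nval * sizeof(double));
    cudaMemset(send_buf, 0, nval * sizeof(double));

    /* Expose the device receive buffer as an RMA window (requires CUDA-aware MPI). */
    MPI_Win win;
    MPI_Win_create(recv_buf, (MPI_Aint)(nranks * nval * sizeof(double)),
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double t0 = MPI_Wtime();
    MPI_Win_fence(0, win);                       /* open the access epoch */
    for (int peer = 0; peer < nranks; ++peer) {
        if (peer == rank) continue;
        MPI_Put(send_buf, (int)nval, MPI_DOUBLE, peer,
                (MPI_Aint)(rank * nval), (int)nval, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                       /* close the epoch: puts are complete */
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("put/fence epoch: %.3f ms\n", elapsed * 1e3);

    MPI_Win_free(&win);
    cudaFree(send_buf);
    cudaFree(recv_buf);
    MPI_Finalize();
    return 0;
}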
The reproducer gives me the following performance on Summit:
$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ^ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx
Label | nval (count) | size (bytes) | Volume (MB) | PingPong x20 (ms) | avg_send (ms) | GB/s
____________________________________________________________________________________________
MPI_Alltoall 1000 8 0 10.49 0.52 0.17
MPI_Alltoall 10000 8 0 15.54 0.78 1.15
MPI_Alltoall 50000 8 4 33.96 1.70 2.63
MPI_Alltoall 100000 8 9 72.16 3.61 2.48
MPI_Alltoall 250000 8 22 142.01 7.10 3.15
MPI_Alltoall 500000 8 45 273.70 13.69 3.27
MPI_Alltoall 1000000 8 91 515.96 25.80 3.47
MPI_Alltoall 10000000 8 915 5057.64 252.88 3.54
Coll:MPI_Isend 1000 8 0 11.11 0.56 0.16
Coll:MPI_Isend 10000 8 0 21.60 1.08 0.83
Coll:MPI_Isend 50000 8 4 30.88 1.54 2.90
Coll:MPI_Isend 100000 8 9 66.84 3.34 2.68
Coll:MPI_Isend 250000 8 22 137.92 6.90 3.24
Coll:MPI_Isend 500000 8 45 253.12 12.66 3.53
Coll:MPI_Isend 1000000 8 91 496.56 24.83 3.60
Coll:MPI_Isend 10000000 8 915 4930.47 246.52 3.63
Coll:MPI_Put 1000 8 0 9.06 0.45 0.20
Coll:MPI_Put 10000 8 0 8.14 0.41 2.20
Coll:MPI_Put 50000 8 4 29.46 1.47 3.04
Coll:MPI_Put 100000 8 9 58.10 2.90 3.08
Coll:MPI_Put 250000 8 22 141.57 7.08 3.16
Coll:MPI_Put 500000 8 45 282.01 14.10 3.17
Coll:MPI_Put 1000000 8 91 560.79 28.04 3.19
Coll:MPI_Put 10000000 8 915 5589.39 279.47 3.20
Now, when using UCX, I get:
$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx
Label | nval (count) | size (bytes) | Volume (MB) | PingPong x20 (ms) | avg_send (ms) | GB/s
____________________________________________________________________________________________
MPI_Alltoall 1000 8 0 9.77 0.49 0.18
MPI_Alltoall 10000 8 0 15.59 0.78 1.15
MPI_Alltoall 50000 8 4 43.95 2.20 2.03
MPI_Alltoall 100000 8 9 79.04 3.95 2.26
MPI_Alltoall 250000 8 22 159.94 8.00 2.79
MPI_Alltoall 500000 8 45 276.29 13.81 3.24
MPI_Alltoall 1000000 8 91 524.07 26.20 3.41
MPI_Alltoall 10000000 8 915 5048.09 252.40 3.54
Coll:MPI_Isend 1000 8 0 8.10 0.40 0.22
Coll:MPI_Isend 10000 8 0 32.06 1.60 0.56
Coll:MPI_Isend 50000 8 4 52.25 2.61 1.71
Coll:MPI_Isend 100000 8 9 59.39 2.97 3.01
Coll:MPI_Isend 250000 8 22 126.40 6.32 3.54
Coll:MPI_Isend 500000 8 45 257.02 12.85 3.48
Coll:MPI_Isend 1000000 8 91 542.05 27.10 3.30
Coll:MPI_Isend 10000000 8 915 4891.06 244.55 3.66
Coll:MPI_Put 1000 8 0 2.26 0.11 0.79
Coll:MPI_Put 10000 8 0 7.09 0.35 2.52
Coll:MPI_Put 50000 8 4 37.25 1.86 2.40
Coll:MPI_Put 100000 8 9 74.29 3.71 2.41
Coll:MPI_Put 250000 8 22 186.02 9.30 2.40
Coll:MPI_Put 500000 8 45 372.22 18.61 2.40
Coll:MPI_Put 1000000 8 91 740.21 37.01 2.42
Coll:MPI_Put 10000000 8 915 7467.59 373.38 2.39
From these outputs, you can see in the last column that the bandwidth per rank reaches about 3.5 GB/s with pt2pt, whereas with UCX it does not exceed 2.5 GB/s (for MPI_Put at the largest size, 3.20 GB/s versus 2.39 GB/s, roughly a 25% drop).
I have tried many flags, such as -x UCX_TLS=ib,cuda_copy,cuda_ipc and others, but none gave me bandwidth similar to the pt2pt results, let alone better.
So, if you have any ideas (maybe @bosilca?), that would be really great.
Many thanks.
Details of the installation
Here are the flags I used to build Open MPI 4.0.4 on Summit:
$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz &&
tar -xzvf openmpi-4.0.4.tar.gz &&
cd openmpi-4.0.4 &&
./configure \
--prefix=<prefix_path> \
--enable-picky \
--enable-visibility \
--enable-contrib-no-build=vt \
--enable-mpirun-prefix-by-default \
--enable-dlopen \
--enable-mpi1-compatibility \
--enable-shared \
--with-cma \
--with-hwloc=${HWLOC_ROOT} \
--with-cuda=${CUDA_ROOT} \
--with-zlib=/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-7.4.0/zlib-1.2.11-tdykbkiueylpgx2rshpms3k3ncw5g3f6 \
--with-ucx=${UCX_ROOT} \
--with-mxm=/opt/mellanox/mxm \
--with-pmix=internal \
--with-wrapper-ldflags= \
--without-lsf \
--without-psm \
--without-libfabric \
--without-verbs \
--without-psm2 \
--without-alps \
--without-sge \
--without-slurm \
--without-tm \
--without-loadleveler \
--disable-debug \
--disable-memchecker \
--disable-oshmem \
--disable-java \
--disable-mpi-java \
--disable-man-pages &&
make -j 20 &&
make install
Note that the output of ucx_info is:
$ ucx_info -v
# UCT version=1.10.0 revision bbf159e
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/ccs/home/scayrols/Installs/ucx-git/gcc-7.4.0/gcc-/hwloc-1.11.11/cuda-/gdrcopy- --enable-compiler-opt=3 --enable-optimizations --disable-profiling --disable-frame-pointer --disable-memtrack --disable-debug --disable-debug-data --disable-params-check --disable-backtrace-detail --disable-logging --disable-mt --with-cuda=/sw/summit/cuda/10.1.243 --with-gdrcopy=/sw/summit/gdrcopy/2.0
Reproducer
Note that I had to change the extension from .cu to .LOG in order to attach it.