Description
Hi all,
I am contacting you because of some OSC issues on GPUs.
First of all, I have noticed that the osc/pt2pt component has been removed since version 5.0. As a consequence, if I do not compile with UCX, which is optional, one-sided communication no longer works at all, at least on GPUs.
(Note that I have tried to compile Open MPI 5.0 with the latest UCX, and a simple MPI_Put from GPU to GPU deadlocks.)
I found that Open MPI 4.0.4 with the latest UCX compiles and runs.
(Note that, for small sizes, say fewer than 1e6 doubles, we need to set UCX_ZCOPY_THRESH=1, as in the mpirun commands below.)
The context is the following:
I have a set of GPUs that exchange data mainly through MPI_Put, with MPI_Win_fence for synchronization.
When using pt2pt, I get the expected bandwidth per rank. However, when using UCX, the bandwidth of the same pattern drops.
I have written a reproducer so that you can try it.
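To make the pattern concrete, here is a minimal sketch of what the reproducer exercises: each rank exposes a CUDA device buffer through an MPI window and pushes a block of doubles to every other rank with MPI_Put inside an MPI_Win_fence epoch. The buffer sizes, the all-to-all target loop, and the timing below are illustrative assumptions, not the attached .cu file.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t nval = 1000000;   /* doubles sent to each peer (assumed size) */

    /* Device buffers: one send block, one receive slot per rank. */
    double *send_buf, *recv_buf;
    cudaMalloc((void **)&send_buf, nval * sizeof(double));
    cudaMalloc((void **)&recv_buf, (size_t)nranks * nval * sizeof(double));
    cudaMemset(send_buf, 0, nval * sizeof(double));

    /* Expose the device receive buffer as an RMA window (requires CUDA-aware MPI). */
    MPI_Win win;
    MPI_Win_create(recv_buf, (MPI_Aint)(nranks * nval * sizeof(double)),
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double t0 = MPI_Wtime();
    MPI_Win_fence(0, win);                       /* open the access epoch */
    for (int peer = 0; peer < nranks; ++peer) {
        if (peer == rank) continue;
        MPI_Put(send_buf, (int)nval, MPI_DOUBLE, peer,
                (MPI_Aint)(rank * nval), (int)nval, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                       /* close the epoch: puts are complete */
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("put/fence epoch: %.3f ms\n", elapsed * 1e3);

    MPI_Win_free(&win);
    cudaFree(send_buf);
    cudaFree(recv_buf);
    MPI_Finalize();
    return 0;
}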
The reproducer gives me the following performance on Summit:
$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ^ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx
Label | nval (count) | size (bytes) | Volume (MB) | PingPong x20 (ms) | avg_send (ms) | GB/s
____________________________________________________________________________________________
MPI_Alltoall 1000 8 0 10.49 0.52 0.17
MPI_Alltoall 10000 8 0 15.54 0.78 1.15
MPI_Alltoall 50000 8 4 33.96 1.70 2.63
MPI_Alltoall 100000 8 9 72.16 3.61 2.48
MPI_Alltoall 250000 8 22 142.01 7.10 3.15
MPI_Alltoall 500000 8 45 273.70 13.69 3.27
MPI_Alltoall 1000000 8 91 515.96 25.80 3.47
MPI_Alltoall 10000000 8 915 5057.64 252.88 3.54
Coll:MPI_Isend 1000 8 0 11.11 0.56 0.16
Coll:MPI_Isend 10000 8 0 21.60 1.08 0.83
Coll:MPI_Isend 50000 8 4 30.88 1.54 2.90
Coll:MPI_Isend 100000 8 9 66.84 3.34 2.68
Coll:MPI_Isend 250000 8 22 137.92 6.90 3.24
Coll:MPI_Isend 500000 8 45 253.12 12.66 3.53
Coll:MPI_Isend 1000000 8 91 496.56 24.83 3.60
Coll:MPI_Isend 10000000 8 915 4930.47 246.52 3.63
Coll:MPI_Put 1000 8 0 9.06 0.45 0.20
Coll:MPI_Put 10000 8 0 8.14 0.41 2.20
Coll:MPI_Put 50000 8 4 29.46 1.47 3.04
Coll:MPI_Put 100000 8 9 58.10 2.90 3.08
Coll:MPI_Put 250000 8 22 141.57 7.08 3.16
Coll:MPI_Put 500000 8 45 282.01 14.10 3.17
Coll:MPI_Put 1000000 8 91 560.79 28.04 3.19
Coll:MPI_Put 10000000 8 915 5589.39 279.47 3.20
Now, when using UCX, I get:
$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx
Label | nval (count) | size (bytes) | Volume (MB) | PingPong x20 (ms) | avg_send (ms) | GB/s
____________________________________________________________________________________________
MPI_Alltoall 1000 8 0 9.77 0.49 0.18
MPI_Alltoall 10000 8 0 15.59 0.78 1.15
MPI_Alltoall 50000 8 4 43.95 2.20 2.03
MPI_Alltoall 100000 8 9 79.04 3.95 2.26
MPI_Alltoall 250000 8 22 159.94 8.00 2.79
MPI_Alltoall 500000 8 45 276.29 13.81 3.24
MPI_Alltoall 1000000 8 91 524.07 26.20 3.41
MPI_Alltoall 10000000 8 915 5048.09 252.40 3.54
Coll:MPI_Isend 1000 8 0 8.10 0.40 0.22
Coll:MPI_Isend 10000 8 0 32.06 1.60 0.56
Coll:MPI_Isend 50000 8 4 52.25 2.61 1.71
Coll:MPI_Isend 100000 8 9 59.39 2.97 3.01
Coll:MPI_Isend 250000 8 22 126.40 6.32 3.54
Coll:MPI_Isend 500000 8 45 257.02 12.85 3.48
Coll:MPI_Isend 1000000 8 91 542.05 27.10 3.30
Coll:MPI_Isend 10000000 8 915 4891.06 244.55 3.66
Coll:MPI_Put 1000 8 0 2.26 0.11 0.79
Coll:MPI_Put 10000 8 0 7.09 0.35 2.52
Coll:MPI_Put 50000 8 4 37.25 1.86 2.40
Coll:MPI_Put 100000 8 9 74.29 3.71 2.41
Coll:MPI_Put 250000 8 22 186.02 9.30 2.40
Coll:MPI_Put 500000 8 45 372.22 18.61 2.40
Coll:MPI_Put 1000000 8 91 740.21 37.01 2.42
Coll:MPI_Put 10000000 8 915 7467.59 373.38 2.39
From these outputs, you can see in the last column that the bandwidth per rank reaches about 3.5 GB/s with pt2pt, whereas with UCX it does not exceed 2.5 GB/s (for MPI_Put at the largest size, 3.20 GB/s versus 2.39 GB/s, roughly a 25% drop).
I have tried many flags, such as -x UCX_TLS=ib,cuda_copy,cuda_ipc and others, but none gave me bandwidth similar to the pt2pt results, let alone better.
So, if you have any ideas (maybe @bosilca?), that would be really great.
Many thanks.
Details of the installation
Here are the flags I used to build Open MPI 4.0.4 on Summit:
$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz &&
tar -xzvf openmpi-4.0.4.tar.gz &&
cd openmpi-4.0.4 &&
./configure \
--prefix=<prefix_path> \
--enable-picky \
--enable-visibility \
--enable-contrib-no-build=vt \
--enable-mpirun-prefix-by-default \
--enable-dlopen \
--enable-mpi1-compatibility \
--enable-shared \
--with-cma \
--with-hwloc=${HWLOC_ROOT} \
--with-cuda=${CUDA_ROOT} \
--with-zlib=/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-7.4.0/zlib-1.2.11-tdykbkiueylpgx2rshpms3k3ncw5g3f6 \
--with-ucx=${UCX_ROOT} \
--with-mxm=/opt/mellanox/mxm \
--with-pmix=internal \
--with-wrapper-ldflags= \
--without-lsf \
--without-psm \
--without-libfabric \
--without-verbs \
--without-psm2 \
--without-alps \
--without-sge \
--without-slurm \
--without-tm \
--without-loadleveler \
--disable-debug \
--disable-memchecker \
--disable-oshmem \
--disable-java \
--disable-mpi-java \
--disable-man-pages &&
make -j 20 &&
make install
Note that the output of ucx_info is:
$ ucx_info -v
# UCT version=1.10.0 revision bbf159e
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/ccs/home/scayrols/Installs/ucx-git/gcc-7.4.0/gcc-/hwloc-1.11.11/cuda-/gdrcopy- --enable-compiler-opt=3 --enable-optimizations --disable-profiling --disable-frame-pointer --disable-memtrack --disable-debug --disable-debug-data --disable-params-check --disable-backtrace-detail --disable-logging --disable-mt --with-cuda=/sw/summit/cuda/10.1.243 --with-gdrcopy=/sw/summit/gdrcopy/2.0
Reproducer
Note that I had to change the extension from .cu to .LOG in order to attach it.