
OMPI+UCX on GPUs : drop of performance compared with pt2pt #7965

Open
@cayrols

Description


Hi all,

I am contacting you about some OSC issues on GPUs.
First of all, I have noticed that the pt2pt OSC component has been removed as of version 5.0. Therefore, if I do not compile with UCX, which is optional, one-sided communication on GPUs does not work at all.
(Note that I have tried to compile Open MPI 5.0 with the latest UCX, and a simple MPI_Put from GPU to GPU deadlocks.)

I found that Open MPI 4.0.4 with the latest UCX compiles and runs.
(Note that for small sizes, i.e., fewer than about 1e6 doubles, we need to set the variable UCX_ZCOPY_THRESH=1.)

The context is the following:
I have a set of GPUs that exchange data mainly through MPI_Put and MPI_Win_fence.
When using pt2pt, I get the expected bandwidth per rank. However, when using UCX, the same bandwidth drops.
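For reference, the communication pattern is essentially the following (a minimal sketch, assuming a CUDA-aware MPI and one GPU per rank; the buffer names and sizes are illustrative, not taken from the reproducer, and it needs mpirun plus GPUs to actually run):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch of the pattern: each rank exposes a device buffer through an
 * MPI window and pushes data to every peer with MPI_Put, synchronizing
 * the access epoch with MPI_Win_fence. Illustrative only. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t nval = 1000000;   /* doubles sent to each peer (example value) */
    double *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, nval * sizeof(double));
    cudaMalloc((void **)&recvbuf, (size_t)nranks * nval * sizeof(double));

    MPI_Win win;
    MPI_Win_create(recvbuf, (MPI_Aint)nranks * nval * sizeof(double),
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open the epoch */
    for (int peer = 0; peer < nranks; peer++) {
        if (peer == rank) continue;
        MPI_Put(sendbuf, nval, MPI_DOUBLE, peer,
                (MPI_Aint)rank * nval, nval, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                       /* close the epoch */

    MPI_Win_free(&win);
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
```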

I have written a reproducer so that you can try it yourselves.
On Summit, it gives the following performance:

$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ^ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx

          Label   nval(x1e+00)   size   Volume(x1MB)   PingPongX 20(ms)   avg_send(ms)   GB/s

   MPI_Alltoall        1000	     8	         0	     10.49	      0.52	      0.17
   MPI_Alltoall       10000	     8	         0	     15.54	      0.78	      1.15
   MPI_Alltoall       50000	     8	         4	     33.96	      1.70	      2.63
   MPI_Alltoall      100000	     8	         9	     72.16	      3.61	      2.48
   MPI_Alltoall      250000	     8	        22	    142.01	      7.10	      3.15
   MPI_Alltoall      500000	     8	        45	    273.70	     13.69	      3.27
   MPI_Alltoall     1000000	     8	        91	    515.96	     25.80	      3.47
   MPI_Alltoall    10000000	     8	       915	   5057.64	    252.88	      3.54

 Coll:MPI_Isend        1000	     8	         0	     11.11	      0.56	      0.16
 Coll:MPI_Isend       10000	     8	         0	     21.60	      1.08	      0.83
 Coll:MPI_Isend       50000	     8	         4	     30.88	      1.54	      2.90
 Coll:MPI_Isend      100000	     8	         9	     66.84	      3.34	      2.68
 Coll:MPI_Isend      250000	     8	        22	    137.92	      6.90	      3.24
 Coll:MPI_Isend      500000	     8	        45	    253.12	     12.66	      3.53
 Coll:MPI_Isend     1000000	     8	        91	    496.56	     24.83	      3.60
 Coll:MPI_Isend    10000000	     8	       915	   4930.47	    246.52	      3.63

   Coll:MPI_Put        1000	     8	         0	      9.06	      0.45	      0.20
   Coll:MPI_Put       10000	     8	         0	      8.14	      0.41	      2.20
   Coll:MPI_Put       50000	     8	         4	     29.46	      1.47	      3.04
   Coll:MPI_Put      100000	     8	         9	     58.10	      2.90	      3.08
   Coll:MPI_Put      250000	     8	        22	    141.57	      7.08	      3.16
   Coll:MPI_Put      500000	     8	        45	    282.01	     14.10	      3.17
   Coll:MPI_Put     1000000	     8	        91	    560.79	     28.04	      3.19
   Coll:MPI_Put    10000000	     8	       915	   5589.39	    279.47	      3.20

Now, when using UCX, I get:

$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx

          Label   nval(x1e+00)   size   Volume(x1MB)   PingPongX 20(ms)   avg_send(ms)   GB/s
   MPI_Alltoall        1000	     8	         0	      9.77	      0.49	      0.18
   MPI_Alltoall       10000	     8	         0	     15.59	      0.78	      1.15
   MPI_Alltoall       50000	     8	         4	     43.95	      2.20	      2.03
   MPI_Alltoall      100000	     8	         9	     79.04	      3.95	      2.26
   MPI_Alltoall      250000	     8	        22	    159.94	      8.00	      2.79
   MPI_Alltoall      500000	     8	        45	    276.29	     13.81	      3.24
   MPI_Alltoall     1000000	     8	        91	    524.07	     26.20	      3.41
   MPI_Alltoall    10000000	     8	       915	   5048.09	    252.40	      3.54

 Coll:MPI_Isend        1000	     8	         0	      8.10	      0.40	      0.22
 Coll:MPI_Isend       10000	     8	         0	     32.06	      1.60	      0.56
 Coll:MPI_Isend       50000	     8	         4	     52.25	      2.61	      1.71
 Coll:MPI_Isend      100000	     8	         9	     59.39	      2.97	      3.01
 Coll:MPI_Isend      250000	     8	        22	    126.40	      6.32	      3.54
 Coll:MPI_Isend      500000	     8	        45	    257.02	     12.85	      3.48
 Coll:MPI_Isend     1000000	     8	        91	    542.05	     27.10	      3.30
 Coll:MPI_Isend    10000000	     8	       915	   4891.06	    244.55	      3.66

   Coll:MPI_Put        1000	     8	         0	      2.26	      0.11	      0.79
   Coll:MPI_Put       10000	     8	         0	      7.09	      0.35	      2.52
   Coll:MPI_Put       50000	     8	         4	     37.25	      1.86	      2.40
   Coll:MPI_Put      100000	     8	         9	     74.29	      3.71	      2.41
   Coll:MPI_Put      250000	     8	        22	    186.02	      9.30	      2.40
   Coll:MPI_Put      500000	     8	        45	    372.22	     18.61	      2.40
   Coll:MPI_Put     1000000	     8	        91	    740.21	     37.01	      2.42
   Coll:MPI_Put    10000000	     8	       915	   7467.59	    373.38	      2.39

From these outputs you can see, in the last column, that the bandwidth per rank reaches about 3.5 GB/s with pt2pt, whereas with UCX the MPI_Put bandwidth does not exceed about 2.4 GB/s.
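For what it's worth, the GB/s column can be reconstructed from the other columns. The sketch below assumes the benchmark reports (nranks × nval × size) bytes per epoch divided by avg_send, expressed in GiB/s, and Volume as the same byte count in MiB; this matches the rows above, but the formula is my inference, not taken from the reproducer:

```python
def volume_mib(nranks: int, nval: int, size_bytes: int) -> float:
    """Total bytes moved per epoch, in MiB (the Volume column)."""
    return nranks * nval * size_bytes / 2**20

def bandwidth_gib_s(nranks: int, nval: int, size_bytes: int,
                    avg_send_ms: float) -> float:
    """Total bytes per epoch over the average time, in GiB/s (the GB/s column)."""
    return nranks * nval * size_bytes / (avg_send_ms * 1e-3) / 2**30

# Largest MPI_Alltoall row from the pt2pt run: nval=1e7, avg_send=252.88 ms
print(int(volume_mib(12, 10_000_000, 8)))                    # 915 (table truncates)
print(round(bandwidth_gib_s(12, 10_000_000, 8, 252.88), 2))  # 3.54
```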

I have tried many flags, such as -x UCX_TLS=ib,cuda_copy,cuda_ipc, among others, but none of them gave me bandwidth similar to, or better than, pt2pt.
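For completeness, these are the kinds of UCX tunables that can be passed via mpirun -x (a configuration sketch; all the variable names are real UCX environment variables, but the values are examples to experiment with, not known-good settings for Summit):

```shell
export UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy  # restrict the transport list
export UCX_ZCOPY_THRESH=1                      # as used in the runs above
export UCX_RNDV_THRESH=8192                    # rendezvous switch-over point
export UCX_RNDV_SCHEME=get_zcopy               # or put_zcopy / auto
export UCX_MEMTYPE_CACHE=n                     # disable the memory-type cache
export UCX_LOG_LEVEL=info                      # more verbose than WARN
```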

So, if you have any ideas, maybe @bosilca, that would be really great.

Many thanks.


Details of the installation

Here are the flags that I used to build it on Summit:

$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz && 
   tar -xzvf openmpi-4.0.4.tar.gz && 
   cd openmpi-4.0.4 && 
   ./configure \
    --prefix=<prefix_path> \
    --enable-picky \
    --enable-visibility \
    --enable-contrib-no-build=vt \
    --enable-mpirun-prefix-by-default \
    --enable-dlopen \
    --enable-mpi1-compatibility \
    --enable-shared \
    --enable-mpirun-prefix-by-default \
    --with-cma \
    --with-hwloc=${HWLOC_ROOT} \
    --with-cuda=${CUDA_ROOT} \
    --with-zlib=/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-7.4.0/zlib-1.2.11-tdykbkiueylpgx2rshpms3k3ncw5g3f6 \
    --with-ucx=${UCX_ROOT} \
    --with-mxm=/opt/mellanox/mxm \
    --with-pmix=internal \
    --with-wrapper-ldflags= \
    --without-lsf \
    --without-psm \
    --without-libfabric \
    --without-verbs \
    --without-psm2 \
    --without-alps \
    --without-sge \
    --without-slurm \
    --without-tm \
    --without-loadleveler \
    --disable-debug \
    --disable-memchecker \
    --disable-oshmem \
    --disable-java \
    --disable-mpi-java \
    --disable-man-pages &&
  make -j 20 &&
  make install

Note that the output of ucx_info is:

$ ucx_info -v
# UCT version=1.10.0 revision bbf159e
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/ccs/home/scayrols/Installs/ucx-git/gcc-7.4.0/gcc-/hwloc-1.11.11/cuda-/gdrcopy- --enable-compiler-opt=3 --enable-optimizations --disable-profiling --disable-frame-pointer --disable-memtrack --disable-debug --disable-debug-data --disable-params-check --disable-backtrace-detail --disable-logging --disable-mt --with-cuda=/sw/summit/cuda/10.1.243 --with-gdrcopy=/sw/summit/gdrcopy/2.0
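In case it is useful, CUDA support in both stacks can be double-checked with standard ompi_info/ucx_info queries (the commands below are real, but I have not captured their output from Summit here):

```shell
# Confirm Open MPI was built with CUDA support:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# List the UCX transports/devices that handle CUDA memory:
ucx_info -d | grep -i cuda
```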

Reproducer

Note that I had to change the extension from .cu to .LOG in order to attach it.

bench_ucx.LOG
