
Begin Step slower on large meshes (Benchmarking Adios) #4254

@abhishek1297

Description


We (the Melissa project) are currently trying to move from our ZeroMQ implementation to Adios2 in order to leverage RDMA support. After several issues with Adios SST, we were finally able to run it on Jean Zay, which has an OmniPath network, by setting the following MPI variables:

export OMPI_MCA_plm=slurm
export OMPI_MCA_mtl=psm2
export OMPI_MCA_osc=rdma
export OMPI_MCA_pml=ob1
export PSM2_MULTIRAIL=1 # required when the reader is on a different node.

These flags ensure that DataTransport=rdma is applied properly and that no runtime errors are thrown.

On Jean Zay, the OpenMPI PML can be configured with either cm (the default) or ob1. If we use cm and pass DataTransport=rdma, Adios overrides this choice and falls back to evpath. With ob1 it keeps rdma, so we know RDMA is actually being used.
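For reference, this is roughly how we declare the SST IO and request the RDMA data plane on the reader side (a simplified sketch assuming the adios2 >= 2.9 Python API; the IO and stream names are illustrative):

from mpi4py import MPI
from adios2 import Adios
import adios2.bindings as bindings

adios = Adios(comm=MPI.COMM_WORLD)
io = adios.declare_io("melissa_reader")                   # illustrative IO name
io.set_engine("SST")
io.set_parameters({"DataTransport": "rdma"})              # the parameter discussed above
engine = io.open("heat_solver.sst", bindings.Mode.Read)   # illustrative stream name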

Setup

We are now benchmarking Adios' communication latency against our ZeroMQ implementation using a heat PDE solver that sends a 1000x1000 mesh for 100 timesteps.

We are running this benchmark on 3 nodes:

  • 10 writers on two different nodes (5 writers per node), each writer using 4 MPI ranks.
  • 1 reader on its own node, with no other process running alongside, using a single MPI rank.

Problem

Strangely, Adios turns out to be slower than the zmq implementation. The total runtime of the entire program is:

zmq=00:08:46
adios=00:15:09

The post-reception processing is exactly the same in both the zmq and Adios versions. So I profiled the reader's receiving loop and found that the begin_step call is extremely slow when the mesh is as large as 1000x1000; it takes up most of the execution time.
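For completeness, the profiles below are collected roughly like this (a sketch, assuming cProfile/pstats; "server" stands in for the object defined in base_server.py and the output path is illustrative):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
server.receive()            # the reader's receiving loop (base_server.py, line 828)
profiler.disable()

profiler.dump_stats("profiles/receive.prof")
pstats.Stats("profiles/receive.prof").sort_stats("tottime").print_stats(20)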

These are the profiler statistics for the functions called within receive(), sorted by internal time (tottime).

Mon Jul 22 14:02:45 2024    internode/nodes3/1000x1000/rdma_1K_internode/profiles/receive.prof

         449269 function calls (439269 primitive calls) in 906.311 seconds

   Ordered by: internal time
   List reduced from 191 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
=======================================================================
     1000  863.424    0.863  863.426    0.863 engine.py:49(begin_step)
=======================================================================
     1000   30.814    0.031   32.831    0.033 iterative_moments.py:31(increment)
     1000    3.874    0.004    3.876    0.004 engine.py:123(get)
     1000    2.017    0.002    2.017    0.002 iterative_moments.py:48(get_delta)
       10    1.431    0.143    1.431    0.143 io.py:245(open)
       10    1.251    0.125    1.251    0.125 engine.py:32(close)

I am assuming that a major chunk of the time inside begin_step() is spent in some MPI collective calls, but 0.8 seconds per call seems absurdly large for RDMA. In the SstVerbose output, Adios picks the hfi_* OmniPath interfaces, so we know that it at least chooses RDMA correctly.

I did the same run with a 100x100 mesh, and the begin_step time becomes negligible.

Mon Jul 22 15:38:01 2024    internode/nodes3/100x100/rdma_100_internode/profiles/receive.prof

         513337 function calls (503337 primitive calls) in 7.644 seconds

   Ordered by: internal time
   List reduced from 191 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    1.455    0.145    1.455    0.146 io.py:245(open)
     1000    1.422    0.001    1.425    0.001 engine.py:123(get)
       10    1.103    0.110    1.103    0.110 engine.py:32(close)
11000/1000    0.427    0.000    0.760    0.001 arrayprint.py:789(recurser)
     1000    0.223    0.000    3.592    0.004 base_server.py:668(get_data)
     1000    0.207    0.000    0.225    0.000 iterative_moments.py:31(increment)
     1000    0.181    0.000    0.519    0.001 arrayprint.py:934(fillFormat)
=======================================================================
     1000    0.152    0.000    0.153    0.000 engine.py:49(begin_step)
=======================================================================

I am not really sure whether this is hardware-specific, but can you help us investigate what could be the bottleneck inside the begin_step calls?

I am providing our reader-side code, base_server.py:

  • Line 828, receive(): the receiver loop that goes over all currently active SST streams until all writers have finished (a simplified sketch follows this list).
  • Line 614, check_data(): calls begin_step().
  • Line 662, get_data(): calls get() and end_step().
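Since base_server.py is long, here is what that loop roughly boils down to per stream (illustrative only: the variable name, buffer dtype, and process callback are placeholders, and the real code interleaves several streams with extra bookkeeping):

import numpy as np
import adios2.bindings as bindings

def receive(streams, process):
    # streams: dict mapping a stream name to its (io, engine) pair.
    # process: callback applied to each received timestep (stands in for the
    #          iterative_moments statistics in the real code).
    while streams:
        for name, (io, engine) in list(streams.items()):
            # check_data(): begin_step blocks until the writer's metadata for
            # the next step is available (this is where the 0.8 s/step goes).
            status = engine.begin_step()
            if status == bindings.StepStatus.EndOfStream:
                engine.close()
                del streams[name]
                continue
            if status != bindings.StepStatus.OK:
                continue  # NotReady / OtherError: revisit this stream later

            # get_data(): schedule the read; end_step() flushes it.
            var = io.inquire_variable("temperature")        # variable name assumed
            buf = np.empty((1000, 1000), dtype=np.float64)  # mesh size from the benchmark
            engine.get(var, buf)
            engine.end_step()

            process(buf)  # same post-reception processing as with zmq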

Extra

To calculate the communication time for each timestep, I am doing the following:

engine.begin_step()          # not included in the timing
start = MPI.Wtime()
engine.get(var, buf)
engine.end_step()
end = MPI.Wtime()
comm_time = end - start
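For comparison, a variant that times the synchronization in begin_step separately would look like this (sketch):

t0 = MPI.Wtime()
engine.begin_step()          # reader waits here for the writer's step metadata
t1 = MPI.Wtime()
engine.get(var, buf)
engine.end_step()            # data-plane transfer is flushed here
t2 = MPI.Wtime()

sync_time = t1 - t0          # what the profile attributes to begin_step
comm_time = t2 - t1          # what we currently report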

Should the begin_step call also be included in the communication time?

The zmq reception is a bit different in that we send partial timesteps and stitch the partial data back together before processing it. So the zmq communication time starts when the first partial timestep is received and ends when the entire timestep has been stitched together.
