
Begin Step slower on large meshes (Benchmarking Adios) #4254

@abhishek1297

Description


We (the Melissa project) are currently trying to move from our ZeroMQ implementation to Adios2 in order to leverage RDMA support. After several issues with Adios SST, we were finally able to run it on Jean Zay, which has an OmniPath network, by setting the following MPI variables:

export OMPI_MCA_plm=slurm
export OMPI_MCA_mtl=psm2
export OMPI_MCA_osc=rdma
export OMPI_MCA_pml=ob1
export PSM2_MULTIRAIL=1 # required when the reader is on a different node.

These flags ensure that DataTransport=rdma is applied properly and that no runtime errors are thrown.

On Jean Zay, the OpenMPI PML can be configured with either cm (the default) or ob1. If we use cm and pass DataTransport=rdma, Adios overrides this choice and falls back to evpath. With ob1 it keeps rdma, so we know RDMA is actually being used.
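For reference, this is roughly how we declare the SST IO and request the RDMA data plane on the reader side (a simplified sketch assuming the adios2 >= 2.9 Python API; the IO and stream names are illustrative):

from mpi4py import MPI
from adios2 import Adios
import adios2.bindings as bindings

adios = Adios(comm=MPI.COMM_WORLD)
io = adios.declare_io("melissa_reader")                   # illustrative IO name
io.set_engine("SST")
io.set_parameters({"DataTransport": "rdma"})              # the parameter discussed above
engine = io.open("heat_solver.sst", bindings.Mode.Read)   # illustrative stream name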

Setup

We are now benchmarking Adios' communication latency against our ZeroMQ implementation using a heat PDE solver that sends a 1000x1000 mesh for 100 timesteps.

We are running this benchmark on 3 nodes:

  • 10 writers on two different nodes (5 writers per node), each writer using 4 MPI ranks.
  • 1 reader on its own node, with no other process running alongside, using a single MPI rank.

Problem

Strangely, Adios turns out to be slower than the zmq implementation. The total runtime of the entire program is:

zmq=00:08:46
adios=00:15:09

The post-reception processing is exactly the same in both the zmq and Adios versions. So I profiled the reader's receiving loop and found that the begin_step call is extremely slow when the mesh is as large as 1000x1000; it takes up most of the execution time.
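For completeness, the profiles below are collected roughly like this (a sketch, assuming cProfile/pstats; "server" stands in for the object defined in base_server.py and the output path is illustrative):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
server.receive()            # the reader's receiving loop (base_server.py, line 828)
profiler.disable()

profiler.dump_stats("profiles/receive.prof")
pstats.Stats("profiles/receive.prof").sort_stats("tottime").print_stats(20)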

These are the profiler statistics for the functions called within receive(), sorted by internal time (tottime).

Mon Jul 22 14:02:45 2024    internode/nodes3/1000x1000/rdma_1K_internode/profiles/receive.prof

         449269 function calls (439269 primitive calls) in 906.311 seconds

   Ordered by: internal time
   List reduced from 191 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
=======================================================================
     1000  863.424    0.863  863.426    0.863 engine.py:49(begin_step)
=======================================================================
     1000   30.814    0.031   32.831    0.033 iterative_moments.py:31(increment)
     1000    3.874    0.004    3.876    0.004 engine.py:123(get)
     1000    2.017    0.002    2.017    0.002 iterative_moments.py:48(get_delta)
       10    1.431    0.143    1.431    0.143 io.py:245(open)
       10    1.251    0.125    1.251    0.125 engine.py:32(close)

I am assuming that a major chunk of the time inside begin_step() is spent in some MPI collective calls, but 0.8 seconds per call seems absurdly large for RDMA. In the SstVerbose output, Adios picks the hfi_* OmniPath interfaces, so we know that it at least chooses RDMA correctly.

I did the same run with a 100x100 mesh, and the begin_step time becomes negligible.

Mon Jul 22 15:38:01 2024    internode/nodes3/100x100/rdma_100_internode/profiles/receive.prof

         513337 function calls (503337 primitive calls) in 7.644 seconds

   Ordered by: internal time
   List reduced from 191 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    1.455    0.145    1.455    0.146 io.py:245(open)
     1000    1.422    0.001    1.425    0.001 engine.py:123(get)
       10    1.103    0.110    1.103    0.110 engine.py:32(close)
11000/1000    0.427    0.000    0.760    0.001 arrayprint.py:789(recurser)
     1000    0.223    0.000    3.592    0.004 base_server.py:668(get_data)
     1000    0.207    0.000    0.225    0.000 iterative_moments.py:31(increment)
     1000    0.181    0.000    0.519    0.001 arrayprint.py:934(fillFormat)
=======================================================================
     1000    0.152    0.000    0.153    0.000 engine.py:49(begin_step)
=======================================================================

I am not really sure whether this is hardware-specific, but can you help us investigate what could be the bottleneck inside the begin_step calls?

I am providing our reader-side code, base_server.py:

  • Line 828, receive(): the receiver loop that goes over all currently active SST streams until all writers have finished (a simplified sketch follows this list).
  • Line 614, check_data(): calls begin_step().
  • Line 662, get_data(): calls get() and end_step().
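Since base_server.py is long, here is what that loop roughly boils down to per stream (illustrative only: the variable name, buffer dtype, and process callback are placeholders, and the real code interleaves several streams with extra bookkeeping):

import numpy as np
import adios2.bindings as bindings

def receive(streams, process):
    # streams: dict mapping a stream name to its (io, engine) pair.
    # process: callback applied to each received timestep (stands in for the
    #          iterative_moments statistics in the real code).
    while streams:
        for name, (io, engine) in list(streams.items()):
            # check_data(): begin_step blocks until the writer's metadata for
            # the next step is available (this is where the 0.8 s/step goes).
            status = engine.begin_step()
            if status == bindings.StepStatus.EndOfStream:
                engine.close()
                del streams[name]
                continue
            if status != bindings.StepStatus.OK:
                continue  # NotReady / OtherError: revisit this stream later

            # get_data(): schedule the read; end_step() flushes it.
            var = io.inquire_variable("temperature")        # variable name assumed
            buf = np.empty((1000, 1000), dtype=np.float64)  # mesh size from the benchmark
            engine.get(var, buf)
            engine.end_step()

            process(buf)  # same post-reception processing as with zmq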

Extra

To calculate the communication time for each timestep, I am doing the following:

engine.begin_step()          # not included in the timing
start = MPI.Wtime()
engine.get(var, buf)
engine.end_step()
end = MPI.Wtime()
comm_time = end - start
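For comparison, a variant that times the synchronization in begin_step separately would look like this (sketch):

t0 = MPI.Wtime()
engine.begin_step()          # reader waits here for the writer's step metadata
t1 = MPI.Wtime()
engine.get(var, buf)
engine.end_step()            # data-plane transfer is flushed here
t2 = MPI.Wtime()

sync_time = t1 - t0          # what the profile attributes to begin_step
comm_time = t2 - t1          # what we currently report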

Should the begin_step call also be included in the communication time?

The zmq reception is a bit different in that we send partial timesteps and stitch the partial data back together before processing it. So the zmq communication time starts when the first partial timestep is received and ends when the entire timestep has been stitched together.
