Description
We, the Melissa team, are currently trying to shift from our ZeroMQ implementation to ADIOS2 in order to leverage RDMA support. After several issues with ADIOS SST, we were finally able to run it on Jean Zay, which has an Omni-Path network, by setting the following MPI variables:
export OMPI_MCA_plm=slurm
export OMPI_MCA_mtl=psm2
export OMPI_MCA_osc=rdma
export OMPI_MCA_pml=ob1
export PSM2_MULTIRAIL=1 # required when the reader is on a different node
These flags ensure that DataTransport=rdma is set properly and that no runtime errors are thrown.
On Jean Zay, the Open MPI PML can be configured with either cm (the default) or ob1. If we use cm and pass DataTransport=rdma, ADIOS overrides this choice and falls back to evpath. With ob1 it keeps rdma, so we know RDMA is being used.
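For context, this is roughly how the reader side selects the SST engine and the rdma data plane. This is only a minimal sketch assuming the classic (pre-2.9) adios2 Python bindings; the IO label and stream name are placeholders, and our actual setup lives in base_server.py.

from mpi4py import MPI
import adios2

# Declare an IO object and select SST with the RDMA data plane.
adios = adios2.ADIOS(MPI.COMM_WORLD)
io = adios.DeclareIO("reader")                  # placeholder IO label
io.SetEngine("SST")
io.SetParameters({"DataTransport": "rdma"})     # replaced by evpath when PML=cm
engine = io.Open("heat.sst", adios2.Mode.Read)  # placeholder stream name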
Setup
We are now benchmarking ADIOS's communication latency against our ZeroMQ implementation, using a heat PDE solver that sends a 1000x1000 mesh for 100 timesteps (a sketch of the writer-side traffic pattern is given after the node layout below).
We are running this benchmark on 3 nodes:
- 10 writers spread over two nodes (5 writers per node), with 4 MPI ranks per writer.
- 1 reader on its own node, with no other process running alongside it, using 1 MPI rank.
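Our writer code is not attached here, but the traffic pattern is essentially the following sketch (hypothetical variable and stream names, classic adios2 bindings, and an illustrative row-block decomposition; the real solver differs):

import numpy as np
from mpi4py import MPI
import adios2

comm = MPI.COMM_WORLD                 # 4 ranks per writer in our runs
nx, ny = 1000, 1000                   # global mesh size
local_rows = nx // comm.Get_size()    # illustrative row-block decomposition
field = np.zeros((local_rows, ny), dtype=np.float64)

adios = adios2.ADIOS(comm)
io = adios.DeclareIO("writer")
io.SetEngine("SST")
var = io.DefineVariable("temperature", field,
                        [nx, ny],                           # global shape
                        [comm.Get_rank() * local_rows, 0],  # local start
                        [local_rows, ny],                   # local count
                        adios2.ConstantDims)

engine = io.Open("heat.sst", adios2.Mode.Write)
for step in range(100):
    # ... solver update of `field` ...
    engine.BeginStep()
    engine.Put(var, field)
    engine.EndStep()                  # the step becomes visible to the reader here
engine.Close()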
Problem
Strangely, ADIOS seems slower than the ZeroMQ implementation. The total time for the entire program is:
zmq=00:08:46
adios=00:15:09
The post-reception processing is exactly the same for both ZeroMQ and ADIOS. So I profiled the reader's receiving loop and found that, when the mesh is as large as 1000x1000, the begin_step call is extremely slow and takes most of the execution time.
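For reference, the profiles below were collected and printed roughly as in this sketch (assuming cProfile/pstats, which matches the output format; server.receive is a hypothetical stand-in for the actual entry point in base_server.py):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
server.receive()                       # hypothetical call into the reader loop
profiler.disable()
profiler.dump_stats("receive.prof")

# Print the 20 most expensive functions by internal time.
pstats.Stats("receive.prof").sort_stats("tottime").print_stats(20)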
These are the profiler logs for the functions called inside receive(), sorted by internal time.
Mon Jul 22 14:02:45 2024 internode/nodes3/1000x1000/rdma_1K_internode/profiles/receive.prof
449269 function calls (439269 primitive calls) in 906.311 seconds
Ordered by: internal time
List reduced from 191 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
=======================================================================
1000 863.424 0.863 863.426 0.863 engine.py:49(begin_step)
=======================================================================
1000 30.814 0.031 32.831 0.033 iterative_moments.py:31(increment)
1000 3.874 0.004 3.876 0.004 engine.py:123(get)
1000 2.017 0.002 2.017 0.002 iterative_moments.py:48(get_delta)
10 1.431 0.143 1.431 0.143 io.py:245(open)
10 1.251 0.125 1.251 0.125 engine.py:32(close)
I am assuming that a major chunk of the time inside begin_step() is due to some MPI collective calls, but ~0.86 seconds per call seems absurdly large for RDMA. In the SstVerbose output, ADIOS picks the hfi_* Omni-Path interfaces, so we know that it at least chooses RDMA correctly.
I did the same run with a 100x100 mesh, and the begin_step time is negligible.
Mon Jul 22 15:38:01 2024 internode/nodes3/100x100/rdma_100_internode/profiles/receive.prof
513337 function calls (503337 primitive calls) in 7.644 seconds
Ordered by: internal time
List reduced from 191 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
10 1.455 0.145 1.455 0.146 io.py:245(open)
1000 1.422 0.001 1.425 0.001 engine.py:123(get)
10 1.103 0.110 1.103 0.110 engine.py:32(close)
11000/1000 0.427 0.000 0.760 0.001 arrayprint.py:789(recurser)
1000 0.223 0.000 3.592 0.004 base_server.py:668(get_data)
1000 0.207 0.000 0.225 0.000 iterative_moments.py:31(increment)
1000 0.181 0.000 0.519 0.001 arrayprint.py:934(fillFormat)
=======================================================================
1000 0.152 0.000 0.153 0.000 engine.py:49(begin_step)
=======================================================================
I am not sure whether this is hardware-specific, but could you help us investigate what the potential bottleneck inside the begin_step calls might be?
I provide our reader-side code file, base_server.py; a simplified sketch of its structure follows this list:
- Line 828, receive(): the receiver loop that iterates over all currently active SST files until all writers have finished.
- Line 614, check_data(): calls begin_step().
- Line 662, get_data(): calls get() and end_step().
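For readers who do not want to open the file, here is a heavily simplified sketch of that structure (classic adios2 bindings, a hypothetical variable name, and none of the real bookkeeping; the actual implementation differs):

import numpy as np
import adios2

def check_data(engine):
    # By default BeginStep blocks until the writer publishes its next step
    # (the real code may poll with a timeout instead).
    return engine.BeginStep()

def get_data(io, engine):
    var = io.InquireVariable("temperature")   # hypothetical variable name
    data = np.empty(var.Shape(), dtype=np.float64)
    engine.Get(var, data)
    engine.EndStep()                          # the Get() completes here
    return data

def receive(streams):
    # streams: (io, engine) pairs, one per active SST file.
    active = list(streams)
    while active:                             # until all writers have finished
        for io, engine in list(active):
            status = check_data(engine)
            if status == adios2.StepStatus.OK:
                get_data(io, engine)
            elif status == adios2.StepStatus.EndOfStream:
                engine.Close()
                active.remove((io, engine))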
Extra
To calculate the communication time for each timestep, I am doing the following:
engine.begin_step()
start = MPI.Wtime()
engine.get()
engine.end_step()
end = MPI.Wtime()
comm_time = end - start
Should the begin_step call also be included in the communication time?
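If it helps, here is a sketch of how both phases could be timed separately, so that the begin_step wait and the get/end_step transfer can be reported independently:

t0 = MPI.Wtime()
engine.begin_step()            # waiting for the writer's next step
t1 = MPI.Wtime()
engine.get()
engine.end_step()              # data delivery completes here
t2 = MPI.Wtime()
wait_time = t1 - t0            # time spent inside begin_step
transfer_time = t2 - t1        # what comm_time currently measures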
The ZeroMQ reception is a bit different: we send partial timesteps and stitch the partial data together before processing it. So the ZeroMQ communication time starts when the first partial timestep is received and ends when the entire timestep has been stitched together.