
RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13088


Open
wants to merge 1 commit into base: main

Conversation

hominhquan

This PR is a follow-up to #13074.

@janjust
Contributor

janjust commented Feb 11, 2025

@hominhquan thanks for the PR, do you have any performance data and/or testing data?

Member

@bosilca bosilca left a comment

A step in the right direction. I still have an issue with the fact that the thread is not bound and cannot be bound, but we can address that later.

@hominhquan hominhquan force-pushed the mqho/async_progress branch 2 times, most recently from b1f4dad to f3e8fc3 on February 12, 2025 14:03
@hominhquan
Author

hominhquan commented Feb 12, 2025

@hominhquan thanks for the PR, do you have any performance data and/or testing data?

@janjust Yes, I shared in #13074 that we observed a gain of up to 1.4x in OSU_Ireduce. Note that most OSU benchmarks run on only two MPI processes, so memory contention and CPU occupancy are not stressed to their limits.

@devreal
Contributor

devreal commented Feb 12, 2025

@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...

hppritcha
hppritcha previously approved these changes Feb 12, 2025
Member

@hppritcha hppritcha left a comment

There are many other things to consider in terms of performance, but in terms of changes to just support the async thread capability this looks okay. Low risk since it's buy-in.

opal_progress_set_event_flag(OPAL_EVLOOP_ONCE | OPAL_EVLOOP_NONBLOCK);
#endif
/* shutdown async progress thread before tearing down further services */
if (opal_async_progress_thread_spawned) {
Member

This is okay for now but does leave a hole to plug for the sessions model; since this is a buy-in option for the application user it should be okay for now.

Member

Do you mind elaborating on this, @hppritcha? I fail to see the issue with the sessions model.

@hominhquan
Author

hominhquan commented Feb 12, 2025

@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...

I only compared with OSU_Ireduce. I'll run the whole OSU collective suite tonight for a broader view, but only on a single-node shared-memory config, since I don't have access to any cluster system.

bosilca
bosilca previously approved these changes Feb 12, 2025
@hominhquan
Author


@devreal below are our results on a single-node Grace CPU, where we see the sweet spot around 128-256 KB per rank. There was degradation in three operations: osu_iallgather, osu_iallgatherv and osu_ialltoall.
[Figure: osu_speedup_async_progress]

@devreal
Contributor

devreal commented Feb 13, 2025

Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?

@hominhquan
Author

Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?

It likely comes from the synchronization overhead between the progress thread (now executing opal_progress()) and the main thread (waiting for completion at each MPI_Test() or MPI_Wait()). This irreducible cost becomes significant for small messages (Amdahl's law).

@devreal
Contributor

devreal commented Feb 13, 2025

It's not Amdahl's law if the execution gets slower when you add more resources :) In a perfect world, the thread calling MPI_Iallreduce would trigger the progress thread, which would immediately go ahead and execute the required communications before returning control to the main thread. This looks like many tens of microseconds of synchronization overhead, which is more than I would expect.

@hominhquan
Author

As @bosilca said in #13074, back in 2010-2014 it was hard to find a solution that was optimal for all message sizes and all use cases, and I confirm this conclusion. Spawning a thread introduces many side effects whose impact is hard to measure. The idea behind this patch is to (re)open the door to further improvement and fine-tuning (core binding? time-based yield/progression? work stealing?).

@bosilca
Member

bosilca commented Feb 13, 2025

These results look suspicious. OSU doesn't do overlap; it basically posts the non-blocking operation and waits for it.

For small messages I could understand 10 to 20 percent performance degradation, but not 3x. And for large messages on an iallgather, a 2x increase in performance? Where is that extra bandwidth coming from?

What exactly is the speedup you report on this graph?

@hominhquan
Author

hominhquan commented Feb 13, 2025

Recent versions of OSU added, for example, osu_ireduce, which does overlap and measures the overall time of MPI_Ireduce + dummy_compute(latency_seq) + MPI_Wait, where dummy_compute() simulates a computation period of latency_seq = elapsed(MPI_Ireduce + MPI_Wait). The goal is to see whether the MPI library manages to overlap the non-blocking collective 100% with dummy_compute(...). The same pattern exists in osu_iallreduce.c, osu_ialltoall.c, osu_ibcast.c, etc.

The results I showed above are the overall_time of those non-blocking operations with OPAL_ASYNC_PROGRESS=[0|1], on the latest OSU benchmarks (>= 7.x): https://mvapich.cse.ohio-state.edu/benchmarks/

The speedup is expected, since some tasks (e.g. the Ireduce itself, or pure data movement in bcast/scatter/gather) are done truly in parallel with dummy_compute() by the new progress thread, rather than at the end of the communication (in MPI_Wait()).

I tested with MPICH's MPICH_ASYNC_PROGRESS=1 and also observed some performance gain at the time (numbers to be refreshed by new runs).

Again, I know the limitation of OSU in using only two MPI processes, where resource contention is not stressed very far. They nonetheless mimic a realistic, well-written overlapped non-blocking scheme.
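For readers unfamiliar with that pattern, here is a minimal sketch of the measurement loop (illustrative only; dummy_compute() and the variable names are stand-ins, not the actual osu_ireduce source):

#include <mpi.h>

/* Hypothetical helper: busy-wait to simulate computation for 'seconds'. */
static void dummy_compute(double seconds) {
    double t0 = MPI_Wtime();
    while (MPI_Wtime() - t0 < seconds) { /* simulate work */ }
}

/* Time one iteration of: post Ireduce, compute, wait.
 * latency_seq is the previously measured blocking latency of the same
 * Ireduce; with perfect overlap the total time approaches latency_seq. */
static double measure_ireduce_overlap(const double *sendbuf, double *recvbuf,
                                      int count, double latency_seq,
                                      MPI_Comm comm) {
    MPI_Request req;
    double t_start = MPI_Wtime();

    MPI_Ireduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, comm, &req);
    dummy_compute(latency_seq);          /* overlap window */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completion */

    return MPI_Wtime() - t_start;        /* overall_time reported above */
}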

@hjelmn
Member

hjelmn commented Feb 13, 2025

I have been out of the loop for a while, but I thought the idea was to do more targeted progress rather than the hammer that is just looping on opal_progress (which is predictably bad -- we knew this years ago). The concept I remember was signaled sends that could then wake up the remote process, which would then make progress. Did that go nowhere?

The way I remember it, the reason this speeds up large messages is we can progress the RNDV without entering MPI. This is what signaled sends in the BTL were supposed to address. It would send the RTS which would trigger an RDMA get, RTR, whatever then go back to sleep.

@hominhquan hominhquan dismissed stale reviews from bosilca and hppritcha via b3a29bf February 15, 2025 13:08
@hominhquan
Author

Ping: please tell me if this PR needs more discussion, or whether it has merit to be merged.

Contributor

@devreal devreal left a comment

I am in favor of this PR. There is (close to) no harm in merging for users who don't enable the progress thread and I believe it gives us a base for further experimentation.

There are conflicts that need to be resolved though.

@@ -59,6 +59,7 @@ OPAL_DECLSPEC extern int opal_initialized;
OPAL_DECLSPEC extern bool opal_built_with_cuda_support;
OPAL_DECLSPEC extern bool opal_built_with_rocm_support;
OPAL_DECLSPEC extern bool opal_built_with_ze_support;
OPAL_DECLSPEC extern bool opal_async_progress_thread_spawned;
Contributor

This patch removes the OPAL_ENABLE_PROGRESS_THREADS pre-processor guard from many places (like https://github.com/open-mpi/ompi/pull/13088/files#diff-b23aacc904fb4f13b003d1c716bf6db21c5c490a319b58a3ab70e8f3a8226885L48). Can we define opal_async_progress_thread_spawned to false in the header if progress threads are explicitly disabled at compile time? That makes the code cleaner and allows the compiler to optimize away code that is explicitly disabled.
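Something like the following sketch (hypothetical; only opal_async_progress_thread_spawned and the configure guard come from this PR):

#if OPAL_ENABLE_PROGRESS_THREADS == 1
OPAL_DECLSPEC extern bool opal_async_progress_thread_spawned;
#else
/* Compile-time constant: the compiler can then drop the async-progress
 * branches entirely when the feature is configured out. */
#define opal_async_progress_thread_spawned false
#endif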

Author

Yes, done in the new updated version. Thanks for the proposal.

@jsquyres
Member

@hominhquan Can you rebase this to the tip of main and answer @devreal's pending question?

Thanks!

@hominhquan hominhquan force-pushed the mqho/async_progress branch from b3a29bf to c717909 on June 10, 2025 08:40
@hominhquan
Author

hominhquan commented Jun 10, 2025

@jsquyres I've just pushed an updated version, rebased on main and taking @devreal's comment into account.
Sorry for the delayed response.

Comment on lines 350 to 355
while (p_thread_arg->running) {
    const int64_t new_events = _opal_progress();
    opal_atomic_add_fetch_64(&p_thread_arg->nb_events_reported, new_events);
}
Contributor

@devreal devreal Jun 10, 2025

I don't like the tight loop around an atomic update here. We should at least restrict the update to when there were actual events, i.e., new_events > 0, which should be rare relative to the number of iterations this loop will perform.

Also, should we enable opal_progress_yield_when_idle by default when the progress thread is enabled (to reduce contention created by the progress thread)?

Author

I added if (new_events > 0) around the atomic add.

As for opal_progress_yield_when_idle, it is set to true during progress-thread spawning, at opal_progress.c:L226.
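For reference, the adjusted loop looks roughly like this (a sketch based on the snippet quoted above, not necessarily the exact committed code):

while (p_thread_arg->running) {
    const int64_t new_events = _opal_progress();
    /* Only touch the shared counter when something actually happened,
     * which is rare relative to the number of loop iterations. */
    if (new_events > 0) {
        opal_atomic_add_fetch_64(&p_thread_arg->nb_events_reported, new_events);
    }
}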

@hominhquan hominhquan force-pushed the mqho/async_progress branch from c717909 to 757b0f4 on June 10, 2025 10:24
Comment on lines 363 to 376
#if OPAL_ENABLE_PROGRESS_THREADS == 1
    if (opal_async_progress_thread_spawned) {
        /* the async progress thread running alongside may have processed new
         * events; atomically read and reset nb_events_reported to zero.
         */
        return opal_atomic_swap_64(&thread_arg.nb_events_reported, 0);
    } else {
#endif
Contributor

Sorry for the piecemeal comments, I'm discovering new things on each pass... Now the application threads aren't yielding anymore. Any chance we can have them yield here as well to avoid busy looping on this swap?

Author

OK, I got your idea. No problem, I added the yield for the application thread.
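Roughly along these lines (sketch only; sched_yield() from <sched.h> stands in for whatever yield primitive OPAL actually uses, and the exact committed code may differ):

#if OPAL_ENABLE_PROGRESS_THREADS == 1
    if (opal_async_progress_thread_spawned) {
        /* Consume the events counted by the progress thread; if there were
         * none, yield instead of busy-spinning on the swap. */
        int64_t events = opal_atomic_swap_64(&thread_arg.nb_events_reported, 0);
        if (0 == events && opal_progress_yield_when_idle) {
            sched_yield();   /* give the progress thread a chance to run */
        }
        return events;
    }
#endif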

- The SW-based async progress thread was planned a long time ago in
  683efcb, but has never been enabled/implemented since.
- This commit enables spawning an async progress thread to execute the
  _opal_progress() routine when enabled at both compile time and runtime
  (--enable-progress-threads (default=enabled) and env OMPI_ASYNC_PROGRESS or
  OPAL_ASYNC_PROGRESS = 1).
- Fix minor typo in opal_progress.h doxygen comment

Signed-off-by: Minh Quan Ho <[email protected]>
@hominhquan hominhquan force-pushed the mqho/async_progress branch from 757b0f4 to 41a1b44 on June 10, 2025 20:31
Member

@jsquyres jsquyres left a comment

Does anything need to be documented about this? E.g., how does the user activate this functionality? What are the benefits / drawbacks?

@tonyskjellum

tonyskjellum commented Jun 11, 2025 via email

@hominhquan
Author

Does anything need to be documented about this? E.g., how does the user activate this functionality? What are the benefits / drawbacks?

Right, would a new docs/launching-apps/progress_thread.rst entry be a plausible place? I can see some points:

  • how to enable/disable the feature at configure time and, when enabled, how to activate it at runtime.
  • a note about beneficial use cases, and a warning about possible performance degradation in other situations.

@jsquyres
Member


Sure, that sounds fine. And possibly also something under https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/.

I have not read the code, but @tonyskjellum's and @bosilca's original points on #13074 are well-noted -- this is not a silver bullet for async progress and there are many, many tradeoffs -- potentially even a fairly narrow window where performance gains will be realized. That should be clearly called out in the docs.

@hominhquan
Author

hominhquan commented Jun 12, 2025

@tonyskjellum We are definitely aware that this feature does not always win; it has been extensively studied and discussed in the past. However, there are still situations today where it can improve performance (single-node shared-memory non-blocking communication, spare (or communication-dedicated) cores; see #13088 (comment)). The beneficial window is narrow, but it exists.

Even when enabled at configure time, the progress thread is still not activated at runtime (i.e. no new thread is spawned); the user must set the environment variable OPAL_ASYNC_PROGRESS=1 to activate it, like what was done in MPICH.
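For illustration, the opt-in decision can be sketched as below (hypothetical helper, not the PR's actual code; the real implementation goes through OPAL's environment/MCA handling, and only the variable names and configure option come from this PR):

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch: spawn the progress thread only if the feature was
 * built in (--enable-progress-threads) AND the user asked for it at
 * runtime via OPAL_ASYNC_PROGRESS=1 or OMPI_ASYNC_PROGRESS=1. */
static bool want_async_progress(void)
{
#if OPAL_ENABLE_PROGRESS_THREADS == 1
    const char *env = getenv("OPAL_ASYNC_PROGRESS");
    if (NULL == env) {
        env = getenv("OMPI_ASYNC_PROGRESS");
    }
    return (NULL != env) && (0 != atoi(env));
#else
    return false;   /* feature compiled out */
#endif
}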

AC_MSG_RESULT([yes])
opal_want_progress_threads=1
fi
AC_DEFINE_UNQUOTED([OPAL_ENABLE_PROGRESS_THREADS], [$opal_want_progress_threads],
Member

I saw the comment about maintaining the OPAL_ENABLE_PROGRESS_THREADS define but removing its uses all over the code, but the answer makes no sense to me. Not everything is protected by opal_async_progress_thread_spawned so the compiler will never be able to completely remove all extra code and variables without complaining (such as variable set but not used).

If there is no bug in the old progress-thread support, maintain that code as much as possible.

}
#endif

int opal_progress(void)
Member

I don't think it makes sense to track and return the number of events handled by the progress thread. The reason is that it will mostly be inaccurate if there are multiple threads calling into opal_progress to check for progress, and that would lead them to make incorrect assumptions (because they will not see events that were already consumed via opal_atomic_swap_64 by another thread).

But this is not really an issue; it only affects a single location, ompi/request/req_test.c, and the impact will be minimal. I propose you remove all atomic operations around thread_arg.nb_events_reported and make opal_progress return 0 in all cases. Then, in ompi/request/req_test.c, always goto recheck_request_status after the call to opal_progress.

@bosilca
Member

bosilca commented Jun 12, 2025

I saw your question about the smcuda BTL and its peculiar support for threads (i.e. having its own thread waiting on a FIFO to be signalled by peers when messages are posted). I don't think we want to maintain that support: it would spawn yet another thread, and have it progress only the smcuda BTL. As a result the smcuda BTL would be progressed twice as much as the others (once by its own thread, because all the other local processes will signal that FIFO, and once by the progress thread via the BTL progress function mca_btl_smcuda_component_progress).
