
RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13088


Open
wants to merge 1 commit into base: main

Conversation

hominhquan

This PR is a follow-up to #13074.

@janjust
Contributor

janjust commented Feb 11, 2025

@hominhquan thanks for the PR, do you have any performance data and/or testing data?

Member

@bosilca bosilca left a comment

A step in the right direction. I still have an issue with the fact that the thread is not bound and cannot be bound, but we can address that later.

@hominhquan hominhquan force-pushed the mqho/async_progress branch 2 times, most recently from b1f4dad to f3e8fc3 on February 12, 2025 14:03
@hominhquan
Author

hominhquan commented Feb 12, 2025

@hominhquan thanks for the PR, do you have any performance data and/or testing data?

@janjust Yes, I shared in #13074 that we observed a gain of up to 1.4x in OSU_Ireduce. Note that most OSU benchmarks run on only two MPI processes, so memory contention and CPU occupancy are not stressed to their limits.

@devreal
Contributor

devreal commented Feb 12, 2025

@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...

hppritcha
hppritcha previously approved these changes Feb 12, 2025
Member

@hppritcha hppritcha left a comment

There are many other things to consider in terms of performance, but in terms of changes to just support the async thread capability this looks okay. Low risk since it's buy-in.

opal_progress_set_event_flag(OPAL_EVLOOP_ONCE | OPAL_EVLOOP_NONBLOCK);
#endif
/* shutdown async progress thread before tearing down further services */
if (opal_async_progress_thread_spawned) {
Member

This is okay for now but does leave a hole to plug for the sessions model; since this is a buy-in option for the application user it should be okay for now.

Member

Do you mind elaborating on this, @hppritcha? I fail to see the issue with the sessions model.

@hominhquan
Author

hominhquan commented Feb 12, 2025

@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...

I only compared with OSU_Ireduce. I'll run the whole OSU collective suite tonight for a broader view, but only on a single-node shared-memory config, since I don't have access to any cluster system.

bosilca
bosilca previously approved these changes Feb 12, 2025
@hominhquan
Author


@devreal below are our results on a single-node Grace CPU, where we see the sweet spot around 128-256 KB per rank. There was degradation in three operations: osu_iallgather, osu_iallgatherv and osu_ialltoall.
[Figure: osu_speedup_async_progress]

@devreal
Contributor

devreal commented Feb 13, 2025

Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?

@hominhquan
Author

Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?

It likely comes from the synchronization overhead between the progress thread (now executing opal_progress()) and the main thread (waiting for completion at each MPI_Test() or MPI_Wait()). This irreducible cost becomes significant for small messages (Amdahl's law).

@devreal
Contributor

devreal commented Feb 13, 2025

It's not Amdahl's law if the execution gets slower when you add more resources :) In a perfect world, the thread calling MPI_Iallreduce would trigger the progress thread, which would immediately go ahead and execute the required communications before returning control to the main thread. This looks like many tens of microseconds of synchronization overhead, which is more than I would expect.

@hominhquan
Author

As @bosilca said in #13074, back in 2010-2014 it was hard to find a solution that was optimal for all message sizes and all use cases, and I confirm this conclusion. Spawning a thread introduces many side effects whose impact is hard to measure. The idea behind this patch is to (re)open the door to further improvement and fine-tuning (core binding? time-based yield/progression? work stealing?).

@bosilca
Member

bosilca commented Feb 13, 2025

These results look suspicious. OSU doesn't do overlap; it basically posts the non-blocking operation and waits for it.

For small messages I could understand 10 to 20 percent performance degradation, but not 3x. And for large messages on an iallgather, a 2x increase in performance? Where is that extra bandwidth coming from?

What exactly is the speedup you report on this graph?

@hominhquan
Author

hominhquan commented Feb 13, 2025

Recent versions of OSU added, for example, osu_ireduce, which does overlap and measures the overall time of MPI_Ireduce + dummy_compute(latency_seq) + MPI_Wait, where dummy_compute() simulates a computation period of latency_seq = elapsed(MPI_Ireduce + MPI_Wait). The goal is to see whether the MPI library manages to overlap the non-blocking collective 100% with dummy_compute(...). The same pattern exists in osu_iallreduce.c, osu_ialltoall.c, osu_ibcast.c, etc.

The results I showed above are the overall_time of those non-blocking operations with OPAL_ASYNC_PROGRESS=[0|1], on the latest OSU benchmarks (>= 7.x): https://mvapich.cse.ohio-state.edu/benchmarks/

The speedup is expected, since some tasks (e.g. the Ireduce itself, or pure data movement in bcast/scatter/gather) are done truly in parallel with dummy_compute() by the new progress thread, rather than at the end of the communication (in MPI_Wait()).

I tested with MPICH's MPICH_ASYNC_PROGRESS=1 and also observed some performance gain at the time (numbers to be refreshed by new runs).

Again, I know the limitation of OSU in using only two MPI processes, where resource contention is not stressed very far. They nonetheless mimic a realistic, well-written overlapped non-blocking scheme.
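For readers unfamiliar with that pattern, here is a minimal sketch of the measurement loop (illustrative only; dummy_compute() and the variable names are stand-ins, not the actual osu_ireduce source):

#include <mpi.h>

/* Hypothetical helper: busy-wait to simulate computation for 'seconds'. */
static void dummy_compute(double seconds) {
    double t0 = MPI_Wtime();
    while (MPI_Wtime() - t0 < seconds) { /* simulate work */ }
}

/* Time one iteration of: post Ireduce, compute, wait.
 * latency_seq is the previously measured blocking latency of the same
 * Ireduce; with perfect overlap the total time approaches latency_seq. */
static double measure_ireduce_overlap(const double *sendbuf, double *recvbuf,
                                      int count, double latency_seq,
                                      MPI_Comm comm) {
    MPI_Request req;
    double t_start = MPI_Wtime();

    MPI_Ireduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, comm, &req);
    dummy_compute(latency_seq);          /* overlap window */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completion */

    return MPI_Wtime() - t_start;        /* overall_time reported above */
}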

@hjelmn
Member

hjelmn commented Feb 13, 2025

I have been out of the loop for a while, but I thought the idea was to do more targeted progress rather than the hammer that is just looping on opal_progress (which is predictably bad -- we knew this years ago). The concept I remember was signaled sends that could then wake up the remote process, which would then make progress. Did that go nowhere?

The way I remember it, the reason this speeds up large messages is we can progress the RNDV without entering MPI. This is what signaled sends in the BTL were supposed to address. It would send the RTS which would trigger an RDMA get, RTR, whatever then go back to sleep.

@hominhquan hominhquan dismissed stale reviews from bosilca and hppritcha via b3a29bf February 15, 2025 13:08
@hominhquan
Author

Ping: please tell me if this PR needs more discussion, or whether it has merit to be merged.

Contributor

@devreal devreal left a comment

I am in favor of this PR. There is (close to) no harm in merging for users who don't enable the progress thread and I believe it gives us a base for further experimentation.

There are conflicts that need to be resolved though.

@@ -59,6 +59,7 @@ OPAL_DECLSPEC extern int opal_initialized;
OPAL_DECLSPEC extern bool opal_built_with_cuda_support;
OPAL_DECLSPEC extern bool opal_built_with_rocm_support;
OPAL_DECLSPEC extern bool opal_built_with_ze_support;
OPAL_DECLSPEC extern bool opal_async_progress_thread_spawned;
Contributor

This patch removes the OPAL_ENABLE_PROGRESS_THREADS pre-processor guard from many places (like https://github.com/open-mpi/ompi/pull/13088/files#diff-b23aacc904fb4f13b003d1c716bf6db21c5c490a319b58a3ab70e8f3a8226885L48). Can we define opal_async_progress_thread_spawned to false in the header if progress threads are explicitly disabled at compile time? That makes the code cleaner and allows the compiler to optimize away code that is explicitly disabled.
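Something like the following sketch (hypothetical; only opal_async_progress_thread_spawned and the configure guard come from this PR):

#if OPAL_ENABLE_PROGRESS_THREADS == 1
OPAL_DECLSPEC extern bool opal_async_progress_thread_spawned;
#else
/* Compile-time constant: the compiler can then drop the async-progress
 * branches entirely when the feature is configured out. */
#define opal_async_progress_thread_spawned false
#endif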

Author

Yes, done in the new updated version. Thanks for the proposal.

@jsquyres
Member

@hominhquan Can you rebase this to the tip of main and answer @devreal's pending question?

Thanks!

@hominhquan hominhquan force-pushed the mqho/async_progress branch from b3a29bf to c717909 on June 10, 2025 08:40
@hominhquan
Author

hominhquan commented Jun 10, 2025

@jsquyres I've just pushed an updated version, rebased on main and taking @devreal's comment into account.
Sorry for the delayed response.

Comment on lines 350 to 355
while (p_thread_arg->running) {
    const int64_t new_events = _opal_progress();
    opal_atomic_add_fetch_64(&p_thread_arg->nb_events_reported, new_events);
}
Contributor

@devreal devreal Jun 10, 2025

I don't like the tight loop around an atomic update here. We should at least restrict the update to when there were actual events, i.e., new_events > 0, which should be rare relative to the number of iterations this loop will perform.

Also, should we enable opal_progress_yield_when_idle by default when the progress thread is enabled (to reduce contention created by the progress thread)?

Author

I added if (new_events > 0) around the atomic add.

As for opal_progress_yield_when_idle, it is set to true during progress-thread spawning, at opal_progress.c:L226.
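For reference, the adjusted loop looks roughly like this (a sketch based on the snippet quoted above, not necessarily the exact committed code):

while (p_thread_arg->running) {
    const int64_t new_events = _opal_progress();
    /* Only touch the shared counter when something actually happened,
     * which is rare relative to the number of loop iterations. */
    if (new_events > 0) {
        opal_atomic_add_fetch_64(&p_thread_arg->nb_events_reported, new_events);
    }
}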

@hominhquan hominhquan force-pushed the mqho/async_progress branch from c717909 to 757b0f4 on June 10, 2025 10:24
Comment on lines 363 to 376
#if OPAL_ENABLE_PROGRESS_THREADS == 1
    if (opal_async_progress_thread_spawned) {
        /* the async progress thread running alongside may have processed new
         * events; atomically read and reset nb_events_reported to zero.
         */
        return opal_atomic_swap_64(&thread_arg.nb_events_reported, 0);
    } else {
#endif
Contributor

Sorry for the piecemeal comments, I'm discovering new things on each pass... Now the application threads aren't yielding anymore. Any chance we can have them yield here as well to avoid busy looping on this swap?

Author

OK, I got your idea. No problem, I added the yield for the application thread.
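Roughly along these lines (sketch only; sched_yield() from <sched.h> stands in for whatever yield primitive OPAL actually uses, and the exact committed code may differ):

#if OPAL_ENABLE_PROGRESS_THREADS == 1
    if (opal_async_progress_thread_spawned) {
        /* Consume the events counted by the progress thread; if there were
         * none, yield instead of busy-spinning on the swap. */
        int64_t events = opal_atomic_swap_64(&thread_arg.nb_events_reported, 0);
        if (0 == events && opal_progress_yield_when_idle) {
            sched_yield();   /* give the progress thread a chance to run */
        }
        return events;
    }
#endif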

- The SW-based async progress thread was planned a long time ago in
  683efcb, but has never been enabled/implemented since.
- This commit enables spawning an async progress thread to execute the
  _opal_progress() routine when enabled at both compile time and runtime
  (--enable-progress-threads (default=enabled) and env OMPI_ASYNC_PROGRESS or
  OPAL_ASYNC_PROGRESS = 1).
- Fix minor typo in opal_progress.h doxygen comment

Signed-off-by: Minh Quan Ho <[email protected]>
@hominhquan hominhquan force-pushed the mqho/async_progress branch from 757b0f4 to 41a1b44 on June 10, 2025 20:31
Member

@jsquyres jsquyres left a comment

Does anything need to be documented about this? E.g., how does the user activate this functionality? What are the benefits / drawbacks?

@tonyskjellum

tonyskjellum commented Jun 11, 2025 via email

@hominhquan
Author

Does anything need to be documented about this? E.g., how does the user activate this functionality? What are the benefits / drawbacks?

Right, would a new docs/launching-apps/progress_thread.rst entry be a plausible place? I can see some points:

  • how to enable/disable the feature at configure time and, when enabled, how to activate it at runtime.
  • a note about beneficial use cases, and a warning about possible performance degradation in other situations.

@jsquyres
Member


Sure, that sounds fine. And possibly also something under https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/.

I have not read the code, but @tonyskjellum's and @bosilca's original points on #13074 are well-noted -- this is not a silver bullet for async progress and there are many, many tradeoffs -- potentially even a fairly narrow window where performance gains will be realized. That should be clearly called out in the docs.

@hominhquan
Author

hominhquan commented Jun 12, 2025

@tonyskjellum We are definitely aware that this feature does not always win; it has been extensively studied and discussed in the past. However, there are still situations today where it can improve performance (single-node shared-memory non-blocking communication, spare (or communication-dedicated) cores; see #13088 (comment)). The beneficial window is narrow, but it exists.

Even when enabled at configure time, the progress thread is still not activated at runtime (i.e. no new thread is spawned); the user must set the environment variable OPAL_ASYNC_PROGRESS=1 to activate it, like what was done in MPICH.
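For illustration, the opt-in decision can be sketched as below (hypothetical helper, not the PR's actual code; the real implementation goes through OPAL's environment/MCA handling, and only the variable names and configure option come from this PR):

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch: spawn the progress thread only if the feature was
 * built in (--enable-progress-threads) AND the user asked for it at
 * runtime via OPAL_ASYNC_PROGRESS=1 or OMPI_ASYNC_PROGRESS=1. */
static bool want_async_progress(void)
{
#if OPAL_ENABLE_PROGRESS_THREADS == 1
    const char *env = getenv("OPAL_ASYNC_PROGRESS");
    if (NULL == env) {
        env = getenv("OMPI_ASYNC_PROGRESS");
    }
    return (NULL != env) && (0 != atoi(env));
#else
    return false;   /* feature compiled out */
#endif
}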

AC_MSG_RESULT([yes])
opal_want_progress_threads=1
fi
AC_DEFINE_UNQUOTED([OPAL_ENABLE_PROGRESS_THREADS], [$opal_want_progress_threads],
Member

I saw the comment about maintaining the OPAL_ENABLE_PROGRESS_THREADS define but removing its uses all over the code, but the answer makes no sense to me. Not everything is protected by opal_async_progress_thread_spawned so the compiler will never be able to completely remove all extra code and variables without complaining (such as variable set but not used).

If there is no bug in the old progress-thread support, maintain that code as much as possible.

}
#endif

int opal_progress(void)
Member

I don't think it makes sense to track and return the number of events handled by the progress thread. The reason is that it will mostly be inaccurate if there are multiple threads calling into opal_progress to check for progress, and that would lead them to make incorrect assumptions (because they will not see events that were already consumed via opal_atomic_swap_64 by another thread).

But this is not really an issue; it only affects a single location, ompi/request/req_test.c, and the impact will be minimal. I propose you remove all atomic operations around thread_arg.nb_events_reported and make opal_progress return 0 in all cases. Then, in ompi/request/req_test.c, always goto recheck_request_status after the call to opal_progress.

@bosilca
Member

bosilca commented Jun 12, 2025

I saw your question about the smcuda BTL and its peculiar support for threads (i.e. having its own thread waiting on a FIFO to be signalled by peers when messages are posted). I don't think we want to maintain that support: it would spawn yet another thread, and have it progress only the smcuda BTL. As a result the smcuda BTL would be progressed twice as much as the others (once by its own thread, because all the other local processes will signal that FIFO, and once by the progress thread via the BTL progress function mca_btl_smcuda_component_progress).
