RFC: Provide equivalence of MPICH_ASYNC_PROGRESS #13088
Conversation
@hominhquan thanks for the PR, do you have any performance data and/or testing data?
A step in the right direction. I still have an issue with the fact that the thread is not bound and cannot be bound, but we can address that later.
Force-pushed from b1f4dad to f3e8fc3
@janjust Yes, I shared in #13074 that we observed a gain of up to 1.4x in …
@hominhquan What is the impact on collective operations, both in shared and distributed memory? I imagine there to be more contention...
There are many other things to consider in terms of performance, but in terms of changes to just support the async thread capability this looks okay. Low risk, since it's buy-in.
```c
    opal_progress_set_event_flag(OPAL_EVLOOP_ONCE | OPAL_EVLOOP_NONBLOCK);
#endif
    /* shutdown async progress thread before tearing down further services */
    if (opal_async_progress_thread_spawned) {
```
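A minimal, self-contained sketch of what such a guarded shutdown path could look like (the variable names mirror the PR, but the helpers here are hypothetical stand-ins, not the PR's actual code):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the PR's internals: a spawned flag, a running flag
 * observed by the progress loop, and the thread handle. */
static volatile bool thread_running = false;
static bool opal_async_progress_thread_spawned = false;
static pthread_t async_progress_thread;

static void *async_progress_fn(void *arg)
{
    (void) arg;
    while (thread_running) {
        /* _opal_progress() would be called here */
    }
    return NULL;
}

static void shutdown_async_progress(void)
{
    /* shutdown async progress thread before tearing down further services */
    if (opal_async_progress_thread_spawned) {
        thread_running = false; /* loop condition observed by the thread */
        pthread_join(async_progress_thread, NULL);
        opal_async_progress_thread_spawned = false;
    }
}
```

The key point is ordering: the thread is signaled and joined before any service it might still be polling is torn down.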
This is okay for now, but it does leave a hole to plug for the sessions model. Since this is a buy-in option for the application user, it should be okay for now.
Do you mind elaborating on this @hppritcha? I fail to see the issue with the sessions model.
I only compared with …
Force-pushed from f3e8fc3 to b3a29bf
@devreal below are our results on a single-node Grace CPU, in which we see the sweet spot around 128-256 KB per rank. There was degradation on three operations: …
Is that a 3x slowdown for small messages? That would be very concerning. Any idea what would cause that?
It would come from the synchronization overhead between the progress thread (now executing …
It's not Amdahl's law if the execution gets slower when you add more resources :) In a perfect world, the thread calling …
As @bosilca said in #13074 (back in 2010-2014), it was hard to get an optimal solution for all message sizes and all use cases, and I confirm this conclusion. Spawning a thread introduces many side effects whose impact we can hardly measure. The idea behind this patch is to (re)open the door to further improvement and fine-tuning (core binding? time-based yield/progression? work stealing?).
These results look suspicious. OSU doesn't do overlap; it basically posts the non-blocking operation and waits for it. For small messages I could understand a 10 to 20 percent performance degradation, but not 3x. And for large messages on an iallgather, a 2x increase in performance? Where is that extra bandwidth coming from? What exactly is the speedup you report on this graph?
Recent versions of OSU added, for example, …

The results I showed above are on the …

The speedup is expected from the fact that some tasks (e.g. …

I tested with MPICH's …

Again, I know the limitation of OSU of only using two MPI processes, where resource contention is not stressed very far. They anyway mimic a real, well-written overlapped non-blocking schema.
I have been out of the loop for a while, but I thought the idea was to do more targeted progress rather than the hammer that is just looping on opal_progress (which is predictably bad; we knew this years ago). The concept I remember was signaled sends that could then wake up the remote process, which would then make progress. Did that go nowhere? The way I remember it, the reason this speeds up large messages is that we can progress the RNDV without entering MPI. This is what signaled sends in the BTL were supposed to address. It would send the RTS which would trigger an RDMA get, RTR, whatever, then go back to sleep.
Ping: please tell me whether this PR needs more discussion, or whether it has merit to be merged?
I am in favor of this PR. There is (close to) no harm in merging for users who don't enable the progress thread and I believe it gives us a base for further experimentation.
There are conflicts that need to be resolved though.
```diff
@@ -59,6 +59,7 @@ OPAL_DECLSPEC extern int opal_initialized;
 OPAL_DECLSPEC extern bool opal_built_with_cuda_support;
 OPAL_DECLSPEC extern bool opal_built_with_rocm_support;
 OPAL_DECLSPEC extern bool opal_built_with_ze_support;
+OPAL_DECLSPEC extern bool opal_async_progress_thread_spawned;
```
This patch removes the `OPAL_ENABLE_PROGRESS_THREADS` pre-processor guard from many places (like https://github.com/open-mpi/ompi/pull/13088/files#diff-b23aacc904fb4f13b003d1c716bf6db21c5c490a319b58a3ab70e8f3a8226885L48). Can we define `opal_async_progress_thread_spawned` to `false` in the header if progress threads are explicitly disabled at compile time? That makes the code cleaner and allows the compiler to optimize away code that is explicitly disabled.
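A hypothetical sketch of this suggestion (not the PR's actual code): alias the runtime flag to a compile-time constant when the feature is configured out, so guarded branches fold away.

```c
#include <stdbool.h>

/* Assume progress threads are disabled at configure time for this sketch. */
#ifndef OPAL_ENABLE_PROGRESS_THREADS
#define OPAL_ENABLE_PROGRESS_THREADS 0
#endif

#if OPAL_ENABLE_PROGRESS_THREADS
extern bool opal_async_progress_thread_spawned; /* set at runtime */
#else
/* Constant: the compiler can eliminate every branch guarded by it. */
#define opal_async_progress_thread_spawned false
#endif

int progressed_by_async_thread(void)
{
    if (opal_async_progress_thread_spawned) {
        return 1; /* dead code when progress threads are compiled out */
    }
    return 0;
}
```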
Yes, done in the new updated version. Thanks for the proposal.
@hominhquan Can you rebase this to the tip of …? Thanks!
Force-pushed from b3a29bf to c717909
opal/runtime/opal_progress.c (outdated)
```c
    while (p_thread_arg->running) {
        const int64_t new_events = _opal_progress();
        opal_atomic_add_fetch_64(&p_thread_arg->nb_events_reported, new_events);
    }
```
I don't like the tight loop around an atomic update here. We should at least restrict the update to when there were actual events, i.e., `new_events > 0`, which should be rare relative to the number of iterations this loop will perform.

Also, should we enable `opal_progress_yield_when_idle` by default when the progress thread is enabled (to reduce contention created by the progress thread)?
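The suggested change could look like the following self-contained sketch (C11 atomics and `sched_yield` stand in for OPAL's own atomics and yield helper; `_opal_progress` is a stub here):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sched.h>

typedef struct {
    volatile bool running;
    _Atomic int64_t nb_events_reported;
} progress_thread_arg_t;

/* Stub: a real implementation would poll the registered callbacks. */
static int64_t _opal_progress(void) { return 0; }

static void progress_loop(progress_thread_arg_t *arg, int max_iters)
{
    for (int i = 0; arg->running && i < max_iters; ++i) {
        const int64_t new_events = _opal_progress();
        if (new_events > 0) {
            /* Only touch the shared counter when something happened. */
            atomic_fetch_add(&arg->nb_events_reported, new_events);
        } else {
            sched_yield(); /* idle iteration: reduce contention */
        }
    }
}
```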
I added `if (new_events > 0)` around the atomic add.

As for `opal_progress_yield_when_idle`, it is set to true during progress-thread spawning, at `opal_progress.c:L226`.
Force-pushed from c717909 to 757b0f4
opal/runtime/opal_progress.c (outdated)
```c
#if OPAL_ENABLE_PROGRESS_THREADS == 1
    if (opal_async_progress_thread_spawned) {
        /* async progress thread alongside may have processed new events,
         * atomically read and reset nb_events_reported to zero.
         */
        return opal_atomic_swap_64(&thread_arg.nb_events_reported, 0);
    } else {
#endif
```
Sorry for the piecemeal comments, I'm discovering new things on each pass... Now the application threads aren't yielding anymore. Any chance we can have them yield here as well to avoid busy looping on this swap?
OK, I got your idea. No problem, I added the yield for the application thread.
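For illustration, the application-thread side of this exchange could be sketched as follows (stand-in C11 atomics; not the PR's actual code): drain the counter filled by the progress thread, and yield when nothing was progressed so we don't busy-loop on the swap.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sched.h>

/* Counter filled by the async progress thread (stand-in). */
static _Atomic int64_t nb_events_reported;

static int64_t drain_async_events(void)
{
    /* Atomically read and reset the counter, like opal_atomic_swap_64. */
    int64_t events = atomic_exchange(&nb_events_reported, 0);
    if (0 == events) {
        sched_yield(); /* idle: let the progress thread run */
    }
    return events;
}
```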
- The SW-based async progress thread was planned a long time ago in 683efcb, but has never been enabled/implemented since.
- This commit enables spawning an async progress thread to execute the _opal_progress() routine when enabled at both compile time and runtime (--enable-progress-threads (default: enabled) and env OMPI_ASYNC_PROGRESS or OPAL_ASYNC_PROGRESS = 1).
- Fix minor typo in opal_progress.h doxygen comment

Signed-off-by: Minh Quan Ho <[email protected]>
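Per the commit message, activation would look roughly like this (a hypothetical usage sketch; the benchmark name is illustrative):

```shell
# 1. Build with progress threads compiled in (the default):
#      ./configure --enable-progress-threads && make install
# 2. Request the async thread at runtime via the environment:
export OMPI_ASYNC_PROGRESS=1   # or: export OPAL_ASYNC_PROGRESS=1
# 3. Launch as usual, e.g.:
#      mpirun -n 2 ./osu_ibcast
```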
Force-pushed from 757b0f4 to 41a1b44
Does anything need to be documented about this? E.g., how does the user activate this functionality? What are the benefits / drawbacks?
All: Why provide it at all? This is a preliminary and questionable design from MPICH… perhaps just thrown out as a straw proposal to generate a paper… some of us noted its weaknesses when it was presented at the ExaMPI workshop as a naive workaround to actually delivering progress via a progress-engine thread and/or delegating it to a progressive transport. I think you should reject this PR. -- Tony Skjellum
Right, can a new …
Sure, that sounds fine. And possibly also something under https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/. I have not read the code, but @tonyskjellum's and @bosilca's original points on #13074 are well noted -- this is not a silver bullet for async progress and there are many, many tradeoffs -- potentially even a fairly narrow window where performance gains will be realized. That should be clearly called out in the docs.
@tonyskjellum We are definitely aware that this feature does not always win; it was extensively studied and discussed in the past. However, there are still situations today where it can improve performance (single-node shared-memory non-blocking communication, spare (or communication-dedicated) cores; see #13088 (comment)). The beneficial window is narrow, but it exists. The progress thread, even when enabled at configure time, is still not activated at runtime (i.e. no new thread is spawned); the user must set an env var …
```m4
    AC_MSG_RESULT([yes])
    opal_want_progress_threads=1
fi
AC_DEFINE_UNQUOTED([OPAL_ENABLE_PROGRESS_THREADS], [$opal_want_progress_threads],
```
I saw the comment about maintaining the `OPAL_ENABLE_PROGRESS_THREADS` define but removing its uses all over the code, but the answer makes no sense to me. Not everything is protected by `opal_async_progress_thread_spawned`, so the compiler will never be able to completely remove all extra code and variables without complaining (such as "variable set but not used").

If there is no bug in the old support for progress threads, maintain the code as much as possible.
```c
}
#endif

int opal_progress(void)
```
I don't think it makes sense to track and return the number of events managed by the progress thread. The reason is that it will mostly be inaccurate if there are multiple threads calling into `opal_progress` to check for progress, and that would lead them to make incorrect assumptions (because they will not see events that were `opal_atomic_swap_64`'d by another thread).

But this is not really an issue: it only affects a single location, `ompi/request/req_test.c`, and the impact will be minimal. I propose you remove all atomic operations around `thread_arg.nb_events_reported`, and make `opal_progress` return 0 in all cases. Then, in `ompi/request/req_test.c`, you always `goto recheck_request_status` after the call to `opal_progress`.
I saw your question about the smcuda BTL and its peculiar support for threads (aka having its own thread waiting on a fifo to be signaled by peers when messages are posted). I don't think we want to maintain that support: it will spawn yet another thread, and have it only progress the smcuda BTL. As a result, the smcuda BTL will be progressed twice as much as the others (once by its own thread, because all the other local processes will signal that fifo, and once by the progress thread via the BTL progress function …
This PR is a follow-up of #13074.