dlopen: Fix regressions caused by changes to CHPL_CACHE_REMOTE#28949
Merged
dlongnecke-cray merged 2 commits intoJun 9, 2026
Conversation
Signed-off-by: David Longnecker <dlongnecke-cray@users.noreply.github.com>
Signed-off-by: David Longnecker <dlongnecke-cray@users.noreply.github.com>
benharsh
approved these changes
Jun 9, 2026
arifthpe
added a commit
that referenced
this pull request
Jun 16, 2026
Add perf annotations (under `all`) for `CHPL_CACHE_REMOTE` move regression and fix (#28869 and #28949, respectively). [reviewed by @benharsh and @dlongnecke-cray , thanks!]
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR attempts to fix performance regressions caused by changes to
CHPL_CACHE_REMOTE.Previously,
CHPL_CACHE_REMOTEwas declared as anextern const intwithin the runtime code (and used inchpl-cache.h). The references to it in the runtime would be resolved when a Chapel program was linked (recall that normally, the runtime is distributed as a static archive and a copy of it is bundled in with every Chapel program).As of #28869,
CHPL_CACHE_REMOTEis now accessed via a struct field read instead of being a const global variable.This change is causing a 5-15% drop in performance.
The hot spot is its use in
chpl_cache_enabled(), which can potentially be invoked per iteration in very tight loops.My guess is that in the old world, when throwing
--fast, the (LLVM only?) compiler is doing LTO (link-time optimization) and basically inliningCHPL_CACHE_REMOTE. After #28869 it is no longer able to do that and is forced to do a memory read.The first thing I tried to do was optimize the struct read to avoid function calls on the fast path. The second thing I tried to do was to adjust
chpl_cache_enabled()to reference a runtime global variable that is computed during runtime setup. Neither had any effect, which makes sense - the cost of both is about the same, which is a memory read, and neither can substitute for basically inlining away the memory read (and branch) completely.The next step was to figure out all the places where
chpl_cache_enabled()is called. The majority are inchpl-comm-compiler-macros.h, which as far as I can tell is a header (primarily meant for use by the compiler) which contains wrappers around GET/PUT primitives. The macros all basically boil down to:chpl_cache_enabled()would be called)SHORT TERM FIX
I've added a shim in
chpl-comm-compiler-macros.hthat basically does what the old code used to do. It declaresCHPL_CACHE_REMOTEas anextern const intso that it can take advantage of LTO. This restores performance on16-node-apollo-hdr.LONG TERM FIX
If we want to let the code in this header reference
CHPL_CACHE_REMOTEdirectly so that it can take advantage of LTO (which appears to be critical for these functions), then we have to isolate them so that only the compiler can use their current forms and the runtime code cannot invoke them directly.This is because the runtime is no longer allowed to directly reference (via declaration and linking) constants that are declared per-Chapel-program.
So structurally, all the places in the runtime code that invoke the helpers in this header ought to be changed to effectively do something else. Maybe they can use a runtime-only copy of the code that does the slower thing, which is OK as long as the shims used by the compiler in tight loops are performant.
We might also be able to whip up some macro-based magic to make this a bit less painful. I'll figure something out after the 2.9 release.
TESTING
16-node-apollo-hdrstandardCOMM=gasnetCOMM=ofiReviewed by @benharsh. Thanks!