Skip to content

dlopen: Fix regressions caused by changes to CHPL_CACHE_REMOTE#28949

Merged
dlongnecke-cray merged 2 commits into
chapel-lang:mainfrom
dlongnecke-cray:dynamic-loading-49
Jun 9, 2026
Merged

dlopen: Fix regressions caused by changes to CHPL_CACHE_REMOTE#28949
dlongnecke-cray merged 2 commits into
chapel-lang:mainfrom
dlongnecke-cray:dynamic-loading-49

Conversation

@dlongnecke-cray

@dlongnecke-cray dlongnecke-cray commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

This PR attempts to fix performance regressions caused by changes to CHPL_CACHE_REMOTE.

Previously, CHPL_CACHE_REMOTE was declared as an extern const int within the runtime code (and used in chpl-cache.h). The references to it in the runtime would be resolved when a Chapel program was linked (recall that normally, the runtime is distributed as a static archive and a copy of it is bundled in with every Chapel program).

As of #28869, CHPL_CACHE_REMOTE is now accessed via a struct field read instead of being a const global variable.

This change is causing a 5-15% drop in performance.

The hot spot is its use in chpl_cache_enabled(), which can potentially be invoked per iteration in very tight loops.

My guess is that in the old world, when throwing --fast, the (LLVM only?) compiler is doing LTO (link-time optimization) and basically inlining CHPL_CACHE_REMOTE. After #28869 it is no longer able to do that and is forced to do a memory read.

The first thing I tried to do was optimize the struct read to avoid function calls on the fast path. The second thing I tried to do was to adjust chpl_cache_enabled() to reference a runtime global variable that is computed during runtime setup. Neither had any effect, which makes sense - the cost of both is about the same, which is a memory read, and neither can substitute for basically inlining away the memory read (and branch) completely.

The next step was to figure out all the places where chpl_cache_enabled() is called. The majority are in chpl-comm-compiler-macros.h, which as far as I can tell is a header (primarily meant for use by the compiler) which contains wrappers around GET/PUT primitives. The macros all basically boil down to:

  • If src/dst locale are local, do a mem-move
  • If (GPU-based condition) do GPU-based things
  • If remote and cache is enabled, use cache (where chpl_cache_enabled() would be called)
  • Otherwise, do a runtime PUT/GET

SHORT TERM FIX

I've added a shim in chpl-comm-compiler-macros.h that basically does what the old code used to do. It declares CHPL_CACHE_REMOTE as an extern const int so that it can take advantage of LTO. This restores performance on 16-node-apollo-hdr.

LONG TERM FIX

If we want to let the code in this header reference CHPL_CACHE_REMOTE directly so that it can take advantage of LTO (which appears to be critical for these functions), then we have to isolate them so that only the compiler can use their current forms and the runtime code cannot invoke them directly.

This is because the runtime is no longer allowed to directly reference (via declaration and linking) constants that are declared per-Chapel-program.

So structurally, all the places in the runtime code that invoke the helpers in this header ought to be changed to effectively do something else. Maybe they can use a runtime-only copy of the code that does the slower thing, which is OK as long as the shims used by the compiler in tight loops are performant.

We might also be able to whip up some macro-based magic to make this a bit less painful. I'll figure something out after the 2.9 release.

TESTING

  • Verified regressions are fixed on 16-node-apollo-hdr
  • standard
  • COMM=gasnet
  • Built with COMM=ofi

Reviewed by @benharsh. Thanks!

Signed-off-by: David Longnecker <dlongnecke-cray@users.noreply.github.com>
Signed-off-by: David Longnecker <dlongnecke-cray@users.noreply.github.com>
@dlongnecke-cray dlongnecke-cray merged commit f649c5c into chapel-lang:main Jun 9, 2026
12 checks passed
arifthpe added a commit that referenced this pull request Jun 16, 2026
Add perf annotations (under `all`) for `CHPL_CACHE_REMOTE` move
regression and fix (#28869 and
#28949, respectively).

[reviewed by @benharsh and @dlongnecke-cray , thanks!]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants