Skip to content

8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression #26429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

mhaessig
Copy link
Contributor

@mhaessig mhaessig commented Jul 22, 2025

A loop of the form

MemorySegment ms = {};
for (long i = 0; i < ms.byteSize() / 8L; i++) {
    // vectorizable work
}

does not vectorize, whereas

MemorySegment ms = {};
long size = ms.byteSize();
for (long i = 0; i < size / 8L; i++) {
    // vectorizable work
}

vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the LShift used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting LShifts modifying the IV of a counted loop:

// Pushing a shift through the iv Phi can get in the way of addressing optimizations or range check elimination
if (n_blk->is_BaseCountedLoop() && n->Opcode() == Op_LShift(n_blk->as_BaseCountedLoop()->bt()) &&
n->in(1) == n_blk->as_BaseCountedLoop()->phi()) {
return n;
}

Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization.

So, why is the counted loop not detected? Because the call to byteSize() is inside the loop head, and CiTypeFlow::clone_loop_heads() duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first ITER_GVN in PHASEIDEALLOOP1 will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization.

Possible solutions

  1. Prevent splitting LShifts in uncounted loops that have the same shape as a counted loop would have. This solution is implemented with UseNewCode as guard.
Problems This is just another one of those exceptions from splitting through phis, and one of the more imprecise at that. So, this will prevent some `LShift`s in uncounted loops being split even though it would be beneficial.
  1. As @merykitty mentions below, overly eager splitting through phis leads to similar problems, e.g. in JDK-8348096. To alleviate we track where we get wins when splitting. If a split onis only profitable in te loop entry, but not on the backedge, we deem it as not profitable.
Alternatives Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only performs loop tree building and then a round of IGVN where `Loop` nodes have been created. This cleans up the duplicated loop limit field access inside of the loop, which enables the counted loop detection in `PHASEIDEALLOOP1`.

Problems

In spirit, this is only a cleanup before the "real" work on loops begins. However, two tests fail with this solution. TestMultiversionRemoveUselessSlowLoop.java fails, because the small loop does not get multiversioned, effectively obviating the test. More critically, TestCombineAddPWithConstantOffsets.java (added in #21898) detects a regression with folding address computations. The problem, as I understand it, is that PHASEIDEALLOOP0 enables cleaning of a long data path that seems to be essential for folding AddP nodes with constant offsets.

I am unsure what the best solution really is the best and would appreciate feedback on that. Personally, I prefer the second option, since the logic of a cleanup pass before the "real" loop opts begin makes sense to me.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression (Sub-task - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26429/head:pull/26429
$ git checkout pull/26429

Update a local copy of the PR:
$ git checkout pull/26429
$ git pull https://git.openjdk.org/jdk.git pull/26429/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26429

View PR using the GUI difftool:
$ git pr show -t 26429

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26429.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 22, 2025

👋 Welcome back mhaessig! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 22, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk
Copy link

openjdk bot commented Jul 22, 2025

@mhaessig The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@merykitty
Copy link
Member

I think one possible solution is to avoid splitting through Phi if there is no benefit. In this case, the only benefit is in the loop entry, while there is none in the loop backedge. If we take frequency into consideration, we may be able to determine that the splitting is not profitable. What do you think?

@mhaessig
Copy link
Contributor Author

I would not rely solely on profile information to solve this, but it might be a good additional piece of information for the first proposed solution.

@merykitty
Copy link
Member

@mhaessig You don't really need profile information, only that the profit is on the loop entry path and there is no profit on the loop backedge.

@mhaessig
Copy link
Contributor Author

I see, but, if I understand correctly, any logic that relates to profit will have to go into split_through_phi(). Since we already have special logic for not splitting LShifts in split_if_with_blocks_pre(), I would prefer to only have it there.

@merykitty
Copy link
Member

merykitty commented Jul 23, 2025

From the principle point of view, splitting a node through the loop Phi is only profitable if the profit is in the loop backedge. From the practical point of view, there are some issues when split_through_phi is applied recklessly such as JDK-8348096. I believe taking loop head into consideration when splitting through Phis can solve these issues. As a result, I think while you are at this issue, it is worth investigating this approach.

@mhaessig mhaessig force-pushed the jdk-8356176-byte-size branch from 45f5d7b to 5400c90 Compare July 23, 2025 14:41
@mhaessig
Copy link
Contributor Author

mhaessig commented Jul 23, 2025

@merykitty, I took me a while to understand, but now I implemented your suggestion and it works at least the case of this issue (testing is ongoing). Thank you for pushing back.

EDIT: It also fixes JDK-8348096.

Copy link
Member

@merykitty merykitty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your work, I have some suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants