8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression #26429
base: master
👋 Welcome back mhaessig! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
37cc12d to 45f5d7b
I think one possible solution is to avoid splitting through |
I would not rely solely on profile information to solve this, but it might be a good additional piece of information for the first proposed solution. |
@mhaessig You don't really need profile information, only that the profit is on the loop entry path and there is no profit on the loop backedge. |
I see, but, if I understand correctly, any logic that relates to profit will have to go into |
From the principle point of view, splitting a node through the loop |
45f5d7b to 5400c90
@merykitty, it took me a while to understand, but now I have implemented your suggestion and it works at least for the case of this issue (testing is ongoing). Thank you for pushing back. EDIT: It also fixes JDK-8348096.
Thanks a lot for your work, I have some suggestions.
A loop that calls `byteSize()` in its loop exit check does not vectorize, whereas the same loop with the loop limit lifted manually out of the exit check vectorizes. The reason is that the loop with the manually lifted limit is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi, which prevents range check elimination, and that is why the loop does not get vectorized. Before splitting through the phi, there is a check that prevents splitting `LShift`s that modify the IV of a counted loop:
jdk/src/hotspot/share/opto/loopopts.cpp, lines 1172 to 1176 in e3f85c9
Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization.
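The two loop shapes contrasted above might look like the following. This is a minimal sketch, not the reproducer from the issue: the class name `LoopShapes`, the method names, and the loop bodies are assumptions made for illustration, and it assumes JDK 22 or later, where `java.lang.foreign` is a final API. Both methods compute the same result; they differ only in where the loop limit is evaluated.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Hypothetical illustration class; not taken from the PR or its tests.
final class LoopShapes {

    // Shape 1: byteSize() is called inside the loop exit check.
    // This is the shape the description reports as not vectorizing.
    static void incrementInline(MemorySegment ms) {
        for (long i = 0; i < ms.byteSize() / 8L; i++) {
            long offset = i * 8L;
            long v = ms.get(ValueLayout.JAVA_LONG_UNALIGNED, offset);
            ms.set(ValueLayout.JAVA_LONG_UNALIGNED, offset, v + 1);
        }
    }

    // Shape 2: the loop limit is lifted manually out of the exit check.
    // This is the shape the description reports as vectorizing.
    static void incrementHoisted(MemorySegment ms) {
        long limit = ms.byteSize() / 8L;
        for (long i = 0; i < limit; i++) {
            long offset = i * 8L;
            long v = ms.get(ValueLayout.JAVA_LONG_UNALIGNED, offset);
            ms.set(ValueLayout.JAVA_LONG_UNALIGNED, offset, v + 1);
        }
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4};
        long[] b = {1, 2, 3, 4};
        incrementInline(MemorySegment.ofArray(a));
        incrementHoisted(MemorySegment.ofArray(b));
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
    }
}
```

Whether either shape actually vectorizes depends on the JDK build and C2 flags in use; the sketch only shows the source-level difference under discussion.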
So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant, and thus the loop cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization.
Possible solutions
Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This solution is implemented with `UseNewCode` as guard.
Problems
This is just another one of those exceptions from splitting through phis, and one of the more imprecise at that. So, this will prevent some `LShift`s in uncounted loops from being split even though splitting would be beneficial.
Alternatives
Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only performs loop tree building and then a round of IGVN where `Loop` nodes have been created. This cleans up the duplicated loop limit field access inside of the loop, which enables the counted loop detection in `PHASEIDEALLOOP1`.
Problems
In spirit, this is only a cleanup before the "real" work on loops begins. However, two tests fail with this solution. `TestMultiversionRemoveUselessSlowLoop.java` fails because the small loop does not get multiversioned, effectively obviating the test. More critically, `TestCombineAddPWithConstantOffsets.java` (added in #21898) detects a regression with folding address computations. The problem, as I understand it, is that `PHASEIDEALLOOP0` enables cleaning of a `long` data path that seems to be essential for folding `AddP` nodes with constant offsets.
I am unsure which solution is best and would appreciate feedback on that. Personally, I prefer the second option, since the logic of a cleanup pass before the "real" loop opts begin makes sense to me.
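The effect of loop-head cloning described above can be pictured at the source level. The sketch below is only an analogy: C2 operates on bytecode blocks and its IR, not on Java source, and the class `LoopHeadCloning` and the `LongSupplier` standing in for `byteSize()` are hypothetical names chosen for illustration. It shows how one exit check becomes two copies, one on the loop entry path and one inside the loop, so the limit inside the loop is recomputed on every iteration until a later cleanup merges the copies again.

```java
import java.util.function.LongSupplier;

// Hypothetical source-level analogy; not how C2 represents the loop.
final class LoopHeadCloning {

    // As written: a single loop head that evaluates the limit.
    static long countOriginal(LongSupplier byteSize) {
        long iterations = 0;
        for (long i = 0; i < byteSize.getAsLong() / 8L; i++) {
            iterations++;
        }
        return iterations;
    }

    // After cloning the loop head (conceptually): the original check runs
    // once on entry, and a second copy of the check sits inside the loop.
    static long countHeadCloned(LongSupplier byteSize) {
        long iterations = 0;
        long i = 0;
        if (i < byteSize.getAsLong() / 8L) {           // check on the entry path
            do {
                iterations++;
                i++;
            } while (i < byteSize.getAsLong() / 8L);   // cloned check in the loop
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(countOriginal(() -> 32L));
        System.out.println(countHeadCloned(() -> 32L));
    }
}
```

Both methods perform the same number of iterations for the same input; the difference is only in where the limit expression appears, which is the property that matters for counted loop detection in the discussion above.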
Progress
Issue
Reviewing
Using `git`
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26429/head:pull/26429
$ git checkout pull/26429
Update a local copy of the PR:
$ git checkout pull/26429
$ git pull https://git.openjdk.org/jdk.git pull/26429/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26429
View PR using the GUI difftool:
$ git pr show -t 26429
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26429.diff