-
Notifications
You must be signed in to change notification settings - Fork 5k
OSR support for Arm64 #62831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OSR support for Arm64 #62831
Conversation
Enable OSR for Arm64: * rename `FpToSpDelta` in `PatchpointInfo` to `TotalFrameSize` so it makes sense for both x64 and Arm64. * make it clear that the local offsets in `PatchpointInfo` are virtual (relative to the top of the frame). Adjust recorded Arm64 offsets to match. * add new jit config setting `JitEnableOSRRange` to allow selectively enabling OSR for only a subset of jitted methods. * Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point either at top or bottom of Tier0 frame depending on frame type. For Arm64 the OSR method establishes its own FP that chains to caller FP. (I will likely revise x64 to work this way too, as it makes simple FP chain stackwalks work out better). * The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame. * The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for being passed the Tier0 SP. * local/arg init for the OSR frame from the Tier0 frame was moved into a new method `genEnregisterOSRArgsAndLocals`; implemented the ARM64 version. * sequencing of this initialization in the prolog reordered slightly to prevent inadvertent clobbering. * comments/code that referred to `original` or `old` method were revised to instead try and consistently use `tier0`. * support the mixed-altjit case for OSR by allocating a local copy of the patchpoint info with similar and plausible information. * add symbol table note for locals in OSR methods that live on the Tier0 frame. * broaden the jit-experimental testing matrix to include Arm64.
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsEnable OSR for Arm64:
|
cc @dotnet/jit-contrib No diff for x64, but we will see "text diffs" for OSR methods for x64 as the prolog length is now properly described in the assembly summary (it was ok in the unwind).
Will run the newly extended jit experimental tests once the default testing is looking solid. There are some known failures there so will have to sort through the results to see if there's anything unexpected. I suspect my Arm64 codegen for large offsets needs some adjusting (in both prolog and epilog). |
Sample codegen (note the two SP adjusts in the epilog and accesses to the Tier0 frame): ; Assembly listing for method AddressExposedLocal:F(int,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; Tier-1 compilation
; OSR variant for entry point 0xd
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) int -> zero-ref single-def
; V01 arg1 [V01,T01] ( 3, 10 ) int -> x1 single-def
; V02 loc0 [V02 ] ( 3, 17 ) int -> [fp+30H] do-not-enreg[X] addr-exposed ld-addr-op tier0-frame
; V03 loc1 [V03,T00] ( 4, 32 ) int -> x0
;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M35369_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
B9403BA1 ldr w1, [fp,#56]
B9402FA0 ldr w0, [fp,#44]
;; bbWeight=1 PerfScore 5.50
G_M35369_IG02: ;; offset=0010H
B94033A2 ldr w2, [fp,#48]
0B000042 add w2, w2, w0
B90033A2 str w2, [fp,#48] // [V02 loc0]
11000400 add w0, w0, #1
6B01001F cmp w0, w1
54FFFF6B blt G_M35369_IG02
;; bbWeight=8 PerfScore 44.00
G_M35369_IG03: ;; offset=0028H
B94033A0 ldr w0, [fp,#48] // [V02 loc0]
;; bbWeight=1 PerfScore 2.00
G_M35369_IG04: ;; offset=002CH
A8C17BFD ldp fp, lr, [sp],#16
9100C3FF add sp, sp, #48
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.50
; Total bytes of code 56, prolog size 16, PerfScore 59.60, instruction count 14, allocated bytes for code 56 (MethodHash=35dd75d6) for method AddressExposedLocal:F(int,int):int |
/azp run runtime-jit-experimental |
Azure Pipelines successfully started running 1 pipeline(s). |
There are a few EH test failures with partial compilation enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple suggestions.
Otherwise, LGTM
JITDUMP("Extra SP adjust for OSR to pop off Tier0 frame: %d bytes\n", tier0FrameSize); | ||
|
||
// Tier0 size may exceed simple immediate. We're in the epilog so not clear if we can | ||
// use a scratch reg. So just do two subtracts if necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: two adds
src/coreclr/jit/compiler.cpp
Outdated
// If are an altjit and have patchpoint info, we might need to tweak the frame size | ||
// so it's plausible for the altjit architecture. | ||
// | ||
if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR)) | |
if (!info.compMatchedVM && compileFlags->IsSet(JitFlags::JIT_FLAG_OSR)) |
Shouldn't you also only do this if we know we're running against a mismatched VM (aka, as a cross-altjit)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for spotting that. My initial version of this was up under where we checked for the mismatch.
One of the partial comp failures was a large offset case in the prolog. Fixed this to use the The other partial comp failures are cases where we throw from a catch in a method with a generic catch. Seems to suggest either something is off with the PSPSym or the reporting of the generic context in GcInfo. The catch is seemingly entered properly, but things go off the rails when it throws. This only happens for catches with shared catch clause types (eg ref class instantiations), again suggesting the runtime is unable to find the right information about the actual type being caught. Per |
Simple repro case (for partial comp): using System;
public class Exception<T> : Exception
{
}
class X
{
static int x = -1;
public static int Main()
{
F<string>();
return x;
}
public static void F<T>()
{
try
{
try
{
throw new Exception<T>();
}
catch(Exception)
{
throw;
}
}
catch (Exception<T>)
{
x = 100;
}
}
} currently fails with:
|
I think what's going on is that there is an expectation that the PSP sym for both main method and funclet is the same delta from CallerSP. There is just one offset for this in the gc info. This isn't the case currently; the OSR method delta includes the Tier0 frame size but we don't pad out the OSR funclet frame to match. So we end up with a situation where we report PSPSym offset in gc info as -80 from Caller SP. For the OSR frame this is (-16 osr part of frame) + (-64 tier0 frame) and is fine. But this offset doesn't work for the funclet; it's PSP sym is at (-16) with respect to Caller SP. So when we try and find the PSP sym value for the funclet we end up with the wrong value and things go downhill from there. Options are to either pad the funclet frame or to have the OSR frame use the Tier0 frame's PSP Sym slot. The latter seems appealing but may not be possible as the PSP sym location varies depending on which regs get saved, and this differs between Tier0 frame and OSR frame (eg in this case the Tier0 PSP sym offset is at (-8).). If we tried to normalize for that by saving PSPSym first after varargs we might run into issues because the Tier0 frame spills varargs at top of frame and there's no reason for the OSR frame to do this (but see note). So as ugly as it seems, we may need to pad the funclet frame to include a phantom tier0 frame. |
Still thinking about the funclet PSP issue (see note) but pushed the other fixes I have to get wider testing on them. |
Arm64 varargs may also need looking into. May simply be the OSR method doesn't need to allocate and populate its own varargs spill space as the spilling done by Tier0 should suffice. So might just be an optimization. Note funclet frames currently reserve space for varargs that they don't spill. So there is a precedent of sorts. |
/azp run runtime-jit-experimental |
Azure Pipelines successfully started running 1 pipeline(s). |
Padding the OSR funclets mostly works, it seems. It causes one SP adjust to get too large to be encoded by a type3 frame, and there's one other failure. |
The other failure was in I've pushed the funclet padding approach further -- while ugly -- it seems to be generally corrrect and passing the tests we have, but I think in the long run we'll want a different approach. For instance, if the tier0 frame is large we would need to probe in the funclet prolog; I am not doing that yet. JIT/Regression/JitBlue/GitHub_21990 + partial compilation is a good stress test here -- a very large tier0 frame inducing very large padded funclets under OSR. It seems like we could allocate the PSPSym above the saved regs (either top of frame or just below the varargs area). That way the OSR frame could "reuse" the tier0 frame's PSP sym and the OSR funclet frames would just pad for varargs like they do now and everone would agree on the caller-relative offset. So PSPSym would always be at offset -8 or -40 depending. This would also potentially free up the restriction that funclets must save / restore the same set of registers, if we ever wanted to go after smaller funclet prolog/epilogs. @BruceForstall any gotchas you can think of in changing this aspect of arm64 frame gen? If you think moving the PSPSym up is viable, I'll work it up as a separate change in the new year. In the meantime, I'll push up these changes and (since they only impact OSR) propose checking things in as they are now as it is mostly functional. |
/azp run runtime-jit-experimental |
Azure Pipelines successfully started running 1 pipeline(s). |
fyi, the CLR ABI doc talks about PSPSym here: https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md#the-pspsym-and-funclet-parameters. If there are any additional clarifications we should make, e.g., to design questions, it would be good to update that location. I think the main reason the PSPSym lives at a lower address than the callee saved registers is that establishing the PSPSym is not part of the OS unwind prolog, however, the allocated space and the code to write the PSPSym don't have to be that tightly coupled. So, even if we left space for it at a high location, we wouldn't want to generate the code to store it until after the unwind prolog. It's stored with an offset from SP, so you'd want to ensure the positive offset from RSP to the PSPSym slot was always encodable. Also, we want to make sure the epilog codes are a mirror of the prolog codes as much as possible, so you'd want to check the prolog/epilog sequence so the codes would match. It probably doesn't need to match location (above versus below callee saved registers) on ARM32, as long as the code and comments are clear where it lives in each case. It doesn't seem like there are any blocking issues to move it up. |
Still seeing some encoding failures with OSR / PC:
Also some minor diffs when OSR/PC is not enabled. Will dig in... |
Those unexpected diffs are coming from |
/azp run runtime-jit-experimental |
Azure Pipelines successfully started running 1 pipeline(s). |
SPMI back to no diffs (text diffs in some OSR methods for windows x64). |
Looks like moving the PSP sym up breaks assumptions for packed unwind, namely that one can compute the offset of the callee saves area by just looking at some header bits. And it appears the jit only knows how to emit packed unwind. https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#packed-unwind-data shows one such computation for So options here are to either emit more complex unwind for methods with EH (or just for OSR methods with EH) or else keep pushing on fixing issues with large padded funclet frames. |
This seems wrong; the JIT never generates the packed unwind format. The JIT always generates .xdata, and the VM constructs the appropriate .pdata. |
Thanks -- looks like I was confused then about why my changes were failing. I'd still like to try and move the PSP sym, so I should revisit this. |
Enable OSR for Arm64:
FpToSpDelta
inPatchpointInfo
toTotalFrameSize
so it makessense for both x64 and Arm64.
PatchpointInfo
are virtual(relative to the top of the frame). Adjust recorded Arm64 offsets to match.
JitEnableOSRRange
to allow selectively enablingOSR for only a subset of jitted methods.
either at top or bottom of Tier0 frame depending on frame type. For Arm64 the
OSR method establishes its own FP that chains to caller FP. (I will likely
revise x64 to work this way too, as it makes simple FP chain stackwalks work
out better).
being passed the Tier0 SP.
method
genEnregisterOSRArgsAndLocals
; implemented the ARM64 version.inadvertent clobbering.
original
orold
method were revised toinstead try and consistently use
tier0
.patchpoint info with similar and plausible information.