OSR support for Arm64 #62831

AndyAyersMS · 2021-12-15T02:09:32Z

Enable OSR for Arm64:

rename FpToSpDelta in PatchpointInfo to TotalFrameSize so it makes
sense for both x64 and Arm64.
make it clear that the local offsets in PatchpointInfo are virtual
(relative to the top of the frame). Adjust recorded Arm64 offsets to match.
add new jit config setting JitEnableOSRRange to allow selectively enabling
OSR for only a subset of jitted methods.
Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point
either at top or bottom of Tier0 frame depending on frame type. For Arm64 the
OSR method establishes its own FP that chains to caller FP. (I will likely
revise x64 to work this way too, as it makes simple FP chain stackwalks work
out better).
The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame.
The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for
being passed the Tier0 SP.
local/arg init for the OSR frame from the Tier0 frame was moved into a new
method genEnregisterOSRArgsAndLocals; implemented the ARM64 version.
sequencing of this initialization in the prolog reordered slightly to prevent
inadvertent clobbering.
comments/code that referred to original or old method were revised to
instead try and consistently use tier0.
support the mixed-altjit case for OSR by allocating a local copy of the
patchpoint info with similar and plausible information.
add symbol table note for locals in OSR methods that live on the Tier0 frame.
broaden the jit-experimental testing matrix to include Arm64.

Enable OSR for Arm64: * rename `FpToSpDelta` in `PatchpointInfo` to `TotalFrameSize` so it makes sense for both x64 and Arm64. * make it clear that the local offsets in `PatchpointInfo` are virtual (relative to the top of the frame). Adjust recorded Arm64 offsets to match. * add new jit config setting `JitEnableOSRRange` to allow selectively enabling OSR for only a subset of jitted methods. * Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point either at top or bottom of Tier0 frame depending on frame type. For Arm64 the OSR method establishes its own FP that chains to caller FP. (I will likely revise x64 to work this way too, as it makes simple FP chain stackwalks work out better). * The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame. * The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for being passed the Tier0 SP. * local/arg init for the OSR frame from the Tier0 frame was moved into a new method `genEnregisterOSRArgsAndLocals`; implemented the ARM64 version. * sequencing of this initialization in the prolog reordered slightly to prevent inadvertent clobbering. * comments/code that referred to `original` or `old` method were revised to instead try and consistently use `tier0`. * support the mixed-altjit case for OSR by allocating a local copy of the patchpoint info with similar and plausible information. * add symbol table note for locals in OSR methods that live on the Tier0 frame. * broaden the jit-experimental testing matrix to include Arm64.

ghost · 2021-12-15T02:09:39Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Enable OSR for Arm64:

rename FpToSpDelta in PatchpointInfo to TotalFrameSize so it makes
sense for both x64 and Arm64.
make it clear that the local offsets in PatchpointInfo are virtual
(relative to the top of the frame). Adjust recorded Arm64 offsets to match.
add new jit config setting JitEnableOSRRange to allow selectively enabling
OSR for only a subset of jitted methods.
Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point
either at top or bottom of Tier0 frame depending on frame type. For Arm64 the
OSR method establishes its own FP that chains to caller FP. (I will likely
revise x64 to work this way too, as it makes simple FP chain stackwalks work
out better).
The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame.
The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for
being passed the Tier0 SP.
local/arg init for the OSR frame from the Tier0 frame was moved into a new
method genEnregisterOSRArgsAndLocals; implemented the ARM64 version.
sequencing of this initialization in the prolog reordered slightly to prevent
inadvertent clobbering.
comments/code that referred to original or old method were revised to
instead try and consistently use tier0.
support the mixed-altjit case for OSR by allocating a local copy of the
patchpoint info with similar and plausible information.
add symbol table note for locals in OSR methods that live on the Tier0 frame.
broaden the jit-experimental testing matrix to include Arm64.

Author:	AndyAyersMS
Assignees:	AndyAyersMS
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2021-12-15T02:15:07Z

cc @dotnet/jit-contrib

No diff for x64, but we will see "text diffs" for OSR methods for x64 as the prolog length is now properly described in the assembly summary (it was ok in the unwind).

~~Arm64 SPMI will show diffs as OSR method contexts will now be handled via OSR rather than silently enabling Tier1.~~ [evidently not -- as we don't try and run as an aljit].

Will run the newly extended jit experimental tests once the default testing is looking solid. There are some known failures there so will have to sort through the results to see if there's anything unexpected.

I suspect my Arm64 codegen for large offsets needs some adjusting (in both prolog and epilog).

AndyAyersMS · 2021-12-15T02:25:42Z

Sample codegen (note the two SP adjusts in the epilog and accesses to the Tier0 frame):

; Assembly listing for method AddressExposedLocal:F(int,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; Tier-1 compilation
; OSR variant for entry point 0xd
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )     int  ->  zero-ref    single-def
;  V01 arg1         [V01,T01] (  3, 10   )     int  ->   x1         single-def
;  V02 loc0         [V02    ] (  3, 17   )     int  ->  [fp+30H]   do-not-enreg[X] addr-exposed ld-addr-op tier0-frame
;  V03 loc1         [V03,T00] (  4, 32   )     int  ->   x0        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M35369_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
        B9403BA1          ldr     w1, [fp,#56]
        B9402FA0          ldr     w0, [fp,#44]
                                                ;; bbWeight=1    PerfScore 5.50
G_M35369_IG02:              ;; offset=0010H
        B94033A2          ldr     w2, [fp,#48]
        0B000042          add     w2, w2, w0
        B90033A2          str     w2, [fp,#48]  // [V02 loc0]
        11000400          add     w0, w0, #1
        6B01001F          cmp     w0, w1
        54FFFF6B          blt     G_M35369_IG02
                                                ;; bbWeight=8    PerfScore 44.00
G_M35369_IG03:              ;; offset=0028H
        B94033A0          ldr     w0, [fp,#48]  // [V02 loc0]
                                                ;; bbWeight=1    PerfScore 2.00
G_M35369_IG04:              ;; offset=002CH
        A8C17BFD          ldp     fp, lr, [sp],#16
        9100C3FF          add     sp, sp, #48
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.50

; Total bytes of code 56, prolog size 16, PerfScore 59.60, instruction count 14, allocated bytes for code 56 (MethodHash=35dd75d6) for method AddressExposedLocal:F(int,int):int

AndyAyersMS · 2021-12-15T02:52:00Z

/azp run runtime-jit-experimental

azure-pipelines · 2021-12-15T02:52:15Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-15T18:33:08Z

There are a few EH test failures with partial compilation enabled.

BruceForstall

A couple suggestions.

Otherwise, LGTM

BruceForstall · 2021-12-15T19:08:55Z

src/coreclr/jit/codegencommon.cpp

+        JITDUMP("Extra SP adjust for OSR to pop off Tier0 frame: %d bytes\n", tier0FrameSize);
+
+        // Tier0 size may exceed simple immediate. We're in the epilog so not clear if we can
+        // use a scratch reg. So just do two subtracts if necessary.


nit: two adds

BruceForstall · 2021-12-15T19:18:10Z

src/coreclr/jit/compiler.cpp

+    // If are an altjit and have patchpoint info, we might need to tweak the frame size
+    // so it's plausible for the altjit architecture.
+    //
+    if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))


Suggested change

if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))

if (!info.compMatchedVM && compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))

Shouldn't you also only do this if we know we're running against a mismatched VM (aka, as a cross-altjit)?

Yes, thanks for spotting that. My initial version of this was up under where we checked for the mismatch.

AndyAyersMS · 2021-12-15T20:09:57Z

One of the partial comp failures was a large offset case in the prolog. Fixed this to use the initReg like we do elsewhere to hold larger offsets.

The other partial comp failures are cases where we throw from a catch in a method with a generic catch. Seems to suggest either something is off with the PSPSym or the reporting of the generic context in GcInfo. The catch is seemingly entered properly, but things go off the rails when it throws. This only happens for catches with shared catch clause types (eg ref class instantiations), again suggesting the runtime is unable to find the right information about the actual type being caught.

Per genSetPSPSym, PSPSym value is different for Arm64 than for x64, which I missed in my initial commit above. In my local fork I have updated it to what I think is the right value (CallerSP) but the tests are still not passing. So, need to keep digging.

AndyAyersMS · 2021-12-15T20:28:15Z

Simple repro case (for partial comp):

using System;

public class Exception<T> : Exception
{
}

class X
{
    static int x = -1;
    public static int Main()
    {
        F<string>();
        return x;
    }
    
    public static void F<T>()
    {
        try 
        {
            try 
            {
                throw new Exception<T>();
            }
            catch(Exception) 
            {
                throw;
            }
        }
        catch (Exception<T>) 
        {
            x = 100;
        }
    }
}

currently fails with:

TID 4399c: Jit_PartialCompilationPatchpoint: patchpoint [1] (0x0000FFFF3279AE08) TRIGGER
TID 4399c: JitPatchpointWorker: creating OSR version of Method=0x0000FFFF32B0FC78M (X::X.F[System.__Canon]()) at offset 0
TID 4399c: Jit_PartialCompilationPatchpoint: patchpoint [1] (0x0000FFFF3279AE08) TRANSITION to ip 0x0000FFFF3279AEA0

Assert failure(PID 276892 [0x0004399c], Thread: 276892 [0x4399c]): !"Cannot return exact class instantiation when we are requested to."
    File: /home/andy/repos/runtime/src/coreclr/vm/stackwalk.cpp Line: 258
    Image: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm64.Checked/Tests/Core_Root/corerun

AndyAyersMS · 2021-12-15T23:29:49Z

I think what's going on is that there is an expectation that the PSP sym for both main method and funclet is the same delta from CallerSP. There is just one offset for this in the gc info.

This isn't the case currently; the OSR method delta includes the Tier0 frame size but we don't pad out the OSR funclet frame to match.

So we end up with a situation where we report PSPSym offset in gc info as -80 from Caller SP. For the OSR frame this is (-16 osr part of frame) + (-64 tier0 frame) and is fine.

But this offset doesn't work for the funclet; it's PSP sym is at (-16) with respect to Caller SP. So when we try and find the PSP sym value for the funclet we end up with the wrong value and things go downhill from there.

Options are to either pad the funclet frame or to have the OSR frame use the Tier0 frame's PSP Sym slot. The latter seems appealing but may not be possible as the PSP sym location varies depending on which regs get saved, and this differs between Tier0 frame and OSR frame (eg in this case the Tier0 PSP sym offset is at (-8).). If we tried to normalize for that by saving PSPSym first after varargs we might run into issues because the Tier0 frame spills varargs at top of frame and there's no reason for the OSR frame to do this (but see note).

So as ugly as it seems, we may need to pad the funclet frame to include a phantom tier0 frame.

AndyAyersMS · 2021-12-16T01:44:08Z

Still thinking about the funclet PSP issue (see note) but pushed the other fixes I have to get wider testing on them.

AndyAyersMS · 2021-12-16T02:03:40Z

Arm64 varargs may also need looking into. May simply be the OSR method doesn't need to allocate and populate its own varargs spill space as the spilling done by Tier0 should suffice. So might just be an optimization.

Note funclet frames currently reserve space for varargs that they don't spill. So there is a precedent of sorts.

AndyAyersMS · 2021-12-16T03:12:48Z

/azp run runtime-jit-experimental

azure-pipelines · 2021-12-16T03:13:03Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-16T03:33:50Z

Padding the OSR funclets mostly works, it seems. It causes one SP adjust to get too large to be encoded by a type3 frame, and there's one other failure.

AndyAyersMS · 2021-12-16T21:14:42Z

Padding the OSR funclets mostly works, it seems. It causes one SP adjust to get too large to be encoded by a type3 frame, and there's one other failure

The other failure was in fgIsThrowHlpBlk not handling an empty block. Didn't root cause this as it seems like this utility should handle the empty block case.

I've pushed the funclet padding approach further -- while ugly -- it seems to be generally corrrect and passing the tests we have, but I think in the long run we'll want a different approach. For instance, if the tier0 frame is large we would need to probe in the funclet prolog; I am not doing that yet.

JIT/Regression/JitBlue/GitHub_21990 + partial compilation is a good stress test here -- a very large tier0 frame inducing very large padded funclets under OSR.

It seems like we could allocate the PSPSym above the saved regs (either top of frame or just below the varargs area). That way the OSR frame could "reuse" the tier0 frame's PSP sym and the OSR funclet frames would just pad for varargs like they do now and everone would agree on the caller-relative offset. So PSPSym would always be at offset -8 or -40 depending. This would also potentially free up the restriction that funclets must save / restore the same set of registers, if we ever wanted to go after smaller funclet prolog/epilogs.

@BruceForstall any gotchas you can think of in changing this aspect of arm64 frame gen? If you think moving the PSPSym up is viable, I'll work it up as a separate change in the new year.

In the meantime, I'll push up these changes and (since they only impact OSR) propose checking things in as they are now as it is mostly functional.

AndyAyersMS · 2021-12-16T22:17:38Z

/azp run runtime-jit-experimental

azure-pipelines · 2021-12-16T22:17:52Z

Azure Pipelines successfully started running 1 pipeline(s).

BruceForstall · 2021-12-16T23:56:13Z

fyi, the CLR ABI doc talks about PSPSym here: https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md#the-pspsym-and-funclet-parameters. If there are any additional clarifications we should make, e.g., to design questions, it would be good to update that location.

I think the main reason the PSPSym lives at a lower address than the callee saved registers is that establishing the PSPSym is not part of the OS unwind prolog, however, the allocated space and the code to write the PSPSym don't have to be that tightly coupled. So, even if we left space for it at a high location, we wouldn't want to generate the code to store it until after the unwind prolog. It's stored with an offset from SP, so you'd want to ensure the positive offset from RSP to the PSPSym slot was always encodable. Also, we want to make sure the epilog codes are a mirror of the prolog codes as much as possible, so you'd want to check the prolog/epilog sequence so the codes would match.

It probably doesn't need to match location (above versus below callee saved registers) on ARM32, as long as the code and comments are clear where it lives in each case.

It doesn't seem like there are any blocking issues to move it up.

AndyAyersMS · 2021-12-17T01:12:02Z

Still seeing some encoding failures with OSR / PC:

Assert failure(PID 48 [0x00000030], Thread: 48 [0x0030]): Assertion failed '!"Instruction cannot be encoded: IF_LS_3C"' in 'Microsoft.CodeAnalysis.CSharp.Symbols.Metadata.PE.PENamedTypeSymbol:LoadMembers():this' during 'Generate code' (IL size 710)

Also some minor diffs when OSR/PC is not enabled.

Will dig in...

AndyAyersMS · 2021-12-17T02:10:16Z

Those unexpected diffs are coming from fgIsThrowHlpBlk -- it needs to check differently for LIR.

AndyAyersMS · 2021-12-17T03:11:19Z

/azp run runtime-jit-experimental

azure-pipelines · 2021-12-17T03:11:34Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-17T16:28:58Z

SPMI back to no diffs (text diffs in some OSR methods for windows x64).
Jit-Experimental showing "expected" failures in x64 R2R tests. All arm64 testing passed.

AndyAyersMS · 2022-01-07T18:56:13Z

Looks like moving the PSP sym up breaks assumptions for packed unwind, namely that one can compute the offset of the callee saves area by just looking at some header bits. And it appears the jit only knows how to emit packed unwind.

https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#packed-unwind-data shows one such computation for #savsz.

So options here are to either emit more complex unwind for methods with EH (or just for OSR methods with EH) or else keep pushing on fixing issues with large padded funclet frames.

BruceForstall · 2022-01-14T06:08:19Z

This seems wrong; the JIT never generates the packed unwind format. The JIT always generates .xdata, and the VM constructs the appropriate .pdata.

AndyAyersMS · 2022-01-14T17:38:58Z

Thanks -- looks like I was confused then about why my changes were failing. I'd still like to try and move the PSP sym, so I should revisit this.

ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 15, 2021

ghost assigned AndyAyersMS Dec 15, 2021

AndyAyersMS mentioned this pull request Dec 15, 2021

On Stack Replacement Next Steps #33658

Open

72 tasks

BruceForstall approved these changes Dec 15, 2021

View reviewed changes

AndyAyersMS added 2 commits December 15, 2021 17:35

fix prolog offset limitation; fix pspsym value

d777cbf

fix altjit patchpoint logic to check for mismatched vm

25eed48

AndyAyersMS added 2 commits December 16, 2021 13:16

fix fgIsThrowHlpBlk

83ba2eb

handle padded osr funclet frames

3a15f40

type 1 funclets can have large fiSpDelta1 too

d3d3529

fix fgIsThrowHlpBlk

522f03a

AndyAyersMS merged commit 0206926 into dotnet:main Dec 17, 2021

JulieLeeMSFT mentioned this pull request Jan 25, 2022

What's new in .NET 7 Preview 1 [WIP] dotnet/core#7106

Closed

ghost locked as resolved and limited conversation to collaborators Feb 13, 2022

	if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))
	if (!info.compMatchedVM && compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))

OSR support for Arm64 #62831

OSR support for Arm64 #62831

Uh oh!

Conversation

AndyAyersMS commented Dec 15, 2021

Uh oh!

ghost commented Dec 15, 2021

Uh oh!

AndyAyersMS commented Dec 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndyAyersMS commented Dec 15, 2021

Uh oh!

AndyAyersMS commented Dec 15, 2021

Uh oh!

azure-pipelines bot commented Dec 15, 2021

Uh oh!

AndyAyersMS commented Dec 15, 2021

Uh oh!

BruceForstall left a comment

Choose a reason for hiding this comment

Uh oh!

BruceForstall Dec 15, 2021

Choose a reason for hiding this comment

Uh oh!

BruceForstall Dec 15, 2021

Choose a reason for hiding this comment

Uh oh!

AndyAyersMS Dec 15, 2021

Choose a reason for hiding this comment

Uh oh!

AndyAyersMS commented Dec 15, 2021

Uh oh!

AndyAyersMS commented Dec 15, 2021

Uh oh!

AndyAyersMS commented Dec 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndyAyersMS commented Dec 16, 2021

Uh oh!

AndyAyersMS commented Dec 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndyAyersMS commented Dec 16, 2021

Uh oh!

azure-pipelines bot commented Dec 16, 2021

Uh oh!

AndyAyersMS commented Dec 16, 2021

Uh oh!

AndyAyersMS commented Dec 16, 2021

Uh oh!

AndyAyersMS commented Dec 16, 2021

Uh oh!

azure-pipelines bot commented Dec 16, 2021

Uh oh!

BruceForstall commented Dec 16, 2021

Uh oh!

AndyAyersMS commented Dec 17, 2021

Uh oh!

AndyAyersMS commented Dec 17, 2021

Uh oh!

AndyAyersMS commented Dec 17, 2021

Uh oh!

azure-pipelines bot commented Dec 17, 2021

Uh oh!

AndyAyersMS commented Dec 17, 2021

Uh oh!

AndyAyersMS commented Jan 7, 2022

Uh oh!

BruceForstall commented Jan 14, 2022

Uh oh!

AndyAyersMS commented Jan 14, 2022

Uh oh!

Uh oh!

AndyAyersMS commented Dec 15, 2021 •

edited

Loading

AndyAyersMS commented Dec 15, 2021 •

edited

Loading

AndyAyersMS commented Dec 16, 2021 •

edited

Loading