Skip to content

OSR support for Arm64 #62831

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Dec 17, 2021
Merged

OSR support for Arm64 #62831

merged 7 commits into from
Dec 17, 2021

Conversation

AndyAyersMS
Copy link
Member

Enable OSR for Arm64:

  • rename FpToSpDelta in PatchpointInfo to TotalFrameSize so it makes
    sense for both x64 and Arm64.
  • make it clear that the local offsets in PatchpointInfo are virtual
    (relative to the top of the frame). Adjust recorded Arm64 offsets to match.
  • add new jit config setting JitEnableOSRRange to allow selectively enabling
    OSR for only a subset of jitted methods.
  • Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point
    either at top or bottom of Tier0 frame depending on frame type. For Arm64 the
    OSR method establishes its own FP that chains to caller FP. (I will likely
    revise x64 to work this way too, as it makes simple FP chain stackwalks work
    out better).
  • The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame.
  • The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for
    being passed the Tier0 SP.
  • local/arg init for the OSR frame from the Tier0 frame was moved into a new
    method genEnregisterOSRArgsAndLocals; implemented the ARM64 version.
  • sequencing of this initialization in the prolog reordered slightly to prevent
    inadvertent clobbering.
  • comments/code that referred to original or old method were revised to
    instead try and consistently use tier0.
  • support the mixed-altjit case for OSR by allocating a local copy of the
    patchpoint info with similar and plausible information.
  • add symbol table note for locals in OSR methods that live on the Tier0 frame.
  • broaden the jit-experimental testing matrix to include Arm64.

Enable OSR for Arm64:
* rename `FpToSpDelta` in `PatchpointInfo` to `TotalFrameSize` so it makes
sense for both x64 and Arm64.
* make it clear that the local offsets in `PatchpointInfo` are virtual
(relative to the top of the frame). Adjust recorded Arm64 offsets to match.
* add new jit config setting `JitEnableOSRRange` to allow selectively enabling
OSR for only a subset of jitted methods.
* Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point
either at top or bottom of Tier0 frame depending on frame type. For Arm64 the
OSR method establishes its own FP that chains to caller FP. (I will likely
revise x64 to work this way too, as it makes simple FP chain stackwalks work
out better).
* The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame.
* The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for
being passed the Tier0 SP.
* local/arg init for the OSR frame from the Tier0 frame was moved into a new
method `genEnregisterOSRArgsAndLocals`; implemented the ARM64 version.
* sequencing of this initialization in the prolog reordered slightly to prevent
inadvertent clobbering.
* comments/code that referred to `original` or `old` method were revised to
instead try and consistently use `tier0`.
* support the mixed-altjit case for OSR by allocating a local copy of the
patchpoint info with similar and plausible information.
* add symbol table note for locals in OSR methods that live on the Tier0 frame.
* broaden the jit-experimental testing matrix to include Arm64.
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 15, 2021
@ghost ghost assigned AndyAyersMS Dec 15, 2021
@ghost
Copy link

ghost commented Dec 15, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Enable OSR for Arm64:

  • rename FpToSpDelta in PatchpointInfo to TotalFrameSize so it makes
    sense for both x64 and Arm64.
  • make it clear that the local offsets in PatchpointInfo are virtual
    (relative to the top of the frame). Adjust recorded Arm64 offsets to match.
  • add new jit config setting JitEnableOSRRange to allow selectively enabling
    OSR for only a subset of jitted methods.
  • Arm64 OSR method is passed Tier0 SP but not Tier0 FP, as Tier0 FP can point
    either at top or bottom of Tier0 frame depending on frame type. For Arm64 the
    OSR method establishes its own FP that chains to caller FP. (I will likely
    revise x64 to work this way too, as it makes simple FP chain stackwalks work
    out better).
  • The Arm64 OSR epilog gets an extra SP adjustment to remove the Tier0 frame.
  • The Arm64 OSR prolog gets a phantom SP adjustment in unwind to account for
    being passed the Tier0 SP.
  • local/arg init for the OSR frame from the Tier0 frame was moved into a new
    method genEnregisterOSRArgsAndLocals; implemented the ARM64 version.
  • sequencing of this initialization in the prolog reordered slightly to prevent
    inadvertent clobbering.
  • comments/code that referred to original or old method were revised to
    instead try and consistently use tier0.
  • support the mixed-altjit case for OSR by allocating a local copy of the
    patchpoint info with similar and plausible information.
  • add symbol table note for locals in OSR methods that live on the Tier0 frame.
  • broaden the jit-experimental testing matrix to include Arm64.
Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Dec 15, 2021

cc @dotnet/jit-contrib

No diff for x64, but we will see "text diffs" for OSR methods for x64 as the prolog length is now properly described in the assembly summary (it was ok in the unwind).

Arm64 SPMI will show diffs as OSR method contexts will now be handled via OSR rather than silently enabling Tier1. [evidently not -- as we don't try and run as an aljit].

Will run the newly extended jit experimental tests once the default testing is looking solid. There are some known failures there so will have to sort through the results to see if there's anything unexpected.

I suspect my Arm64 codegen for large offsets needs some adjusting (in both prolog and epilog).

@AndyAyersMS
Copy link
Member Author

Sample codegen (note the two SP adjusts in the epilog and accesses to the Tier0 frame):

; Assembly listing for method AddressExposedLocal:F(int,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; Tier-1 compilation
; OSR variant for entry point 0xd
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )     int  ->  zero-ref    single-def
;  V01 arg1         [V01,T01] (  3, 10   )     int  ->   x1         single-def
;  V02 loc0         [V02    ] (  3, 17   )     int  ->  [fp+30H]   do-not-enreg[X] addr-exposed ld-addr-op tier0-frame
;  V03 loc1         [V03,T00] (  4, 32   )     int  ->   x0        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M35369_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
        B9403BA1          ldr     w1, [fp,#56]
        B9402FA0          ldr     w0, [fp,#44]
                                                ;; bbWeight=1    PerfScore 5.50
G_M35369_IG02:              ;; offset=0010H
        B94033A2          ldr     w2, [fp,#48]
        0B000042          add     w2, w2, w0
        B90033A2          str     w2, [fp,#48]  // [V02 loc0]
        11000400          add     w0, w0, #1
        6B01001F          cmp     w0, w1
        54FFFF6B          blt     G_M35369_IG02
                                                ;; bbWeight=8    PerfScore 44.00
G_M35369_IG03:              ;; offset=0028H
        B94033A0          ldr     w0, [fp,#48]  // [V02 loc0]
                                                ;; bbWeight=1    PerfScore 2.00
G_M35369_IG04:              ;; offset=002CH
        A8C17BFD          ldp     fp, lr, [sp],#16
        9100C3FF          add     sp, sp, #48
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.50

; Total bytes of code 56, prolog size 16, PerfScore 59.60, instruction count 14, allocated bytes for code 56 (MethodHash=35dd75d6) for method AddressExposedLocal:F(int,int):int

@AndyAyersMS
Copy link
Member Author

/azp run runtime-jit-experimental

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

There are a few EH test failures with partial compilation enabled.

Copy link
Contributor

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple suggestions.

Otherwise, LGTM

JITDUMP("Extra SP adjust for OSR to pop off Tier0 frame: %d bytes\n", tier0FrameSize);

// Tier0 size may exceed simple immediate. We're in the epilog so not clear if we can
// use a scratch reg. So just do two subtracts if necessary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: two adds

// If are an altjit and have patchpoint info, we might need to tweak the frame size
// so it's plausible for the altjit architecture.
//
if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))
if (!info.compMatchedVM && compileFlags->IsSet(JitFlags::JIT_FLAG_OSR))

Shouldn't you also only do this if we know we're running against a mismatched VM (aka, as a cross-altjit)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for spotting that. My initial version of this was up under where we checked for the mismatch.

@AndyAyersMS
Copy link
Member Author

One of the partial comp failures was a large offset case in the prolog. Fixed this to use the initReg like we do elsewhere to hold larger offsets.

The other partial comp failures are cases where we throw from a catch in a method with a generic catch. Seems to suggest either something is off with the PSPSym or the reporting of the generic context in GcInfo. The catch is seemingly entered properly, but things go off the rails when it throws. This only happens for catches with shared catch clause types (eg ref class instantiations), again suggesting the runtime is unable to find the right information about the actual type being caught.

Per genSetPSPSym, PSPSym value is different for Arm64 than for x64, which I missed in my initial commit above. In my local fork I have updated it to what I think is the right value (CallerSP) but the tests are still not passing. So, need to keep digging.

@AndyAyersMS
Copy link
Member Author

Simple repro case (for partial comp):

using System;

public class Exception<T> : Exception
{
}

class X
{
    static int x = -1;
    public static int Main()
    {
        F<string>();
        return x;
    }
    
    public static void F<T>()
    {
        try 
        {
            try 
            {
                throw new Exception<T>();
            }
            catch(Exception) 
            {
                throw;
            }
        }
        catch (Exception<T>) 
        {
            x = 100;
        }
    }
}

currently fails with:

TID 4399c: Jit_PartialCompilationPatchpoint: patchpoint [1] (0x0000FFFF3279AE08) TRIGGER
TID 4399c: JitPatchpointWorker: creating OSR version of Method=0x0000FFFF32B0FC78M (X::X.F[System.__Canon]()) at offset 0
TID 4399c: Jit_PartialCompilationPatchpoint: patchpoint [1] (0x0000FFFF3279AE08) TRANSITION to ip 0x0000FFFF3279AEA0

Assert failure(PID 276892 [0x0004399c], Thread: 276892 [0x4399c]): !"Cannot return exact class instantiation when we are requested to."
    File: /home/andy/repos/runtime/src/coreclr/vm/stackwalk.cpp Line: 258
    Image: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm64.Checked/Tests/Core_Root/corerun

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Dec 15, 2021

I think what's going on is that there is an expectation that the PSP sym for both main method and funclet is the same delta from CallerSP. There is just one offset for this in the gc info.

This isn't the case currently; the OSR method delta includes the Tier0 frame size but we don't pad out the OSR funclet frame to match.

So we end up with a situation where we report PSPSym offset in gc info as -80 from Caller SP. For the OSR frame this is (-16 osr part of frame) + (-64 tier0 frame) and is fine.

But this offset doesn't work for the funclet; it's PSP sym is at (-16) with respect to Caller SP. So when we try and find the PSP sym value for the funclet we end up with the wrong value and things go downhill from there.

Options are to either pad the funclet frame or to have the OSR frame use the Tier0 frame's PSP Sym slot. The latter seems appealing but may not be possible as the PSP sym location varies depending on which regs get saved, and this differs between Tier0 frame and OSR frame (eg in this case the Tier0 PSP sym offset is at (-8).). If we tried to normalize for that by saving PSPSym first after varargs we might run into issues because the Tier0 frame spills varargs at top of frame and there's no reason for the OSR frame to do this (but see note).

So as ugly as it seems, we may need to pad the funclet frame to include a phantom tier0 frame.

@AndyAyersMS
Copy link
Member Author

Still thinking about the funclet PSP issue (see note) but pushed the other fixes I have to get wider testing on them.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Dec 16, 2021

Arm64 varargs may also need looking into. May simply be the OSR method doesn't need to allocate and populate its own varargs spill space as the spilling done by Tier0 should suffice. So might just be an optimization.

Note funclet frames currently reserve space for varargs that they don't spill. So there is a precedent of sorts.

@AndyAyersMS
Copy link
Member Author

/azp run runtime-jit-experimental

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

Padding the OSR funclets mostly works, it seems. It causes one SP adjust to get too large to be encoded by a type3 frame, and there's one other failure.

@AndyAyersMS
Copy link
Member Author

Padding the OSR funclets mostly works, it seems. It causes one SP adjust to get too large to be encoded by a type3 frame, and there's one other failure

The other failure was in fgIsThrowHlpBlk not handling an empty block. Didn't root cause this as it seems like this utility should handle the empty block case.

I've pushed the funclet padding approach further -- while ugly -- it seems to be generally corrrect and passing the tests we have, but I think in the long run we'll want a different approach. For instance, if the tier0 frame is large we would need to probe in the funclet prolog; I am not doing that yet.

JIT/Regression/JitBlue/GitHub_21990 + partial compilation is a good stress test here -- a very large tier0 frame inducing very large padded funclets under OSR.

It seems like we could allocate the PSPSym above the saved regs (either top of frame or just below the varargs area). That way the OSR frame could "reuse" the tier0 frame's PSP sym and the OSR funclet frames would just pad for varargs like they do now and everone would agree on the caller-relative offset. So PSPSym would always be at offset -8 or -40 depending. This would also potentially free up the restriction that funclets must save / restore the same set of registers, if we ever wanted to go after smaller funclet prolog/epilogs.

@BruceForstall any gotchas you can think of in changing this aspect of arm64 frame gen? If you think moving the PSPSym up is viable, I'll work it up as a separate change in the new year.

In the meantime, I'll push up these changes and (since they only impact OSR) propose checking things in as they are now as it is mostly functional.

@AndyAyersMS
Copy link
Member Author

/azp run runtime-jit-experimental

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Copy link
Contributor

fyi, the CLR ABI doc talks about PSPSym here: https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md#the-pspsym-and-funclet-parameters. If there are any additional clarifications we should make, e.g., to design questions, it would be good to update that location.

I think the main reason the PSPSym lives at a lower address than the callee saved registers is that establishing the PSPSym is not part of the OS unwind prolog, however, the allocated space and the code to write the PSPSym don't have to be that tightly coupled. So, even if we left space for it at a high location, we wouldn't want to generate the code to store it until after the unwind prolog. It's stored with an offset from SP, so you'd want to ensure the positive offset from RSP to the PSPSym slot was always encodable. Also, we want to make sure the epilog codes are a mirror of the prolog codes as much as possible, so you'd want to check the prolog/epilog sequence so the codes would match.

It probably doesn't need to match location (above versus below callee saved registers) on ARM32, as long as the code and comments are clear where it lives in each case.

It doesn't seem like there are any blocking issues to move it up.

@AndyAyersMS
Copy link
Member Author

Still seeing some encoding failures with OSR / PC:

Assert failure(PID 48 [0x00000030], Thread: 48 [0x0030]): Assertion failed '!"Instruction cannot be encoded: IF_LS_3C"' in 'Microsoft.CodeAnalysis.CSharp.Symbols.Metadata.PE.PENamedTypeSymbol:LoadMembers():this' during 'Generate code' (IL size 710)

Also some minor diffs when OSR/PC is not enabled.

Will dig in...

@AndyAyersMS
Copy link
Member Author

Those unexpected diffs are coming from fgIsThrowHlpBlk -- it needs to check differently for LIR.

@AndyAyersMS
Copy link
Member Author

/azp run runtime-jit-experimental

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

SPMI back to no diffs (text diffs in some OSR methods for windows x64).
Jit-Experimental showing "expected" failures in x64 R2R tests. All arm64 testing passed.

@AndyAyersMS AndyAyersMS merged commit 0206926 into dotnet:main Dec 17, 2021
@AndyAyersMS
Copy link
Member Author

Looks like moving the PSP sym up breaks assumptions for packed unwind, namely that one can compute the offset of the callee saves area by just looking at some header bits. And it appears the jit only knows how to emit packed unwind.

https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#packed-unwind-data shows one such computation for #savsz.

So options here are to either emit more complex unwind for methods with EH (or just for OSR methods with EH) or else keep pushing on fixing issues with large padded funclet frames.

@BruceForstall
Copy link
Contributor

This seems wrong; the JIT never generates the packed unwind format. The JIT always generates .xdata, and the VM constructs the appropriate .pdata.

@AndyAyersMS
Copy link
Member Author

Thanks -- looks like I was confused then about why my changes were failing. I'd still like to try and move the PSP sym, so I should revisit this.

@ghost ghost locked as resolved and limited conversation to collaborators Feb 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants