Managed StelemRef and LdelemaRef #32722


Merged
merged 11 commits into dotnet:master on Mar 4, 2020

Conversation

@VSadov (Member) commented Feb 24, 2020

Moving StelemRef and LdelemaRef JIT helpers to managed code.

  • significantly improves GC suspension latency in code that frequently uses these helpers
    (it is not uncommon to use array element accessors in loops)
  • reduces duplication between the C++ implementation and the platform/OS-specific assembly flavors.

Fixes: #32683
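
For orientation, a rough sketch of the checks a managed store helper has to perform (illustrative only; the real helper in System.Private.CoreLib works on raw method-table pointers and writes through a byref with an explicit GC write barrier, rather than re-invoking the array store as this sketch does):

using System;

static class StelemRefSketch
{
    // Illustrative stand-in for the checks StelemRef performs on a reference-type array store.
    static void StoreElement(object?[] array, int index, object? value)
    {
        if (array is null)
            throw new NullReferenceException();

        if (value is null)
        {
            array[index] = null;      // storing null can never violate array covariance
            return;
        }

        Type elementType = array.GetType().GetElementType()!;
        if (elementType == value.GetType() || elementType == typeof(object))
        {
            array[index] = value;     // exact match (or object[]): no further checks needed
            return;
        }

        if (!elementType.IsAssignableFrom(value.GetType()))
            throw new ArrayTypeMismatchException();

        array[index] = value;         // slower path, taken only after the full assignability check
    }
}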

@VSadov VSadov force-pushed the arrElem branch 2 times, most recently from 83d52f2 to ba23941 on February 29, 2020 23:17
@VSadov VSadov changed the title [WIP] Managed StelemRef and LdelemaRef Managed StelemRef and LdelemaRef Mar 2, 2020
@VSadov VSadov marked this pull request as ready for review March 2, 2020 03:14
@VSadov VSadov requested a review from jkotas March 2, 2020 03:14
@VSadov (Member, Author) commented Mar 2, 2020

I think this one is ready for review.

The change significantly improves responsiveness to GC suspension in tight loops that use these helpers: suspension latency drops from 200-500ms to under 20ms, and most of the time to much less than that.
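
One way to observe the suspension-latency effect (a hedged sketch, not the harness used for the numbers above): keep one thread busy in the array-store helper and time how long an induced GC takes from another thread, since GC.Collect must first suspend the busy thread.

using System;
using System.Diagnostics;
using System.Threading;

class SuspensionLatencyProbe
{
    static object[] arr = new Exception[10000];
    static object val = new Exception();

    static void Main()
    {
        // Keep a background thread busy in the covariant array-store helper.
        new Thread(() => { while (true) arr[1] = val; }) { IsBackground = true }.Start();

        for (int i = 0; i < 10; i++)
        {
            Thread.Sleep(100);
            var sw = Stopwatch.StartNew();
            GC.Collect();   // has to suspend the busy thread before it can do anything
            Console.WriteLine($"GC.Collect took {sw.ElapsedMilliseconds} ms");
        }
    }
}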

Perf is comparable but slightly worse. In directed microbenchmarks, the fastest scenarios could be 10-20% slower.

The code is roughly the same as before. The biggest difference is the PreStub indirection, which is responsible for the extra cycles. These are very fast helpers, so even small overheads are noticeable.
With some hacks the PreStub indirection can be defeated. After that, perf becomes roughly the same as the baseline in the fastest scenarios (exact match, null store) and actually faster when a type check is needed.

I am not adding the PreStub hacks here, since they are just hacks and not production ready. We should consider that as a separate change; it would help other managed JIT helpers too.
The idea is to mark the MethodDesc as a JIT helper and then, whenever its native code is updated, update the corresponding cell in the JIT helper table as well.

@jkotas (Member) commented Mar 2, 2020

The idea is to tell MethodDesc that it is a JIT helper and then whenever native code is updated, update the cell in the JIT helper table as well.

The indirection is because of tiered compilation. It may be better to focus on #965 and fix it everywhere.

@VSadov (Member, Author) commented Mar 2, 2020

Even without tiering we have the indirection. We ask for the multi-callable entry point very early, so we always get a stub. That is expected, since we do not want to force JITing at that point.

Later the stub may be patched to point to the actual native code (or better native code), but the JIT helper table still points to the stub.

Not sure if this is a subset of #965. In some generalized sense, perhaps, but just fixing the scenario in #965 may not fix this one.

@VSadov (Member, Author) commented Mar 2, 2020

From the perf perspective:

On the following sample, with default behavior (i.e. tiering enabled, corlib crossgenned, WKS GC),

I get about:
186ms before the change,
210ms after.

That is a ~13% regression, but it is a bit contextual; some code changes could make it a bit better or worse.

using System;
using System.Diagnostics;
class Program
{
    static object[] arr = new Exception[10000];
    static Exception val = new Exception();

    static void Work()
    {
        var v = val;
        var a = arr;

        for (int i = 0; i < 100000000; i++)
        {
            a[1] = v;
        }
    }

    static void Main()
    {
        for (; ; )
        {
            var sw = Stopwatch.StartNew();
            Work();
            sw.Stop();

            {
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
        }
    }
}

@jkotas (Member) commented Mar 2, 2020

That is ~ 13% regression

Does it vary between platforms?

@VSadov (Member, Author) commented Mar 2, 2020

I did not check other platforms, but win-x64 seems to have the most optimized code originally, so I assumed that if there is any regression, it would be worst on win-x64.

Is any other platform interesting in particular?

@VSadov (Member, Author) commented Mar 2, 2020

Re: GetValueInternal

Did you mean like the following:

            notExactMatch:
                if (elementType == (void*)RuntimeTypeHandle.GetValueInternal(typeof(object).TypeHandle))
                    goto doWrite;

That does not seem to get intrinsified and is much slower.

00007ffa`d1611c79 48b948525dd1fa7f0000 mov     rcx, 7FFAD15D5248h
00007ffa`d1611c83 e8b8fbb15f           call    coreclr!JIT_GetRuntimeType (00007ffb`31131840)
00007ffa`d1611c88 488bc8               mov     rcx, rax
00007ffa`d1611c8b e8e045d15f           call    coreclr!RuntimeTypeHandle::GetValueInternal (00007ffb`31326270)
00007ffa`d1611c90 483bf8               cmp     rdi, rax

The following seems an interesting alternative:

            notExactMatch:
                if (arr.GetType() == typeof(object[]))
                    goto doWrite;

It is slightly more code, but overall it has performance similar to the variant with a cached static, without needing the static.
I will check that other scenarios are not affected much. If they are not, I will keep this variant.
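
For reference, a small standalone example of the shape being relied on here (the method name is made up): the JIT recognizes obj.GetType() == typeof(SomeType) and can compile it down to a direct method-table pointer compare, which is why this variant avoids the two calls visible in the disassembly above.

static class TypeCheckExample
{
    // Compiles down to a single method-table pointer comparison on the hot path,
    // with no JIT_GetRuntimeType / GetValueInternal calls.
    static bool IsExactlyObjectArray(object[] arr) => arr.GetType() == typeof(object[]);
}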

@jkotas (Member) commented Mar 2, 2020

Is any other platform interesting in particular?

Win x86; any arm64 (one of Win or Linux is enough - the numbers should be the same)

@VSadov (Member, Author) commented Mar 2, 2020

On x86 the same sample as posted above shows a ~30% regression (340ms vs. 260ms baseline).

Again, most of the regression disappears if I add the back-patching hack and let the helper be JITed (and the helper table updated) before measuring the Work method.

In fact the managed implementation becomes a few percent faster, but these measurements are sensitive and can move a few percent due to seemingly unrelated changes, so I would consider that noise.

@jkotas (Member) commented Mar 2, 2020

How bad is the back-patching hack? 30% sounds too much to take. I think I would be ok with taking a workaround for it that gets cleaned up in the future.

@VSadov (Member, Author) commented Mar 2, 2020

An example of the hack: VSadov@edb8ae9

It is a minimal implementation that supports just one helper. It is basically at the proof-of-concept stage.

To make it more reasonable, we need:

  • To support more than one helper, we need to associate the helper ID with a MethodDesc.
    There is a shortage of bits on the instance, though, and this would be a rare case.
    Perhaps use a bit in the MethodDesc to indicate it is a helper, plus store a MethodDesc ref in the helper table somewhere (it is not a big table, we can just scan it for this).
  • We have a case where the same MethodDesc represents two helpers.
    Not a big issue though. We could pick one and leave the other for now, or just duplicate the helper.
  • We need to make sure this is the place that sees all cases where native code is updated.
    I am new to this area; this may not be the [only] choke point.

@jkotas (Member) commented Mar 2, 2020

This feels fragile. It means that code JITed before the helper got (tiered) JITed will be stuck using the slow version of the helper.

What is the code quality of the R2R version of the helper? Would it be an option to just use that?

@VSadov (Member, Author) commented Mar 2, 2020

The code that was JITed before the helper was updated will still reach the latest/best code, just via a jmp.

I think a bigger concern would be if we update the native code twice - to an unoptimized version and then to an optimized one. If we patch the helper on the first update, some code may end up forever calling the slow version. That can be mitigated by patching only when we update to the last/best variant. I assume we can know that.

Updating all the JITed code once the indirection is no longer needed is a much bigger problem. I assume that is what #965 covers. Once there is a mechanism for that, this case could plug into it.

The R2R version has other issues - it does not tail call (we are sensitive to tail-calling the write barrier), statics are accessed via a call, etc. It could cost more than an extra jmp.
Right now it is hard to measure: if I just disable tiering, I end up with both the R2R codegen and the stub, and that is obviously slower.

@VSadov (Member, Author) commented Mar 3, 2020

On Linux arm64 with the same sample:

  • baseline
    1251ms.

  • with changes
    1347ms, a ~7% regression

The presence of the PreStub hack does not seem to have any meaningful effect. (I did check in the debugger that the hack works on ARM64 and does eliminate the indirection.)
My guess is that the whole thing is more expensive than on x64 and possibly dominated by the cost of the write barrier, so the extra jump matters much less.

@jkotas (Member) commented Mar 3, 2020

I think we need to do something about the regression. The options that I can think of:

  1. (easier) JIT the helper synchronously. It will mean one or two more methods are JITed during startup, which is not the end of the world.
  2. (harder) Fix the R2R tailcall problem

@VSadov (Member, Author) commented Mar 3, 2020

The #1 fix implies actually forcing the method to JIT - in the PrepareMethod sense, right?
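
For illustration, forcing a method to be compiled ahead of its first call from managed code looks roughly like this (a sketch with made-up names; the fix discussed here happens inside the VM when the helper table is populated, not via reflection):

using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class EagerJit
{
    // Compiles the named method now, instead of lazily on first invocation.
    static void ForceJit(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(
            methodName,
            BindingFlags.Static | BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic)!;

        RuntimeHelpers.PrepareMethod(method.MethodHandle);
    }
}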

@jkotas (Member) commented Mar 4, 2020

Tests failing...

@VSadov (Member, Author) commented Mar 4, 2020

        // Do the instrumentation and publish atomically, so that the
        // instrumentation data always matches the published code.
        CrstHolder gcCoverLock(&m_GCCoverCrst);

crst not initialized . . .

@@ -195,6 +195,25 @@ void ECall::PopulateManagedCastHelpers()
pDest = pMD->GetMultiCallableAddrOfCode();
SetJitHelperFunction(CORINFO_HELP_UNBOX, pDest);

// Array element accessors are more perf sensitive than other managed helpers and indirection
// costs introduced by PreStub could be noticeable (7% to 30% depending on platform).

Add a comment that this should be revisited once #5857 is fixed?

@jkotas (Member) commented Mar 4, 2020

@davidwrighton This will make us JIT two more small methods during startup for now. I think it is ok to take this and work on fixing the JITing separately. Are you ok with this as well?

@jkotas (Member) left a comment

Thanks!

@davidwrighton (Member) commented

@jkotas, adding 2 tiny methods to JIT on startup is not significant. If it solves a real problem, it's ok. (Not ideal, but ok.)

@jkotas jkotas added the tenet-performance Performance related issue label Mar 4, 2020
@davidwrighton (Member) left a comment

:shipit:

@VSadov (Member, Author) commented Mar 4, 2020

Thanks!!

@VSadov VSadov merged commit 3fc245e into dotnet:master Mar 4, 2020
@benaadams (Member) commented

The helper calls in the dasm look like they always remain helper calls. A question, though: could they inline so that, for example, in LdelemaRef the if (elementType == type) test could be hoisted out of the loop by the JIT?

@VSadov (Member, Author) commented Mar 6, 2020

What you are suggesting is introducing these calls early - as method calls - and then letting the regular inlining/CSE deal with them appropriately.

The thought did occur while implementing these. It would be interesting, especially for Ldelema.

Stelem could be tricky, since there is a call to the write barrier method and the JIT does not know that it has the same semantics as an assignment.

@benaadams (Member) commented

Yeah, I was having a look at the asm since it's now emitted for JitDisasm=* 😄

StelemRef is more bulky; however, Ldelema would likely be more heavily used? (Reads are generally more frequent than writes - else why write?)

Additionally, if it is used in a for (i = 0; i < a.Length; i++) type loop and Ldelema inlines, there is a possibility for the array bounds check to also be eliminated (as it is the for condition).

@VSadov (Member, Author) commented Mar 6, 2020

Ldelema is used when an element is used by reference - foo(ref arr[1]); ref var x = ref arr[2]; etc. Not sure how common element byrefs are overall; they could be less common than writes.
They can be used in tight loops, though.

Ordinary reads are just reads, without helpers.
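
To make the pattern concrete, a small hand-written example of element byrefs in a loop (names are illustrative): each ref a[i] on an object[] goes through the LdelemaRef helper, which must verify the array's exact element type before handing out the byref - the check that could potentially be hoisted if the helper inlined.

static class ByRefElementExample
{
    static void FillIfNull(ref object? slot)
    {
        // Writing through the byref is only type-safe because LdelemaRef already
        // verified the exact element type when the byref was produced.
        slot ??= new object();
    }

    static void Touch(object?[] a)
    {
        for (int i = 0; i < a.Length; i++)
            FillIfNull(ref a[i]);   // each iteration goes through the LdelemaRef helper today
    }
}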

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
je ThrowNullReferenceException

; we only want the lower 32-bits of edx, it might be dirty
or edx, edx
@VSadov (Member, Author) commented

The original code truncates the index to 32 bits.

Labels: area-VM-coreclr, tenet-performance
Linked issue: Make JIT_Stelem_Ref more responsive to suspension requests (#32683)