Managed StelemRef and LdelemaRef #32722


Merged
merged 11 commits into dotnet:master on Mar 4, 2020

Conversation

@VSadov (Member) commented Feb 24, 2020

Moving StelemRef and LdelemaRef JIT helpers to managed code.

  • significantly improves GC suspension latency in code that frequently uses these helpers
    (it is not uncommon to use array element accessors in loops)
  • reduces duplication between the C++ implementation and the platform/OS-specific assembly flavors.

Fixes: #32683
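
For orientation, a rough sketch of the checks a managed store helper has to perform (illustrative only; the real helper in System.Private.CoreLib works on raw method-table pointers and writes through a byref with an explicit GC write barrier, rather than re-invoking the array store as this sketch does):

using System;

static class StelemRefSketch
{
    // Illustrative stand-in for the checks StelemRef performs on a reference-type array store.
    static void StoreElement(object?[] array, int index, object? value)
    {
        if (array is null)
            throw new NullReferenceException();

        if (value is null)
        {
            array[index] = null;      // storing null can never violate array covariance
            return;
        }

        Type elementType = array.GetType().GetElementType()!;
        if (elementType == value.GetType() || elementType == typeof(object))
        {
            array[index] = value;     // exact match (or object[]): no further checks needed
            return;
        }

        if (!elementType.IsAssignableFrom(value.GetType()))
            throw new ArrayTypeMismatchException();

        array[index] = value;         // slower path, taken only after the full assignability check
    }
}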

@VSadov VSadov force-pushed the arrElem branch 2 times, most recently from 83d52f2 to ba23941 on February 29, 2020 23:17
@VSadov VSadov changed the title [WIP] Managed StelemRef and LdelemaRef Managed StelemRef and LdelemaRef Mar 2, 2020
@VSadov VSadov marked this pull request as ready for review March 2, 2020 03:14
@VSadov VSadov requested a review from jkotas March 2, 2020 03:14
@VSadov (Member, Author) commented Mar 2, 2020

I think this one is ready for review.

The change significantly improves responsiveness to GC suspension in tight loops that use these helpers: suspension latency drops from 200-500ms to under 20ms, and most of the time to much less than that.
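
One way to observe the suspension-latency effect (a hedged sketch, not the harness used for the numbers above): keep one thread busy in the array-store helper and time how long an induced GC takes from another thread, since GC.Collect must first suspend the busy thread.

using System;
using System.Diagnostics;
using System.Threading;

class SuspensionLatencyProbe
{
    static object[] arr = new Exception[10000];
    static object val = new Exception();

    static void Main()
    {
        // Keep a background thread busy in the covariant array-store helper.
        new Thread(() => { while (true) arr[1] = val; }) { IsBackground = true }.Start();

        for (int i = 0; i < 10; i++)
        {
            Thread.Sleep(100);
            var sw = Stopwatch.StartNew();
            GC.Collect();   // has to suspend the busy thread before it can do anything
            Console.WriteLine($"GC.Collect took {sw.ElapsedMilliseconds} ms");
        }
    }
}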

Perf is comparable but slightly worse. In directed microbenchmarks, the fastest scenarios could be 10-20% slower.

The code is roughly the same as before. The biggest difference is the PreStub indirection, which is responsible for the extra cycles. These are very fast helpers, so even small overheads are noticeable.
With some hacks the PreStub indirection can be defeated. After that, perf becomes roughly the same as the baseline in the fastest scenarios (exact match, null store) and actually faster when a type check is needed.

I am not adding the PreStub hacks here, since they are just hacks and not production ready. We should consider that as a separate change; it would help other managed JIT helpers too.
The idea is to mark the MethodDesc as a JIT helper and then, whenever its native code is updated, update the corresponding cell in the JIT helper table as well.

@jkotas (Member) commented Mar 2, 2020

The idea is to tell MethodDesc that it is a JIT helper and then whenever native code is updated, update the cell in the JIT helper table as well.

The indirection is because of tiered compilation. It may be better to focus on #965 and fix it everywhere.

@VSadov (Member, Author) commented Mar 2, 2020

Even without tiering we have the indirection. We ask for the multi-callable entry point very early, so we always get a stub. That is expected, since we do not want to force JITing at that point.

Later the stub may be patched to point to the actual native code (or better native code), but the JIT helper table still points to the stub.

Not sure if this is a subset of #965. In some generalized sense, perhaps, but just fixing the scenario in #965 may not fix this one.

@VSadov (Member, Author) commented Mar 2, 2020

From the perf perspective:

On the following sample, with default behavior (i.e. tiering enabled, corlib crossgenned, WKS GC),

I get about:
186ms before the change,
210ms after.

That is a ~13% regression, but it is a bit contextual; some code changes could make it a bit better or worse.

using System;
using System.Diagnostics;
class Program
{
    static object[] arr = new Exception[10000];
    static Exception val = new Exception();

    static void Work()
    {
        var v = val;
        var a = arr;

        for (int i = 0; i < 100000000; i++)
        {
            a[1] = v;
        }
    }

    static void Main()
    {
        for (; ; )
        {
            var sw = Stopwatch.StartNew();
            Work();
            sw.Stop();

            {
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
        }
    }
}

@jkotas (Member) commented Mar 2, 2020

That is ~ 13% regression

Does it vary between platforms?

@VSadov (Member, Author) commented Mar 2, 2020

I did not check other platforms, but win-x64 seems to have the most optimized code originally, so I assumed that if there is any regression, it would be worst on win-x64.

Is any other platform interesting in particular?

@VSadov (Member, Author) commented Mar 2, 2020

Re: GetValueInternal

Did you mean like the following:

            notExactMatch:
                if (elementType == (void*)RuntimeTypeHandle.GetValueInternal(typeof(object).TypeHandle))
                    goto doWrite;

That does not seem to get intrinsified and is much slower.

00007ffa`d1611c79 48b948525dd1fa7f0000 mov     rcx, 7FFAD15D5248h
00007ffa`d1611c83 e8b8fbb15f           call    coreclr!JIT_GetRuntimeType (00007ffb`31131840)
00007ffa`d1611c88 488bc8               mov     rcx, rax
00007ffa`d1611c8b e8e045d15f           call    coreclr!RuntimeTypeHandle::GetValueInternal (00007ffb`31326270)
00007ffa`d1611c90 483bf8               cmp     rdi, rax

The following seems an interesting alternative:

            notExactMatch:
                if (arr.GetType() == typeof(object[]))
                    goto doWrite;

It is slightly more code, but overall it has performance similar to the variant with a cached static, without needing the static.
I will check that other scenarios are not affected much. If they are not, I will keep this variant.
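
For reference, a small standalone example of the shape being relied on here (the method name is made up): the JIT recognizes obj.GetType() == typeof(SomeType) and can compile it down to a direct method-table pointer compare, which is why this variant avoids the two calls visible in the disassembly above.

static class TypeCheckExample
{
    // Compiles down to a single method-table pointer comparison on the hot path,
    // with no JIT_GetRuntimeType / GetValueInternal calls.
    static bool IsExactlyObjectArray(object[] arr) => arr.GetType() == typeof(object[]);
}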

@jkotas (Member) commented Mar 2, 2020

Is any other platform interesting in particular?

Win x86; any arm64 (one of Win or Linux is enough - the numbers should be the same)

@VSadov (Member, Author) commented Mar 2, 2020

On x86 the same sample as posted above shows a ~30% regression (340ms vs. 260ms baseline).

Again, most of the regression disappears if I add the back-patching hack and let the helper be JITed (and the helper table updated) before measuring the Work method.

In fact the managed implementation becomes a few percent faster, but these measurements are sensitive and can move a few percent due to seemingly unrelated changes, so I would consider that noise.

@jkotas (Member) commented Mar 2, 2020

How bad is the back-patching hack? 30% sounds too much to take. I think I would be ok with taking a workaround for it that gets cleaned up in the future.

@VSadov (Member, Author) commented Mar 2, 2020

An example of the hack: VSadov@edb8ae9

It is a minimal implementation that supports just one helper. It is basically at the proof-of-concept stage.

To make it more reasonable, we need:

  • To support more than one helper, we need to associate the helper ID with a MethodDesc.
    There is a shortage of bits on the instance, though, and this would be a rare case.
    Perhaps use a bit in the MethodDesc to indicate it is a helper, plus store a MethodDesc ref in the helper table somewhere (it is not a big table, we can just scan it for this).
  • We have a case where the same MethodDesc represents two helpers.
    Not a big issue though. We could pick one and leave the other for now, or just duplicate the helper.
  • We need to make sure this is the place that sees all cases where native code is updated.
    I am new to this area; this may not be the [only] choke point.

@jkotas (Member) commented Mar 2, 2020

This feels fragile. It means that code JITed before the helper got (tiered) JITed will be stuck using the slow version of the helper.

What is the code quality of the R2R version of the helper? Would it be an option to just use that?

@VSadov (Member, Author) commented Mar 2, 2020

The code that was JITed before the helper was updated will still reach the latest/best code, just via a jmp.

I think a bigger concern would be if we update the native code twice - to an unoptimized version and then to an optimized one. If we patch the helper on the first update, some code may end up forever calling the slow version. That can be mitigated by patching only when we update to the last/best variant. I assume we can know that.

Updating all the JITed code once the indirection is no longer needed is a much bigger problem. I assume that is what #965 covers. Once there is a mechanism for that, this case could plug into it.

The R2R version has other issues - it does not tail call (we are sensitive to tail-calling the write barrier), statics are accessed via a call, etc. It could cost more than an extra jmp.
Right now it is hard to measure: if I just disable tiering, I end up with both the R2R codegen and the stub, and that is obviously slower.

@VSadov (Member, Author) commented Mar 3, 2020

On Linux arm64 with the same sample:

  • baseline
    1251ms.

  • with changes
    1347ms, a ~7% regression

The presence of the PreStub hack does not seem to have any meaningful effect. (I did check in the debugger that the hack works on ARM64 and does eliminate the indirection.)
My guess is that the whole thing is more expensive than on x64 and possibly dominated by the cost of the write barrier, so the extra jump matters much less.

@jkotas (Member) commented Mar 3, 2020

I think we need to do something about the regression. The options that I can think of:

  1. (easier) JIT the helper synchronously. It will mean one or two more methods are JITed during startup, which is not the end of the world.
  2. (harder) Fix the R2R tailcall problem

@VSadov (Member, Author) commented Mar 3, 2020

The #1 fix implies actually forcing the method to JIT - in the PrepareMethod sense, right?
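
For illustration, forcing a method to be compiled ahead of its first call from managed code looks roughly like this (a sketch with made-up names; the fix discussed here happens inside the VM when the helper table is populated, not via reflection):

using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class EagerJit
{
    // Compiles the named method now, instead of lazily on first invocation.
    static void ForceJit(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(
            methodName,
            BindingFlags.Static | BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic)!;

        RuntimeHelpers.PrepareMethod(method.MethodHandle);
    }
}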

@jkotas (Member) commented Mar 4, 2020

Tests failing...

@VSadov (Member, Author) commented Mar 4, 2020

        // Do the instrumentation and publish atomically, so that the
        // instrumentation data always matches the published code.
        CrstHolder gcCoverLock(&m_GCCoverCrst);

crst not initialized . . .

@@ -195,6 +195,25 @@ void ECall::PopulateManagedCastHelpers()
pDest = pMD->GetMultiCallableAddrOfCode();
SetJitHelperFunction(CORINFO_HELP_UNBOX, pDest);

// Array element accessors are more perf sensitive than other managed helpers and indirection
// costs introduced by PreStub could be noticeable (7% to 30% depending on platform).

Add a comment that this should be revisited once #5857 is fixed?

@jkotas (Member) commented Mar 4, 2020

@davidwrighton This will make us JIT two more small methods during startup for now. I think it is ok to take this and work on fixing the JITing separately. Are you ok with this as well?

@jkotas (Member) left a comment

Thanks!

@davidwrighton (Member) commented

@jkotas, adding 2 tiny methods to JIT on startup is not significant. If it solves a real problem, it's ok. (Not ideal, but ok.)

@jkotas jkotas added the tenet-performance Performance related issue label Mar 4, 2020
@davidwrighton (Member) left a comment

:shipit:

@VSadov (Member, Author) commented Mar 4, 2020

Thanks!!

@VSadov VSadov merged commit 3fc245e into dotnet:master Mar 4, 2020
@benaadams (Member) commented

The helper calls in the dasm look like they always remain helper calls. A question, though: could they inline so that, for example, in LdelemaRef the if (elementType == type) test could be hoisted out of the loop by the JIT?

@VSadov (Member, Author) commented Mar 6, 2020

What you are suggesting is introducing these calls early - as method calls - and then letting the regular inlining/CSE deal with them appropriately.

The thought did occur while implementing these. It would be interesting, especially for Ldelema.

Stelem could be tricky, since there is a call to the write barrier method and the JIT does not know that it has the same semantics as an assignment.

@benaadams (Member) commented

Yeah, I was having a look at the asm since it's now emitted for JitDisasm=* 😄

StelemRef is more bulky; however, Ldelema would likely be more heavily used? (Reads are generally more frequent than writes - else why write?)

Additionally, if it is used in a for (i = 0; i < a.Length; i++) type loop and Ldelema inlines, there is a possibility for the array bounds check to also be eliminated (as it is the for condition).

@VSadov (Member, Author) commented Mar 6, 2020

Ldelema is used when an element is used by reference - foo(ref arr[1]); ref var x = ref arr[2]; etc. Not sure how common element byrefs are overall; they could be less common than writes.
They can be used in tight loops, though.

Ordinary reads are just reads, without helpers.
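
To make the pattern concrete, a small hand-written example of element byrefs in a loop (names are illustrative): each ref a[i] on an object[] goes through the LdelemaRef helper, which must verify the array's exact element type before handing out the byref - the check that could potentially be hoisted if the helper inlined.

static class ByRefElementExample
{
    static void FillIfNull(ref object? slot)
    {
        // Writing through the byref is only type-safe because LdelemaRef already
        // verified the exact element type when the byref was produced.
        slot ??= new object();
    }

    static void Touch(object?[] a)
    {
        for (int i = 0; i < a.Length; i++)
            FillIfNull(ref a[i]);   // each iteration goes through the LdelemaRef helper today
    }
}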

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
je ThrowNullReferenceException

; we only want the lower 32-bits of edx, it might be dirty
or edx, edx
@VSadov (Member, Author) commented

The original code truncates the index to 32 bits.

Labels: area-VM-coreclr, tenet-performance
Linked issue: Make JIT_Stelem_Ref more responsive to suspension requests (#32683)