-
Notifications
You must be signed in to change notification settings - Fork 5k
Managed StelemRef and LdelemaRef #32722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
83d52f2
to
ba23941
Compare
src/coreclr/src/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
I think this one is ready for review. The change significantly improves responsiveness to GC suspension in tight loops with these helpers. From 200-500ms. it becomes less than 20ms. and most of the time much less than that. Perf is comparable but slightly worse. In directed microbenchmarks, the fastest scenarios could be 10-20% slower. The code is roughly the same as before. The biggest difference is PreStub indirection and it is responsible for the extra cycles. These are very fast helpers, so even small overheads are noticeable. I am not adding the PreStub hacks here, since they are just hacks and not production ready. We should consider that as a separate change. That would help other managed JIT helpers too. |
...clr/src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/src/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
The indirection is because of tiered compilation. It may be better to focus on #965 and fix it everywhere. |
Even without tiering we have indirection. We ask for multicallable entry point very early, so we always get a stub. That is expected, since we do not want to force JIT-ting at that point. Later the stub may be patched to point to actual native code (or better native code), but JIT helper table still points to the stub. Not sure if this is a subset of #965 . In some generalized sense - perhaps, but just fixing the scenario in #965 may not fix this one. |
For the perf perspective: On the following sample, with default behavior (i.e. tiering enabled, corlib crossgenned, WKS GC ) I get about: That is ~ 13% regression, but it is a bit contextual. Some code changes could make it a bit better or worse, using System;
using System.Diagnostics;
class Program
{
static object[] arr = new Exception[10000];
static Exception val = new Exception();
static void Work()
{
var v = val;
var a = arr;
for (int i = 0; i < 100000000; i++)
{
a[1] = v;
}
}
static void Main()
{
for (; ; )
{
var sw = Stopwatch.StartNew();
Work();
sw.Stop();
{
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
}
} |
src/coreclr/src/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
Does it vary between platforms? |
I did not check other platforms, but win-x64 seems to have the most optimized code originally, so I assumed if any regression, it must be worst on win-x64. Is any other platform interesting in particular? |
Re: Did you mean like the following: notExactMatch:
if (elementType == (void*)RuntimeTypeHandle.GetValueInternal(typeof(object).TypeHandle))
goto doWrite; That does not seem to intrincify and is much slower.
The following seems an interesting alternative: notExactMatch:
if (arr.GetType() == typeof(object[]))
goto doWrite; It is slightly more code, but overall has similar performance to the variant with cached static, without the static. |
Win x86; any arm64 (one of Win or Linux is enough - the numbers should be the same) |
src/coreclr/src/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
On x86 the same sample as posted above shows ~30% regression. (340ms. vs. 260ms. baseline) Again - most of the regression disappears if I add the back-patching hack and let the helper to be JIT-ed (and the helper table updated) before going to measure In fact managed implementation becomes a few % faster, but these measurements are sensitive and can move a few percent for seemingly unrelated changes, so I would consider that noise. |
How bad is the back-patching hack? 30% sounds too much to take. I think I would be ok with taking a workaround for it that get cleaned up in future. |
An example of the hack - VSadov@edb8ae9 It is a minimal implementation that supports just 1 helper. It is basically at proof of concept stage. To make it more reasonable, we need:
|
This feels fragile. It means that the code JITed before the helper got (tiered) JITed will be stuck using the slow version of helper. What is the code quality of the R2R version of the helper? Would it be an option to just use that? |
The code that was jitted before the helper updated will still see the latest/best code, just via a I think a bigger concern would be if we update native code twice - to unoptimized version and then to optimized. If we patch the helper on the first update, some code may end up forever calling the slow version. That can be mitigated by patching only when we update to the last/best variant. I assume we can know that. Updating all the jit-ed code when indirection is no longer needed is a much bigger problem. I assume that is what #965 covers. Once there is mechanism for that, this case could plug into it. R2R version has other issues - it does not tailcall (we are sensitive to tailcalling the barrier), statics are accessed via a call, etc. It could cost more than an extra jmp. |
On Linux arm64 with the same sample:
The presence of PreStub hack does not seem to have any meaningfull effect. (I did check in debugger that the hack works on ARM64 and does eliminate the indirection) |
I think we need to do something about the regression. The options that I can think of:
|
|
Tests failing... |
// Do the instrumentation and publish atomically, so that the
// instrumentation data always matches the published code.
CrstHolder gcCoverLock(&m_GCCoverCrst); crst not initialized . . . |
@@ -195,6 +195,25 @@ void ECall::PopulateManagedCastHelpers() | |||
pDest = pMD->GetMultiCallableAddrOfCode(); | |||
SetJitHelperFunction(CORINFO_HELP_UNBOX, pDest); | |||
|
|||
// Array element accessors are more perf sensitive than other managed helpers and indirection | |||
// costs introduced by PreStub could be noticeable (7% to 30% depending on platform). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment that this should be revisited once #5857 is fixed?
@davidwrighton This will make us JIT two more small methods during startup for now. I think it is ok to take this and work on fixing the JITing separately. Are you ok with this as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@jkotas, adding 2 tiny methods to jit on startup is not significant. If it solves a real problem, its ok. (Not ideal, but ok) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!! |
The helper calls in the dasm look like they always remain helper calls. A question though is could they inline and for example in |
What you are suggesting is introducing these calls early - as method calls and then let the regular inlining/cse deal with them appropriately. The thought did occur while implementing these. It would be interesting, especially for Ldelema. Stelem could be tricky since there is a call to write barrier method and jit does not know that it has same semantics as assignment. |
Yea I was having a look at the asm since its now emitted for
Additionally if it was used in a |
Ordinary reads are just reads though, without helpers. |
je ThrowNullReferenceException | ||
|
||
; we only want the lower 32-bits of edx, it might be dirty | ||
or edx, edx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original code truncates the index to 32bit
Moving
StelemRef
andLdelemaRef
JIT helpers to managed code.(it is not uncommon to use array element accessors in loops)
Fixes: #32683