Skip to content

Alternate approach for HELPER_METHOD_FRAME removal for arithmetic div… #113286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

davidwrighton
Copy link
Member

@davidwrighton davidwrighton commented Mar 8, 2025

This is a combination of @am11's work in PR #109087 and some work to just rewrite the Windows X86 helpers in assembly.

  • For non 64-bit platforms, remove the helpers
  • For Unix platforms, rely on an implementation which uses an FCall to native code to invoke the various operations.
  • For Windows X86 on CoreCLR, rewrite the helpers in assembly with a tail-call approach for throwing.
    • Also it was noted that the existing helpers do a fair bit of unnecessary stack manipulation, and the manual rewrite avoids all of that. Maybe this will actually be faster. (Turns out it is about the same on performance. The real win would come from re-ordering the argument order, but that's a much more invasive change in the JIT, which I judge to be too risky)

…ide operations

- Instead of moving to managed
  - For non 64-bit platforms, remove the helpers
  - For Unix platforms, rely on [[clang::musttail]] to tail-call a helper which throws (current implementation still uses FCThrow, but that should be relatively easy to move to a call to the managed throw routines
  - For Windows X86, rewrite the helpers in assembly with a tail-call approach for throwing.
    - As part of writing the helpers in assembly, it was noted that the stack layout is swapped from the underyling CRT routines for 64bit arithmetic. Tweak the helper calls for those cases to pass in data in the order which the CRT is going to handle them efficiently. Do this for all architectures/OSs as other ones don't really care about this weirdness.
    - Also it was noted that the existing helpers do a fair bit of unnecessary stack manipulation, and the manual rewrite avoids all of that. Maybe this will actually be faster.

TODO: Bump jit version
TODO: Bump R2R version
TODO: Actually use managed helper to throw instead of keeping the FCThrow in place
Comment on lines 3473 to 3476
// Using the musttail attribute requires that we disble our protection against using return in the wrong place
#ifdef return
#undef return
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: I'm trying to remove the cases outside of HMFs that engage this infra as it continues to cause us problems in places.

{
if (divisor == 0)
{
[[clang::musttail]] return JIT_ThrowDivideByZero_RetInt();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can define MUSTTAIL for this:

#if TARGET_UNIX
#ifdef __clang__
#define MUSTTAIL [[clang::musttail]]
#else // __clang__
#define MUSTTAIL [[gnu::musttail]]
#endif // __clang__
#endif // TARGET_UNIX

{
FCALL_CONTRACT;

FCThrow(kOverflowException);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we directly inline them in JIT? there already are CORINFO_HELP_OVERFLOW and CORINFO_HELP_THROWDIVZERO exposed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is just an experiment to see how musttail works, and if it can be used to help solve the HMF problem without changing the JIT/etc for this issue. Given the number of test failures, something clearly didn't go completely right with my change here, so I want to understand what I've missed.

Comment on lines 37 to 38
// CORINFO_HELP_DBL2INT, CORINFO_HELP_DBL2UINT, and CORINFO_HELP_DBL2LONG get
// patched for CPUs that support SSE2 (P4 and above).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// CORINFO_HELP_DBL2INT, CORINFO_HELP_DBL2UINT, and CORINFO_HELP_DBL2LONG get
// patched for CPUs that support SSE2 (P4 and above).

Outdated comment

…the original order (Sacrifices perf, but avoid needinga bunch of optimizer changes)

Adjust the negate path for the X86 hand coded assembly for JIT_LDiv to load edx before jumping to JIT_LDiv_DoNegate

#if TARGET_UNIX
#ifdef __clang__
#define MUSTTAIL [[clang::musttail]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the performance concern was only for win-x86 and performance optimizations aren't critical for linux-arm and linux-x86. Moreover, it seems like MUSTTAIL is only to accelerate exception path with no effect on the actual div/mod ops? If this is the case, then I think we can just drop this optimization even if it was for win-x86 because perf concern was about the div/mod (non-exception) happy path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MUSTTAIL logic here isn't an optimization as without HMFs we can't through an exception from an FCALL, but I agree, it doesn't look like it has worked out as a reasonable way to make this work simple. My current plan is to keep the work I've done to port the Windows X86 helpers to handcoded assembly, and merge it with @am11's work that is more general purpose, and then call it good.

@am11
Copy link
Member

am11 commented Mar 24, 2025

There was also a pure managed experiment somewhere in the middle using generic math with no FCall that was passing all tests:

private static (T quotient, T remainder) DivMod<T, U>(T dividend, T divisor)
where T : INumber<T>, IMinMaxValue<T>
where U : INumber<U>, IMinMaxValue<U>, IShiftOperators<U, int, U>, IBitwiseOperators<U, U, U>
{
bool dividendIsNegative = false;
bool divisorIsNegative = false;
// Handle signs if T is signed
if (typeof(T) != typeof(U))
{
if (dividend < T.Zero)
{
dividend = -dividend;
dividendIsNegative = true;
}
if (divisor < T.Zero)
{
divisor = -divisor;
divisorIsNegative = true;
}
}
// Perform unsigned division using type U
U uDividend = U.CreateTruncating(dividend);
U uDivisor = U.CreateTruncating(divisor);
U bit = U.One;
U uQuotient = U.Zero;
int mask = typeof(U) == typeof(uint) ? 31 : 63;
// Align divisor with dividend
while (uDivisor < uDividend && bit != U.Zero && U.IsZero(uDivisor & (U.One << mask)))
{
uDivisor <<= 1;
bit <<= 1;
}
// Perform the division
while (bit > U.Zero)
{
if (uDividend >= uDivisor)
{
uDividend -= uDivisor;
uQuotient |= bit;
}
bit >>= 1;
uDivisor >>= 1;
}
// Convert results back to type T
T tQuotient = T.CreateTruncating(uQuotient);
T tRemainder = T.CreateTruncating(uDividend);
// Adjust signs if T is signed
if (typeof(T) != typeof(U))
{
if (dividendIsNegative)
tRemainder = -tRemainder;
if (dividendIsNegative ^ divisorIsNegative)
tQuotient = -tQuotient;
}
return (tQuotient, tRemainder);
}
(those InternalCalls after line 243 are unused)

It was quite slow EgorBot/runtime-utils#193. (was probably dreaming that it could outperform the native implementation 🥲)

The FCall version you've applied was performing much better EgorBot/runtime-utils#191, which is fine for 32-bit linux.

[MethodImpl(MethodImplOptions.InternalCall)]
private static extern uint DivUInt32Internal(uint dividend, uint divisor);

#if NATIVEAOT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be same between NAOT and regular CoreCLR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AOT could use InternalCall (with revert of RuntimeHelpers.cs file). There was a slight regression with QCall on coreclr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my point is that there is no reason for two to be different.

(I do not have a strong opinion on FCall vs. QCall here as long as it is the same.)

Copy link
Member

@am11 am11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number are looking promising EgorBot/runtime-utils#336 (comment) (0.55) 🚀

@@ -1,6 +1,7 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: undo

#else // TARGET_64BIT
DYNAMICJITHELPER(CORINFO_HELP_LMUL_OVF, NULL, METHOD__NIL)
DYNAMICJITHELPER(CORINFO_HELP_ULMUL_OVF, NULL, METHOD__NIL)
JITHELPER(CORINFO_HELP_LDIV, NULL, METHOD__NIL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padding for the METHOD__NIL column is off.

DYNAMICJITHELPER(CORINFO_HELP_ULMOD, NULL, METHOD__MATH__MOD_UINT64)
#endif // TARGET_X86 && TARGET_WINDOWS
#else // TARGET_64BIT
DYNAMICJITHELPER(CORINFO_HELP_LMUL_OVF, NULL, METHOD__NIL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CORINFO_HELP_LMUL/ULMUL_OVF are unused on 64-bit, I would expect them to be JITHELPER.

If they are used, I would expect them to look like the 32-bit ones and be outside the #ifndef TARGET_64BIT ifdef.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@am11
Copy link
Member

am11 commented Apr 4, 2025

Thanks for this @davidwrighton!

For the remaining alloc-like HMFs, I think we can either use managed RuntimeTypeHandle_InternalAlloc{NoChecks} or the assembly ones in AllocFast.{asm,S} (+ the FOH):

newobj = TryAllocateFrozenObject(pMT);

cc @AaronRobinsonMSFT

@davidwrighton davidwrighton merged commit 7aa0868 into dotnet:main Apr 4, 2025
154 of 159 checks passed
@akoeplinger
Copy link
Member

@davidwrighton @am11 this causes an issue on the runtime sdk codeflow on win-x86: dotnet/sdk#48034

finalizer.obj : error LNK2001: unresolved external symbol RhpDbl2Int [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol RhpDbl2Lng [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol RhpLng2Dbl [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol RhpULng2Dbl [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol ceil [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol cos [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol floor [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol modf [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol pow [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol sin [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
finalizer.obj : error LNK2001: unresolved external symbol tan [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]
D:\a\_work\1\vmr\src\sdk\artifacts\bin\finalizer\Release\net10.0-windows\native\finalizer.exe : fatal error LNK1120: 11 unresolved externals [D:\a\_work\1\vmr\src\sdk\src\Layout\finalizer\finalizer.csproj]

@am11
Copy link
Member

am11 commented Apr 6, 2025

Ah, nvm, it is also failing in AOT outerloop: https://github.com/dotnet/runtime/runs/40052903604. Although we haven't touched these exports. 👀

@akoeplinger
Copy link
Member

Ok thanks, I'll prepare a revert since this is blocking the codeflow.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants