JIT: Emit mulx for GT_MULHI and GT_MUL_LONG if BMI2 is available #116198


Merged: 24 commits into dotnet:main, Jun 20, 2025

Conversation

@Daniel-Svensson (Contributor) commented on Jun 1, 2025:

Summary

Overview:

  • Allows the JIT to emit MULX for GT_MULHI (used when dividing by a constant) and for GT_MUL_LONG
    • Using mulx should allow more flexible register allocation
  • Fixes the containment check for GT_MUL_LONG on x86 (it never succeeded due to mismatched register sizes)

Minor changes:

  • Use the IsUnsigned() helper instead of bit manipulation, for clarity
  • During lowering (containment check), place any memory operand as "op2" for multiply to allow simpler code (see the sketch below)
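
A minimal sketch of that lowering-time swap (illustrative only; not the actual lowerxarch.cpp change, and the direct field access is simplified):

// Sketch: if a contained memory operand ended up as op1 of a commutative multiply,
// swap the operands so codegen only ever has to handle a memory operand in op2.
void PlaceMemoryOperandAsOp2(GenTreeOp* mul)
{
    GenTree* op1 = mul->gtGetOp1();
    GenTree* op2 = mul->gtGetOp2();

    if (op1->isContained() && !op2->isContained())
    {
        mul->gtOp1 = op2;
        mul->gtOp2 = op1;
    }
}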

Copilot Summary


Enhancements for BMI2 Instruction Set:

  • Added support for the MULX instruction when BMI2 is available, enabling efficient unsigned multiplication without implicitly clobbering RDX. This includes logic to handle operand placement and instruction emission for both GT_MULHI and GT_MUL_LONG. (src/coreclr/jit/codegenxarch.cpp)

Operand and Memory Containment Logic:

  • Updated ContainCheckMul to adjust operand types (nodeType) for GT_MUL_LONG and ensure safe containment of memory operands.
  • Added operand swapping to guarantee contained memory operands are always op2. (src/coreclr/jit/lowerxarch.cpp)

Register Allocation and Kill Set Updates:

  • Modified LinearScan::getKillSetForMul to avoid killing RDX when using MULX, and added logic to differentiate between the base instructions and the BMI2-specific instructions (see the sketch below). (src/coreclr/jit/lsrabuild.cpp)
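
A minimal sketch of the decision this bullet describes (illustrative only; names, masks, and structure are simplified and are not taken from lsrabuild.cpp):

// Sketch: widening mul/imul implicitly define edx:eax, so both registers belong in the
// kill set. mulx writes explicit destination registers and only reads RDX as its implicit
// source, so an unsigned multiply that will emit mulx does not need to kill RDX.
regMaskTP GetKillSetForMulSketch(Compiler* compiler, GenTreeOp* mulNode)
{
    bool useMulx = mulNode->IsUnsigned() &&
                   compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2);

    return useMulx ? RBM_NONE : (RBM_RAX | RBM_RDX); // x64-style mask names for brevity
}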

General Refactoring:

  • Simplified flag checks by replacing bitwise operations with IsUnsigned() method calls for clarity and consistency. (src/coreclr/jit/lowerxarch.cpp; src/coreclr/jit/lsraxarch.cpp)
  • Adjusted BuildMul to account for BMI2-specific operand handling and register constraints, ensuring proper use of implicit registers like RAX and RDX. (src/coreclr/jit/lsraxarch.cpp)

These changes enhance the compiler's efficiency and maintainability, particularly for architectures supporting BMI2, while ensuring correctness in operand handling and memory containment.

Code generation examples

Simple

static unsafe ulong TestBigMulINT2(uint* arr, uint b)
{
    return Math.BigMul(b, arr[0]) + Math.BigMul(b, arr[1]);
}

With BMI2


; Method Program:<<Main>$>g__TestBigMulINT2|0_15(uint,uint):ulong (FullOpts)
G_M34028_IG01:  ;; offset=0x0000
       push     esi
       sub      esp, 16
						;; size=4 bbWeight=1 PerfScore 1.25

G_M34028_IG02:  ;; offset=0x0004
       mulx     eax, esi, dword ptr [ecx]
       mulx     edx, ecx, dword ptr [ecx+0x04]
       add      eax, edx
       mov      edx, esi
       adc      edx, ecx
						;; size=17 bbWeight=1 PerfScore 13.00

G_M34028_IG03:  ;; offset=0x0015
       add      esp, 16
       pop      esi
       ret      
						;; size=5 bbWeight=1 PerfScore 1.75
; Total bytes of code: 26

Without BMI2


; Method Program:<<Main>$>g__TestBigMulINT2|0_15(uint,uint):ulong (FullOpts)
G_M34028_IG01:  ;; offset=0x0000
       push     edi
       push     esi
       push     ebx
       sub      esp, 16
       mov      esi, edx
						;; size=8 bbWeight=1 PerfScore 3.50

G_M34028_IG02:  ;; offset=0x0008
       mov      eax, esi
       mul      edx:eax, dword ptr [ecx]
       mov      edi, eax
       mov      ebx, edx
       mov      eax, esi
       mul      edx:eax, dword ptr [ecx+0x04]
       add      eax, edi
       adc      edx, ebx
						;; size=17 bbWeight=1 PerfScore 13.75

G_M34028_IG03:  ;; offset=0x0019
       add      esp, 16
       pop      ebx
       pop      esi
       pop      edi
       ret      
						;; size=7 bbWeight=1 PerfScore 2.75
; Total bytes of code: 32

BigMul

The following code is generated for this Math.BigMul variant:

static ulong BigMul(ulong a, uint b, out ulong low)
{
    ulong prodL = ((ulong)(uint)a) * b;
    ulong prodH = (prodL >> 32) + (((ulong)(uint)(a >> 32)) * b);

    low = ((prodH << 32) | (uint)prodL);
    return (prodH >> 32);
}
Codegen with BMI2 (the BMI2 version needs fewer push/pop instructions thanks to better argument register usage):

; Method Program:<<Main>$>g__BigMul|0_14(ulong,uint,byref):ulong (FullOpts)
G_M22501_IG01:  ;; offset=0x0000
       push     esi
       sub      esp, 24
       mov      bword ptr [esp], edx
       mov      eax, ecx
						;; size=9 bbWeight=1 PerfScore 2.50

G_M22501_IG02:  ;; offset=0x0009
       mov      edx, eax
       mulx     esi, ecx, dword ptr [esp+0x20]
       mov      edx, eax
       mulx     edx, eax, dword ptr [esp+0x24]
       add      eax, esi
       adc      edx, 0
       mov      dword ptr [esp+0x0C], edx
       mov      esi, bword ptr [esp]
       mov      dword ptr [esi], ecx
       mov      dword ptr [esi+0x04], eax
       xor      edx, edx
       mov      eax, dword ptr [esp+0x0C]
						;; size=41 bbWeight=1 PerfScore 16.50

G_M22501_IG03:  ;; offset=0x0032
       add      esp, 24
       pop      esi
       ret      8
						;; size=7 bbWeight=1 PerfScore 2.75
; Total bytes of code: 57
Codegen without BMI2:
; Method Program:<<Main>$>g__BigMul|0_12(ulong,uint,byref):ulong (FullOpts)
G_M59235_IG01:  ;; offset=0x0000
       push     edi
       push     esi
       push     ebx
       sub      esp, 20
       mov      esi, edx
						;; size=8 bbWeight=1 PerfScore 3.50

G_M59235_IG02:  ;; offset=0x0008
       mov      eax, ecx
       mul      edx:eax, dword ptr [esp+0x24]
       mov      edi, eax
       mov      ebx, edx
       mov      eax, ecx
       mul      edx:eax, dword ptr [esp+0x28]
       add      eax, ebx
       adc      edx, 0
       mov      dword ptr [esp+0x08], edx
       mov      dword ptr [esi], edi
       mov      dword ptr [esi+0x04], eax
       xor      edx, edx
       mov      eax, dword ptr [esp+0x08]
						;; size=36 bbWeight=1 PerfScore 16.00

G_M59235_IG03:  ;; offset=0x002C
       add      esp, 20
       pop      ebx
       pop      esi
       pop      edi
       ret      8
						;; size=9 bbWeight=1 PerfScore 3.75
; Total bytes of code: 53

.NET 9 baseline codegen (captured from LINQPad):

L0000: push     edi
L0001: push     esi
L0002: push     ebx
L0003: sub      esp, 0x14
L0006: mov      esi, edx

L0008: mov      eax, [esp+0x24]
L000c: mul      ecx
L000e: mov      edi, eax
L0010: mov      ebx, edx
L0012: mov      eax, [esp+0x28]
L0016: mul      ecx
L0018: add      eax, ebx
L001a: adc      edx, 0
L001d: mov      [esp+8], edx
L0021: or       edi, 0
L0024: or       eax, 0
L0027: mov      [esi], edi
L0029: mov      [esi+4], eax
L002c: xor      edx, edx
L002e: mov      eax, [esp+8]

L0032: add      esp, 0x14
L0035: pop      ebx
L0036: pop      esi
L0037: pop      edi
L0038: ret      8

Division by constant

uint TEstDiv2(uint a)
{
    return a / 10;
}

With BMI2, the following is generated (x86; same behaviour for ulong on x64):

; Method Program:<<Main>$>g__TEstDiv2|0_14(uint):uint (FullOpts)
G_M15534_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M15534_IG02:  ;; offset=0x0000
       mov      edx, 0xCCCCCCCD
       mulx     eax, eax, ecx
       shr      eax, 3
						;; size=13 bbWeight=1 PerfScore 3.75

G_M15534_IG03:  ;; offset=0x000D
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 14

Instead of

; Method Program:<<Main>$>g__TEstDiv2|0_14(uint):uint (FullOpts)
G_M15534_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M15534_IG02:  ;; offset=0x0000
       mov      edx, 0xCCCCCCCD
       mov      eax, ecx
       mul      edx:eax, edx
       mov      eax, edx
       shr      eax, 3
						;; size=14 bbWeight=1 PerfScore 4.25

G_M15534_IG03:  ;; offset=0x000E
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 15
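
For reference (not part of the PR), both sequences use the standard reciprocal-multiplication trick: 0xCCCCCCCD = ceil(2^35 / 10), the widening multiply leaves the high 32 bits of the 64-bit product in a register, and the final shr by 3 completes the division by 2^35, so that

\left\lfloor \frac{a \cdot \mathtt{0xCCCCCCCD}}{2^{35}} \right\rfloor \;=\; \left\lfloor \frac{a}{10} \right\rfloor \quad \text{for all } 0 \le a < 2^{32}.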

The github-actions bot added the area-CodeGen-coreclr label on Jun 1, 2025.
The dotnet-policy-service bot added the community-contribution label on Jun 1, 2025.
// In lowering, we place any memory operand in op2 so we default to placing op1 in RDX
// By selecting RDX here we don't have to kill it
srcCount = BuildOperandUses(op1, SRBM_RDX);
srcCount += BuildOperandUses(op2, RBM_NONE);
@Daniel-Svensson (Contributor, Author) commented on Jun 1, 2025:

This code is heavily inspired by how MultiplyNoFlags is implemented.

Is it safe to not have RDX killed if SRBM_RDX is specified as the register here?

I hope this produces slightly better code than always killing RDX and allowing any register, since RDX can then be reused.

@Daniel-Svensson (Contributor, Author):

I just found #10196 and the code comment in:

regMaskTP LinearScan::getKillSetForHWIntrinsic(GenTreeHWIntrinsic* node)
{
    regMaskTP killMask = RBM_NONE;
#ifdef TARGET_XARCH
    switch (node->GetHWIntrinsicId())
    {
        case NI_X86Base_MaskMove:
            // maskmovdqu uses edi as the implicit address register.
            // Although it is set as the srcCandidate on the address, if there is also a fixed
            // assignment for the definition of the address, resolveConflictingDefAndUse() may
            // change the register assignment on the def or use of a tree temp (SDSU) when there
            // is a conflict, and the FixedRef on edi won't be sufficient to ensure that another
            // Interval will not be allocated there.
            // Issue #17674 tracks this.
            killMask = RBM_EDI;
            break;

Is that still an issue? (NI_AVX2_MultiplyNoFlags does not do anything similar and still seems to work)

    case GT_MULHI:
    {
        // MUL and IMUL are RMW, but mulx is not (mulx is used for unsigned operands when BMI2 is available)
        return !(tree->IsUnsigned() && compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2));
@Daniel-Svensson (Contributor, Author):

I was thinking about extracting a helper method for determining whether a multiply node should emit mulx, since

tree->OperGet() != GT_MUL && isUnsignedMultiply && compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2)

is used in a few places, but I did not know where to place such a helper, so I did not do it. (A possible shape is sketched below.)
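
A hypothetical shape for such a helper (illustrative only; the name and placement are assumptions, not code from this PR):

// Hypothetical helper centralizing the repeated condition quoted above:
// mulx only applies to the widening forms (GT_MULHI / GT_MUL_LONG), only to
// unsigned multiplies, and only when BMI2 is available.
bool ShouldEmitMulxForMultiply(Compiler* compiler, GenTree* tree)
{
    return !tree->OperIs(GT_MUL) && tree->IsUnsigned() &&
           compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2);
}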

@Daniel-Svensson (Contributor, Author):

Would it make sense to look at using mulx when the operands are signed but proven to be non-negative?

If so, where would such a helper belong, and do you have a suggestion for the name? Perhaps shouldEmitMulxForMultiplication?

@Daniel-Svensson Daniel-Svensson marked this pull request as ready for review June 1, 2025 21:48
* use OperIs()
* replace Intructionset_BMI2 => InstructionSetAVX2
@Daniel-Svensson (Contributor, Author) commented:

@JulieLeeMSFT, @jakobbotsch, it seems the dotnet-policy-service tags you on new PRs for this area; should I have mentioned you, since it did not mention anyone?

@jakobbotsch (Member) commented:

Not sure why it didn't happen here; let me ping the rest of the team.

cc @dotnet/jit-contrib

The diffs for the PR look quite mixed. Is it expected?

@Daniel-Svensson (Contributor, Author) commented:

@jakobbotsch

The diffs for the PR look quite mixed. Is it expected?

I cannot access the diffs, so it is hard to tell.
Would you be able to post a screenshot?

Some observations from the diffs when I did the work:

  • Code size diffs depend heavily on register pressure.
    • mul has a smaller encoding than mulx, so roughly two movs must be removed before code size decreases.
    • The real gains (code size and perf) come when stack spills / memory reads and writes are avoided.
    • I expect some size regressions, but no performance regressions (unless caused by code alignment).
  • Register assignment can become very different, which by itself causes diffs (at least large textual ones).
    • Even with more registers usable, I believe I had a case where it spilled to memory even though temp registers such as r11 were available. (That was many weeks ago, related to my other mulx PR, but the principle would be the same.)
    • I do not remember whether it was rax or rdx, but without mulx the variable was placed in some other register.

@jakobbotsch (Member) commented:

I cannot access the diffs, so it is hard to tell.

What error do you get? You should be able to see the diffs just fine. E.g. I can see them even in incognito.

@Daniel-Svensson (Contributor, Author) commented:

What error do you get? You should be able to see the diffs just fine. E.g. I can see them even in incognito.

Thank you for the incognito tip; I was sent to my Microsoft account login and then got forbidden access (probably since I often use DevOps for other projects).

The diffs do look worse than expected; register allocation seems to run into problems when fixed registers are defined for uses.
I am a bit surprised by the result for the following code (I would have expected it to maybe use another temp register for storing).

My assumption is that the stack spill / regression below is caused by not having RDX killed and by using BuildOperandUses(op1, SRBM_RDX).

  • It seems like just killing RDX and not fixing the input register might be less problematic.
    I will push some changes and see what happens.
static long mul2s(int a, int b)
{
    return (long)a * (long)b;
}

generates

; Method Program:<<Main>$>g__mul2|0_13(uint,uint):ulong (FullOpts)
G_M14403_IG01:  ;; offset=0x0000
       sub      esp, 12
       mov      dword ptr [esp+0x08], edx
						;; size=7 bbWeight=1 PerfScore 1.25

G_M14403_IG02:  ;; offset=0x0007
       mov      edx, ecx
       mulx     edx, eax, dword ptr [esp+0x08]
						;; size=9 bbWeight=1 PerfScore 5.25

G_M14403_IG03:  ;; offset=0x0010
       add      esp, 12
       ret      
						;; size=4 bbWeight=1 PerfScore 1.25
; Total bytes of code: 20

Full JITDUMP can be found here

@Daniel-Svensson (Contributor, Author) commented on Jun 14, 2025:

@jakobbotsch
I switched to killing RDX instead of specifying it as a fixed register on the use, and the new diffs look much more like what I expected.

There are mostly perf and size improvements, but there are a few regressions that I had not expected, such as System.Decimal+DecCalc:DecDivMod1E9 (example diff under benchmarks.run.linux.x64.checked.mch).

It seems a bit unintuitive to me, but it looks like it spills a variable to the stack just because different registers are allocated (it uses rax in a different way since it is no longer killed by mul).

Is there any change I should make to this PR, or is it something more general? Perhaps it would be better to use volatile registers such as r9 to store data instead of spilling to the stack.

  public static uint DecDivMod1E9(ref DecCalc value)
  {
      ulong high64 = ((ulong)value.uhi << 32) + value.umid;
      ulong div64 = high64 / TenToPowerNine;
      value.uhi = (uint)(div64 >> 32);
      value.umid = (uint)div64;

      ulong num = ((high64 - (uint)div64 * TenToPowerNine) << 32) + value.ulo;
      uint div = (uint)(num / TenToPowerNine);
      value.ulo = div;
      return (uint)num - div * TenToPowerNine;
  }

UPDATE: I think I may have found it; rax seemed to be a fixed register used by another operation during lowering of division by constant.

@Daniel-Svensson (Contributor, Author) commented on Jun 15, 2025:

Update: I think I fixed the rax spill issue, and the diffs now look as expected for x64.

There are a few more regressions on x86 than expected (even though the total is an improvement), but the few I looked at seem more likely to be caused by different register usage.

For example, System.Numerics.BigIntegerCalculator:Multiply(System.ReadOnlySpan`1[uint],uint,System.Span`1[uint]) (FullOpts), which during crossgen emits mul against memory instead of a temp, seems to end up with additional register usage and movs instead of fewer.
Maybe the same happens to System.Number:<NumberToBigInteger>g__MultiplyAdd|

if (mulNode->IsUnsigned() && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX2))
{
    // If one operand is used from memory, we define a fixed RDX register for the use, so we don't need to kill it.
    if (mulNode->gtGetOp1()->isUsedFromMemory() || mulNode->gtGetOp2()->isUsedFromMemory())
Member:

It's not ok to use isUsedFromMemory during LSRA, only in codegen. We only know for sure after LSRA, due to the spill-temps case it handles.

You can check for the contained memory op case though.

@Daniel-Svensson (Contributor, Author):

I switched to isContained() both here and in LinearScan::BuildMul, roughly the shape sketched below.
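
An illustrative sketch of the adjusted check (simplified; not the exact code from the PR):

// During LSRA we can only rely on containment: whether an operand is ultimately
// used from memory (e.g. a spill temp) is not known until after register allocation.
if (mulNode->gtGetOp1()->isContained() || mulNode->gtGetOp2()->isContained())
{
    // handle the contained (memory) operand case
}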

@Daniel-Svensson (Contributor, Author) commented on Jun 16, 2025:

That did make a nice difference on x86; it went from a size regression to a size improvement: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1069175&view=ms.vss-build-web.run-extensions-tab

However, I had hoped that the containment support for BigMul would give a larger improvement.
I do wonder about the following part of the diff:
Should it not have the same perfscore? Maybe that is where some of the perfscore increase for crossgen comes from.

[image: screenshot of the diff]

Contributor:

I've also noticed that containment seems to actually increase perfscore, while in reality it's most likely an improvement.

Member:

It comes down to the insThroughput considerations being different between them, so the costing returned by getInsExecutionCharacteristics isn't quite correct.

The general issue here is rather that we're effectively modeling it as a single uop, when in actuality a contained load/embed is an additional uop on top.

So the standalone mov eax, dword ptr [ecx+0x04] is going to be throughput: 2x, latency: PERFSCORE_LATENCY_RD_*, and the imul is going to be throughput: 1x, latency: 3C, while the contained form is going to be throughput: 1x, latency: 3C + PERFSCORE_LATENCY_RD_*.

However, in practice the load portion is still throughput: 2x and can be pipelined, since it's decomposed into its own uop. It just forms part of the dependency chain directly with the subsequent instruction.

We could probably move the PERFSCORE_LATENCY_RD_* handling "up" so that we can reasonably handle contained loads while still accurately tracking the throughput.

CC @AndyAyersMS in case he has any alternative ideas or input.

@jakobbotsch (Member) left a review comment:

LGTM.
@tannergooding can you take a look as well?

{
containedMemOp = op2;
assert(!(op1->isContained() && !op1->IsCnsIntOrI()) || !(op2->isContained() && !op2->IsCnsIntOrI()));
srcCount = BuildRMWUses(tree, op1, op2, RBM_NONE, RBM_NONE);
@tannergooding (Member) commented on Jun 18, 2025:

nit: This doesn't actually have to be RMW for imul reg1, reg2/mem, imm8/16/32 (which does reg1 = reg2/mem * imm)

It is only that way for imul reg1, reg2/mem (which does reg1 *= reg2/mem) and for imul reg1/mem (which does edx:eax = eax * reg1/mem).

@Daniel-Svensson (Contributor, Author):

I switched back to BuildBinaryUses, so it should behave exactly as before this PR for mul/imul

@tannergooding (Member) left a review:

LGTM. A couple of small typos and a nit about imul with an 8-, 16-, or 32-bit sign-extended immediate not being RMW.

@tannergooding merged commit ede0118 into dotnet:main on Jun 20, 2025.
112 of 114 checks passed.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
4 participants