JIT: Emit mulx for GT_MULHI and GT_MUL_LONG if BMI2 is available #116198
Conversation
* Fix register for mulx
* Cleanup: use GenTree::IsUnsigned helper
* Don't force op1 to implicit register if op2 is already in it
src/coreclr/jit/lsraxarch.cpp
Outdated
// In lowering, we place any memory operand in op2 so we default to placing op1 in RDX
// By selecting RDX here we don't have to kill it
srcCount = BuildOperandUses(op1, SRBM_RDX);
srcCount += BuildOperandUses(op2, RBM_NONE);
This code is heavily inspired by how MultiplyNoFlags is implemented.
Is it safe not to kill RDX if SRBM_RDX is specified as the register here?
I hope this produces slightly better code than always killing RDX and specifying any register instead (since rdx can then be reused).
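For readers unfamiliar with the instruction: mulx takes one source implicitly from RDX and writes the high and low halves of the product to two explicit destinations, without touching flags, which is what makes modeling RDX as a fixed use (rather than a kill) plausible here. A minimal C model of those semantics (the function name is made up for illustration; assumes a compiler with unsigned __int128, e.g. GCC/Clang on x64):

```c
#include <stdint.h>

/* Illustrative model of MULX: dest_hi:dest_lo = rdx * src.
   Unlike MUL, the implicit source register RDX is read but not
   clobbered, and no flags are modified. */
static void mulx_model(uint64_t rdx, uint64_t src,
                       uint64_t *dest_hi, uint64_t *dest_lo)
{
    unsigned __int128 product = (unsigned __int128)rdx * src;
    *dest_hi = (uint64_t)(product >> 64);
    *dest_lo = (uint64_t)product;
}
```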
I just found #10196
and the code comment in
runtime/src/coreclr/jit/lsrabuild.cpp
Lines 956 to 971 in 9703a4c
regMaskTP LinearScan::getKillSetForHWIntrinsic(GenTreeHWIntrinsic* node)
{
    regMaskTP killMask = RBM_NONE;
#ifdef TARGET_XARCH
    switch (node->GetHWIntrinsicId())
    {
        case NI_X86Base_MaskMove:
            // maskmovdqu uses edi as the implicit address register.
            // Although it is set as the srcCandidate on the address, if there is also a fixed
            // assignment for the definition of the address, resolveConflictingDefAndUse() may
            // change the register assignment on the def or use of a tree temp (SDSU) when there
            // is a conflict, and the FixedRef on edi won't be sufficient to ensure that another
            // Interval will not be allocated there.
            // Issue #17674 tracks this.
            killMask = RBM_EDI;
            break;
Is that still an issue? (NI_AVX2_MultiplyNoFlags does not do anything similar and still seems to work)
src/coreclr/jit/lsraxarch.cpp
Outdated
case GT_MULHI:
{
    // MUL, IMUL are RMW but mulx is not (mulx is used for unsigned operands when BMI2 is available)
    return !(tree->IsUnsigned() && compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2));
I was thinking about extracting a helper method for determining if a multiply node should emit mulx since
tree->OperGet() != GT_MUL && isUnsignedMultiply && compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2)
is used in a few places, but I did not know where to place such a helper, so I did not do it.
Would it make sense to look at using mulx when the numbers are signed but proven to be non-negative?
If so, where would it make sense to have a helper, and do you have a suggestion for a name, e.g. shouldEmitMulxForMultiplication?
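For what it's worth, the predicate itself would be tiny; a standalone C sketch of the condition quoted above (the enum and function names here are hypothetical, not actual JIT code):

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the quoted condition:
   tree->OperGet() != GT_MUL && isUnsignedMultiply && BMI2 available.
   Plain GT_MUL stays on the regular imul path; only the wide/high
   multiplies (GT_MULHI, GT_MUL_LONG) are candidates for mulx. */
typedef enum { OPER_MUL, OPER_MULHI, OPER_MUL_LONG } mul_oper;

static bool should_emit_mulx(mul_oper oper, bool is_unsigned, bool bmi2_available)
{
    return oper != OPER_MUL && is_unsigned && bmi2_available;
}
```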
* Use OperIs()
* Replace InstructionSet_BMI2 => InstructionSet_AVX2
@JulieLeeMSFT, @jakobbotsch, it seems the dotnet-policy-service tags you on new PRs for this area; should I have mentioned you when it did not mention anyone?
Not sure why it didn't here, let me ping the rest of the team. cc @dotnet/jit-contrib. The diffs for the PR look quite mixed. Is that expected?
I cannot access the diffs, so it is hard to tell. Some observations from the diffs when I did the work:
What error do you get? You should be able to see the diffs just fine. E.g. I can see them even in incognito.
Thank you for the incognito tip. I came to my Microsoft account login and was then forbidden access (probably since I often use DevOps for other projects). The diffs do look worse than expected; register allocation seems to run into problems when defining fixed registers for uses. My assumption is that the stack spill / regression below is caused by not having rdx killed. Using

static long mul2s(int a, int b)
{
    return (long)a * (long)b;
}

generates

; Method Program:<<Main>$>g__mul2|0_13(uint,uint):ulong (FullOpts)
G_M14403_IG01:  ;; offset=0x0000
    sub esp, 12
    mov dword ptr [esp+0x08], edx
    ;; size=7 bbWeight=1 PerfScore 1.25
G_M14403_IG02:  ;; offset=0x0007
    mov edx, ecx
    mulx edx, eax, dword ptr [esp+0x08]
    ;; size=9 bbWeight=1 PerfScore 5.25
G_M14403_IG03:  ;; offset=0x0010
    add esp, 12
    ret
    ;; size=4 bbWeight=1 PerfScore 1.25
; Total bytes of code: 20

Full JITDUMP can be found here
…ded 1 op mul - some cleanup of BuildMul, reorder and remove dead code
@jakobbotsch
public static uint DecDivMod1E9(ref DecCalc value)
{
ulong high64 = ((ulong)value.uhi << 32) + value.umid;
ulong div64 = high64 / TenToPowerNine;
value.uhi = (uint)(div64 >> 32);
value.umid = (uint)div64;
ulong num = ((high64 - (uint)div64 * TenToPowerNine) << 32) + value.ulo;
uint div = (uint)(num / TenToPowerNine);
value.ulo = div;
return (uint)num - div * TenToPowerNine;
}

UPDATE: I think I may have found it; rax seemed to be a fixed register used by another operation during lowering of division by constant.
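For context on why division by a constant involves GT_MULHI at all: the JIT strength-reduces unsigned division by a constant into a multiply-high by a precomputed "magic" reciprocal followed by a shift. A minimal C sketch for 32-bit division by 10 using the classic magic constant (the 10^9 case in DecCalc uses its own constants):

```c
#include <stdint.h>

/* Sketch of the strength reduction that produces GT_MULHI nodes:
   n / 10 == mulhi32(n, 0xCCCCCCCD) >> 3 for all 32-bit unsigned n,
   where 0xCCCCCCCD = ceil(2^35 / 10). */
static uint32_t div10(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;
    uint32_t hi = (uint32_t)(product >> 32); /* the GT_MULHI part */
    return hi >> 3;
}
```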
Update: I think I fixed the rax spill issue, and now the diffs look as expected for x64. There are a few more regressions on x86 than expected (even if the total is an improvement), but the few I looked at seem more likely to be caused by different register usage. For example:
src/coreclr/jit/lsrabuild.cpp
Outdated
if (mulNode->IsUnsigned() && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX2))
{
    // If one operand is used from memory, we define a fixed RDX register use, so we don't need to kill it.
    if (mulNode->gtGetOp1()->isUsedFromMemory() || mulNode->gtGetOp2()->isUsedFromMemory())
It's not OK to use isUsedFromMemory during LSRA, only codegen. We only know for sure after LSRA due to the spill-temps case it handles.
You can check for the contained memory op case though.
I switched to isContained() both here and in LinearScan::BuildMul
That did make a nice difference to x86, it went from a size regression to a size improvement https://dev.azure.com/dnceng-public/public/_build/results?buildId=1069175&view=ms.vss-build-web.run-extensions-tab
However, I had hoped that the containment support for BigMul would make a bigger improvement.
I do wonder about the following part of the diff:
Should it not have the same perfscore? Maybe that is where some of the perfscore increase for crossgen comes from?
I've also noticed that containment seems to actually increase perfscore, while in reality it's most likely an improvement.
It comes down to the insThroughput considerations being different between them, and so the costing returned by getInsExecutionCharacteristics isn't quite correct.

The general issue here is rather that we're effectively modeling it as a single uop, when in actuality a contained load/embed is an additional uop on top. So the standalone mov eax, dword ptr [ecx+0x04] is going to be throughput: 2x, latency: PERFSCORE_LATENCY_RD_*, and the imul is going to be throughput: 1x, latency: 3C. The contained form, meanwhile, is going to be throughput: 1x, latency: 3C + PERFSCORE_LATENCY_RD_*.

However, in practice the load portion is still throughput: 2x and can be pipelined, since it's decomposed into its own uop. It just forms part of the dependency chain directly with the subsequent instruction.

We could probably move the PERFSCORE_LATENCY_RD_* handling "up" so that we can reasonably handle contained loads while still accurately tracking the throughput.

CC @AndyAyersMS in case he has any alternative ideas or input.
LGTM.
@tannergooding can you take a look as well?
src/coreclr/jit/lsraxarch.cpp
Outdated
{
    containedMemOp = op2;
    assert(!(op1->isContained() && !op1->IsCnsIntOrI()) || !(op2->isContained() && !op2->IsCnsIntOrI()));
    srcCount = BuildRMWUses(tree, op1, op2, RBM_NONE, RBM_NONE);
nit: This doesn't actually have to be RMW for imul reg1, reg2/mem, imm8/16/32 (which does reg1 = reg2/mem * imm).
It is only that way for imul reg1, reg2/mem (which does reg1 *= reg2/mem) and for imul reg1/mem (which does dx:ax = ax * reg1/mem).
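To make the nit concrete, here are C models of the three imul encodings (the function names are illustrative, not JIT code); only the three-operand immediate form writes its destination without reading it:

```c
#include <stdint.h>

/* imul r1, r2/mem, imm : r1 = r2 * imm (destination written only, not RMW) */
static uint32_t imul_rri(uint32_t src, int32_t imm)
{
    return src * (uint32_t)imm;
}

/* imul r1, r2/mem : r1 *= r2 (read-modify-write) */
static void imul_rr(uint32_t *r1, uint32_t r2)
{
    *r1 *= r2;
}

/* imul r/mem : edx:eax = eax * r/mem (implicit registers, signed widening) */
static void imul_r(uint32_t eax, uint32_t src, uint32_t *edx_out, uint32_t *eax_out)
{
    int64_t product = (int64_t)(int32_t)eax * (int32_t)src;
    *edx_out = (uint32_t)((uint64_t)product >> 32);
    *eax_out = (uint32_t)product;
}
```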
I switched back to BuildBinaryUses
, so it should behave exactly as before this PR for mul/imul
LGTM. A couple of small typos and a nit about imul with an 8, 16, or 32-bit sign-extended immediate not being RMW.
Summary

Overview:
* Emit MULX for GT_MULHI (when doing division by constant) and for GT_MUL_LONG

Minor changes:
* Use the IsUnsigned() call instead of doing bit manipulation, for clarity

Copilot Summary

Enhancements for BMI2 Instruction Set:
* Added support for emitting the MULX instruction when BMI2 is available, enabling efficient unsigned multiplication without modifying RDX. This includes logic to handle operand placement and instruction emission for both GT_MULHI and GT_MUL_LONG. (src/coreclr/jit/codegenxarch.cpp, [1] [2] [3])

Operand and Memory Containment Logic:
* Updated ContainCheckMul to adjust operand types (nodeType) for GT_MUL_LONG and ensure safe containment of memory operands in op2. (src/coreclr/jit/lowerxarch.cpp, [1] [2] [3] [4] [5] [6])

Register Allocation and Kill Set Updates:
* Modified LinearScan::getKillSetForMul to avoid killing RDX when using MULX, and added logic to differentiate between base instructions and BMI2-specific instructions. (src/coreclr/jit/lsrabuild.cpp, src/coreclr/jit/lsrabuild.cpp L785-R798)

General Refactoring:
* Simplified code with IsUnsigned() method calls for clarity and consistency. (src/coreclr/jit/lowerxarch.cpp, [1]; src/coreclr/jit/lsraxarch.cpp, [2])
* Updated BuildMul to account for BMI2-specific operand handling and register constraints, ensuring proper use of implicit registers like RAX and RDX. (src/coreclr/jit/lsraxarch.cpp, [1] [2])

These changes enhance the compiler's efficiency and maintainability, particularly for architectures supporting BMI2, while ensuring correctness in operand handling and memory containment.
Code generation examples
Simple
With BMI2
Without BMI2
BigMul
The following code is generated for the Math.BigMul method below
codegen with BMI2
The BMI2 codegen produces fewer push/pop instructions due to better argument usage
codegen without BMI2
.NET 9 (from LINQPad)
Division by constant
With BMI2 generates the following (x86, same behaviour for ulong on x64):
Instead of