Allow the vector and scalar code paths for min, max, and fma APIs to share logic #116804
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Force-pushed from dca4623 to 4eb60ff
Force-pushed from 4eb60ff to 2ef03f8
Force-pushed from 682ef24 to 0ac1e3f
Pull Request Overview
This PR consolidates the vector and scalar code paths for math intrinsics (min, max, and FMA) by sharing common logic across different platforms and updating the intrinsic lowering functions accordingly. Key changes include:
- Removal of duplicate intrinsic case blocks in ValueNumStore and selective compilation using TARGET_RISCV64.
- Renaming and refactoring of the FMA lowering method from LowerFusedMultiplyAdd to LowerFusedMultiplyOp.
- Updates to intrinsic lists and helper APIs to support unified handling for min/max variants (the scalar semantics these variants must agree on are sketched below).
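As background for the variants this list refers to, here is a minimal self-contained C++ sketch (my own reference code, not from the PR; function names are hypothetical) of the scalar semantics the shared paths must preserve, namely NaN propagation vs. NaN filtering and the ordering of signed zeros:

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// IEEE 754 "maximum": NaN-propagating, and +0.0 beats -0.0.
double MaxIeee(double x, double y)
{
    if (std::isnan(x) || std::isnan(y))
        return std::numeric_limits<double>::quiet_NaN();
    if (x == y)
        return std::signbit(x) ? y : x; // distinguishes -0.0 from +0.0
    return (x > y) ? x : y;
}

// IEEE 754 "maximumNumber": NaN-filtering (a lone NaN input is ignored).
double MaxNumber(double x, double y)
{
    if (std::isnan(x))
        return y;
    if (std::isnan(y))
        return x;
    return MaxIeee(x, y);
}

int main()
{
    std::printf("%f\n", MaxIeee(-0.0, +0.0));          // 0.000000 (+0.0 wins)
    std::printf("%f\n", MaxNumber(std::nan(""), 1.0)); // 1.000000 (NaN filtered)
}
```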
Reviewed Changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated no comments.
| File | Description |
| --- | --- |
| src/coreclr/jit/valuenum.cpp | Removed redundant intrinsic cases for max/min and encapsulated platform-specific handling. |
| src/coreclr/jit/lowerxarch.cpp | Renamed and refactored the FMA lowering routine to share logic among different intrinsics. |
| src/coreclr/jit/namedintrinsiclist.h | Added new intrinsic entries (e.g. NI_System_Math_MaxNative/MinNative) for unified API support; the assumed "native" semantics are sketched after this table. |
| src/coreclr/jit/hwintrinsiclistxarch.h | Updated intrinsic declarations to include new min/max variants. |
| src/coreclr/jit/compiler.h | Consolidated helper API signatures for SIMD min/max nodes into unified functions. |
| Other files | Adjusted intrinsic processing and lowering logic across xarch, arm64, and codegen modules. |
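For the NI_System_Math_MaxNative/MinNative entries above, here is a hypothetical scalar illustration of what a "native" variant can legitimately return. The exact contract is an assumption on my part, but x86's MAXSD instruction computes `x > y ? x : y`, so NaN and signed-zero handling follow the compare rather than IEEE 754 rules:

```cpp
#include <cmath>
#include <cstdio>

// Assumed model of a "native" max: whatever the target's plain max
// instruction does. Mirrors x86 MAXSD, where the compare is false for
// NaN inputs and for -0.0 vs +0.0, so the second operand wins.
double MaxNativeX86Like(double x, double y)
{
    return (x > y) ? x : y;
}

int main()
{
    std::printf("%f\n", MaxNativeX86Like(-0.0, +0.0));        // +0.0 (second operand)
    std::printf("%f\n", MaxNativeX86Like(1.0, std::nan(""))); // nan  (second operand)
}
```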
Comments suppressed due to low confidence (3)
src/coreclr/jit/lowerxarch.cpp:1386

- Renaming the intrinsic lowering method to LowerFusedMultiplyOp improves clarity; please ensure that all call sites and related documentation are updated accordingly.

```cpp
void Lowering::LowerFusedMultiplyOp(GenTreeHWIntrinsic* node)
```
src/coreclr/jit/compiler.h:3338

- With the introduction of unified APIs for min/max intrinsics, please confirm that all consumers of the previous gtNewSimdMaxNode/gtNewSimdMinNativeNode functions have been updated to use the new unified APIs.

```cpp
bool isNumber);
```
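As a side note on the LowerFusedMultiplyOp rename flagged above: one plausible reading is that a single routine can serve the whole fused-multiply family, since each variant is an FMA with some operands negated. A scalar C++ model of that identity (my sketch, not the JIT code):

```cpp
#include <cmath>
#include <cstdio>

// One routine covers the family by tracking two negation bits:
//   negateProduct=false, negateAddend=false:  a * b + c  (fmadd)
//   negateProduct=false, negateAddend=true:   a * b - c  (fmsub)
//   negateProduct=true,  negateAddend=false: -a * b + c  (fnmadd)
//   negateProduct=true,  negateAddend=true:  -a * b - c  (fnmsub)
double FusedMultiplyOp(double a, double b, double c, bool negateProduct, bool negateAddend)
{
    return std::fma(negateProduct ? -a : a, b, negateAddend ? -c : c);
}

int main()
{
    std::printf("%f\n", FusedMultiplyOp(2.0, 3.0, 1.0, false, true)); // 5.000000
}
```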
CC @dotnet/jit-contrib, @jakobbotsch, @kunalspathak for review. Diffs are very positive. We see -77.3k bytes for Windows Arm64 in full opts due to the significantly more compact codegen compared to the naive inlined version. There is a +11.6k regression in minopts, which comes from expanding the intrinsics instead of emitting calls. The diffs for x64 are much smaller since we were already accelerating most scalar scenarios. There are some opportunities to improve the codegen further, but that would involve more work than this change, which mostly centralizes the existing logic. Throughput ranges from a slight improvement (
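For context, this is roughly the shape of the "naive inlined version" referenced above (an assumed pattern, for illustration; the helper name is mine). It is correct but costs four branches per call site, which is what the centralized branchless expansion replaces:

```cpp
#include <cmath>

// Assumed shape of the branchy scalar expansion: NaN propagation,
// signed-zero handling, and ordering each take a branch.
double MaxBranchy(double x, double y)
{
    if (std::isnan(x))
        return x; // propagate NaN
    if (std::isnan(y))
        return y;
    if (x == y)
        return std::signbit(x) ? y : x; // prefer +0.0 over -0.0
    return (x > y) ? x : y;
}
```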
Looks like it generated code for AVX2 rather than AVX512 (likely just a difference in Azure machine allocation):

```diff
; Windows x64 - System.MathBenchmarks.Single:MaxTest() (Instrumented Tier0)
  vaddsd      xmm0, xmm0, xmm2
+ vmovaps     xmm5, xmm3
+ vmovaps     xmm16, xmm0
+ vrangesd    xmm17, xmm5, xmm16, 5
+ vfixupimmsd xmm5, xmm16, xmm4, 0
+ vfixupimmsd xmm17, xmm5, xmm4, 0
  vaddsd      xmm6, xmm17, xmm6
```

```diff
; Linux x64 - System.MathBenchmarks.Single:MinTest() (FullOpts)
+ vsubss    xmm1, xmm1, xmm2
+ vmovaps   xmm4, xmm3
+ vmovaps   xmm5, xmm1
+ vcmpps    xmm6, xmm4, xmm5, 0
+ vxorps    xmm7, xmm7, xmm7
+ vpcmpgtd  xmm7, xmm7, xmm4
+ vandps    xmm6, xmm7, xmm6
+ vcmpps    xmm7, xmm4, xmm4, 4
+ vorps     xmm6, xmm7, xmm6
+ vcmpps    xmm7, xmm4, xmm5, 1
+ vorps     xmm6, xmm7, xmm6
+ vblendvps xmm4, xmm5, xmm4, xmm6
+ vaddss    xmm0, xmm4, xmm0
```

The code is still much better than the previous pattern; it's just larger than the naive code with 4+ branches. This also means more call sites are inlined, whereas previously they were calls.
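To make the Linux x64 sequence easier to follow, here is my reading of it as scalar C++ (a sketch, not generated from the JIT): the three "take x" conditions are folded into one mask and the final select maps to vblendvps, so there are no branches:

```cpp
#include <cmath>

// Branch-free min matching the compare/blend sequence above.
double MinBranchless(double x, double y)
{
    bool takeX = (x < y)                        // ordinary ordering    (vcmpps ..., 1)
              || std::isnan(x)                  // propagate NaN from x (vcmpps ..., 4)
              || ((x == y) && std::signbit(x)); // min(-0.0, +0.0) is -0.0 (vcmpps ..., 0 + vpcmpgtd sign test)
    return takeX ? x : y; // emitted as vblendvps, not a jump; a NaN in y flows through the else side
}
```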
/azp run Antigen, Fuzzlyn
Azure Pipelines successfully started running 2 pipeline(s).
LGTM as long as Antigen/Fuzzlyn is green.
/azp run Antigen, Fuzzlyn
Azure Pipelines successfully started running 2 pipeline(s).
This fixes #116803
This fixes #115381