Allow the vector and scalar code paths for min, max, and fma APIs to share logic #116804

Merged (11 commits) on Jun 21, 2025

Conversation

@tannergooding tannergooding (Member) commented Jun 19, 2025

This fixes #116803
This fixes #115381

@github-actions github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jun 19, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding marked this pull request as ready for review June 20, 2025 18:59
@Copilot Copilot AI review requested due to automatic review settings June 20, 2025 18:59
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR consolidates the vector and scalar code paths for math intrinsics (min, max, and FMA) by sharing common logic across different platforms and updating the intrinsic lowering functions accordingly. Key changes include:

  • Removal of duplicate intrinsic case blocks in ValueNumStore and selective compilation using TARGET_RISCV64.
  • Renaming and refactoring of the FMA lowering method from LowerFusedMultiplyAdd to LowerFusedMultiplyOp.
  • Updates to intrinsic lists and helper APIs to support unified handling for min/max variants.
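
The changes themselves live in the JIT's C++ sources, but the idea in the overview above can be made concrete with a small standalone sketch. The following is illustrative only and is not code from the PR: it shows a single routine handling both the vector and the scalar (one-lane) cases, assuming .NET Math.Max semantics (NaN propagates, and +0 is treated as greater than -0). MaxVector and MaxScalar are hypothetical names.

```cpp
// Illustrative sketch only; not the JIT's implementation.
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <limits>

// Element-wise max over N lanes: NaN propagates, and +0 beats -0 when the values compare equal.
template <typename T, std::size_t N>
std::array<T, N> MaxVector(const std::array<T, N>& a, const std::array<T, N>& b)
{
    std::array<T, N> result{};
    for (std::size_t i = 0; i < N; i++)
    {
        if (std::isnan(a[i]) || std::isnan(b[i]))
        {
            result[i] = std::numeric_limits<T>::quiet_NaN(); // NaN propagates
        }
        else if (a[i] == b[i])
        {
            // +0 == -0 compares equal; prefer the operand without the sign bit set.
            result[i] = std::signbit(a[i]) ? b[i] : a[i];
        }
        else
        {
            result[i] = (a[i] > b[i]) ? a[i] : b[i];
        }
    }
    return result;
}

// The scalar path is just the one-lane case, so it reuses the same logic.
template <typename T>
T MaxScalar(T a, T b)
{
    return MaxVector<T, 1>(std::array<T, 1>{a}, std::array<T, 1>{b})[0];
}

int main()
{
    std::printf("%f\n", MaxScalar(-0.0, 0.0));          // 0.000000 (+0 preferred over -0)
    std::printf("%f\n", MaxScalar(3.0, std::nan("")));  // nan (NaN propagates)
}
```

In the PR the sharing happens inside the JIT's lowering and node-creation helpers rather than in library code, but the shape is the same: one implementation of the edge-case handling instead of parallel scalar and vector copies.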

Reviewed Changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated no comments.

Summary per file:

src/coreclr/jit/valuenum.cpp: Removed redundant intrinsic cases for max/min and encapsulated platform-specific handling.
src/coreclr/jit/lowerxarch.cpp: Renamed and refactored the FMA lowering routine to share logic among different intrinsics.
src/coreclr/jit/namedintrinsiclist.h: Added new intrinsic entries (e.g. NI_System_Math_MaxNative/MinNative) for unified API support.
src/coreclr/jit/hwintrinsiclistxarch.h: Updated intrinsic declarations to include new min/max variants.
src/coreclr/jit/compiler.h: Consolidated helper API signatures for SIMD min/max nodes into unified functions.
Other files: Adjusted intrinsic processing and lowering logic across xarch, arm64, and codegen modules.
Comments suppressed due to low confidence (3)

src/coreclr/jit/lowerxarch.cpp:1386

  • Renaming the intrinsic lowering method to LowerFusedMultiplyOp improves clarity; please ensure that all call sites and related documentation are updated accordingly.
void Lowering::LowerFusedMultiplyOp(GenTreeHWIntrinsic* node)

src/coreclr/jit/compiler.h:3338

  • With the introduction of unified APIs for min/max intrinsics, please confirm that all consumers of the previous gtNewSimdMaxNode/gtNewSimdMinNativeNode functions have been updated to use the new unified APIs.
                                 bool        isNumber);

@tannergooding tannergooding (Member, Author) commented:

CC. @dotnet/jit-contrib, @jakobbotsch, @kunalspathak for review.

The diffs are very positive. We see -77.3k bytes for Windows Arm64 in full opts due to the significantly more compact codegen compared to the naive inlined version. There is a +11.6k regression in minopts, which comes from expanding intrinsics instead of emitting calls.

The diffs for x64 are much smaller since we were already accelerating most scalar scenarios. There are some opportunities to improve the codegen further, but that would involve more work than this change, which mostly centralizes the existing logic.

Throughput ranges from a slight improvement (-0.07% on the high end for Windows Arm64 full opts) to a slight regression (+0.04% on the high end for Windows x64 full opts).

@kunalspathak kunalspathak (Member) commented:

Do you know why linux/x64 is also seeing a regression?


@tannergooding tannergooding (Member, Author) replied:

Do you know why linux/x64 is also seeing a regression?

Looks like it generated code for AVX2 rather than AVX512 (likely just a difference in Azure machine allocation):

; Windows x64 - System.MathBenchmarks.Single:MaxTest() (Instrumented Tier0)
        vaddsd   xmm0, xmm0, xmm2
+       vmovaps  xmm5, xmm3
+       vmovaps  xmm16, xmm0
+       vrangesd xmm17, xmm5, xmm16, 5
+       vfixupimmsd xmm5, xmm16, xmm4, 0
+       vfixupimmsd xmm17, xmm5, xmm4, 0
        vaddsd   xmm6, xmm17, xmm6

; Linux x64 - System.MathBenchmarks.Single:MinTest() (FullOpts)
+       vsubss   xmm1, xmm1, xmm2
+       vmovaps  xmm4, xmm3
+       vmovaps  xmm5, xmm1
+       vcmpps   xmm6, xmm4, xmm5, 0
+       vxorps   xmm7, xmm7, xmm7
+       vpcmpgtd xmm7, xmm7, xmm4
+       vandps   xmm6, xmm7, xmm6
+       vcmpps   xmm7, xmm4, xmm4, 4
+       vorps    xmm6, xmm7, xmm6
+       vcmpps   xmm7, xmm4, xmm5, 1
+       vorps    xmm6, xmm7, xmm6
+       vblendvps xmm4, xmm5, xmm4, xmm6
+       vaddss   xmm0, xmm4, xmm0

The code is still much better than the previous pattern; it's just larger than the naive code with 4+ branches. This also means more places are inlined, whereas previously they were calls.
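
For reference, here is one reading of what that branchless AVX2 expansion computes, rendered as scalar C++. This is an editorial sketch rather than code from the PR: the blend mask selects the first operand when it is less than the second, when it is NaN, or when the two compare equal and the first carries the sign bit (so -0.0 is treated as smaller than +0.0); otherwise the second operand is selected, which also propagates a NaN in that operand.

```cpp
#include <cmath>
#include <cstdio>

// Scalar rendering of the vcmpps/vpcmpgtd/vorps/vblendvps sequence shown above (one lane).
static float MinBranchless(float a, float b)
{
    bool pickA = (a < b)                          // vcmpps ..., 1   (a < b)
              || std::isnan(a)                    // vcmpps ..., 4   (a != a)
              || ((a == b) && std::signbit(a));   // vcmpps ..., 0 combined with the sign-bit test
    return pickA ? a : b;                         // vblendvps selects per lane based on the mask
}

int main()
{
    std::printf("%f\n", MinBranchless(-0.0f, 0.0f));          // -0.000000
    std::printf("%f\n", MinBranchless(1.0f, std::nanf("")));  // nan
    std::printf("%f\n", MinBranchless(std::nanf(""), 1.0f));  // nan
}
```

The comparisons and the blend all execute unconditionally, which is why the straight-line sequence is longer than the naive compare-and-branch code even though it avoids the 4+ branches.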

@kunalspathak kunalspathak (Member) commented:

/azp run Antigen, Fuzzlyn

Azure Pipelines successfully started running 2 pipeline(s).

@kunalspathak kunalspathak (Member) left a comment

LGTM as long as Antigen/Fuzzlyn is green.

@tannergooding tannergooding (Member, Author) commented:

/azp run Antigen, Fuzzlyn

Azure Pipelines successfully started running 2 pipeline(s).

Labels
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues.

JIT: float.{Min/Max}Native generates cmp + jump instead of v{min/max}ss
Min*/Max*/Clamp with one or two constant arguments on x86/x64