[IVDescriptors] Don't require nsz/nnan for (min|max)num. #137003

Closed

Conversation

fhahn
Contributor

@fhahn fhahn commented Apr 23, 2025

We do not need to require nsz or nnan for FP reductions with minnum and maxnum: NaNs and signed zeros are handled consistently by the scalar and vector versions of minnum and maxnum, and vector.reduce.(fmin|fmax) also matches the minnum/maxnum comparison semantics.

IIUC this was also the conclusion when support for minimum/maximum was added (https://reviews.llvm.org/D151482?id=531021#inline-1481555).

Alive2 agrees that maxnum/minnum can be re-ordered (scalar version: https://alive2.llvm.org/ce/z/GVmgBX) and also verifies the vectorized code end-to-end, with a tweak to replace llvm.vector.reduce.fmax with a scalar maxnum, as Alive2 doesn't support llvm.vector.reduce.fmax: https://alive2.llvm.org/ce/z/EwJKeJ. Note that verification requires a higher timeout than is available in the online version.
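For illustration, the kind of re-ordering being verified looks roughly like this (a minimal sketch with made-up value names; the actual inputs are in the Alive2 links above):

; chain in original order: maxnum(maxnum(%acc, %x0), %x1)
%m0 = call float @llvm.maxnum.f32(float %acc, float %x0)
%m1 = call float @llvm.maxnum.f32(float %m0, float %x1)

; re-ordered chain: maxnum(%acc, maxnum(%x0, %x1))
%n0 = call float @llvm.maxnum.f32(float %x0, float %x1)
%n1 = call float @llvm.maxnum.f32(float %acc, float %n0)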

@llvmbot
Member

llvmbot commented Apr 23, 2025

@llvm/pr-subscribers-llvm-analysis

Author: Florian Hahn (fhahn)


Full diff: https://github.com/llvm/llvm-project/pull/137003.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/IVDescriptors.cpp (+6-3)
  • (modified) llvm/test/Transforms/LoopVectorize/minmax_reduction.ll (+4-2)
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index 94c347b01bbfb..4a6efd0ca821a 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -892,10 +892,13 @@ RecurrenceDescriptor::InstDesc RecurrenceDescriptor::isRecurrenceInstr(
        return true;
      if (isa<FPMathOperator>(I) && I->hasNoNaNs() && I->hasNoSignedZeros())
        return true;
-     // minimum and maximum intrinsics do not require nsz and nnan flags since
-     // NaN and signed zeroes are propagated in the intrinsic implementation.
+     // minimum/minnum and maximum/maxnum intrinsics do not require nsz and nnan
+     // flags since NaN and signed zeroes are propagated in the intrinsic
+     // implementation.
      return match(I, m_Intrinsic<Intrinsic::minimum>(m_Value(), m_Value())) ||
-            match(I, m_Intrinsic<Intrinsic::maximum>(m_Value(), m_Value()));
+            match(I, m_Intrinsic<Intrinsic::maximum>(m_Value(), m_Value())) ||
+            match(I, m_Intrinsic<Intrinsic::minnum>(m_Value(), m_Value())) ||
+            match(I, m_Intrinsic<Intrinsic::maxnum>(m_Value(), m_Value()));
     };
     if (isIntMinMaxRecurrenceKind(Kind) ||
         (HasRequiredFMF() && isFPMinMaxRecurrenceKind(Kind)))
diff --git a/llvm/test/Transforms/LoopVectorize/minmax_reduction.ll b/llvm/test/Transforms/LoopVectorize/minmax_reduction.ll
index 85a90f2e04c5e..97b65e3435b5a 100644
--- a/llvm/test/Transforms/LoopVectorize/minmax_reduction.ll
+++ b/llvm/test/Transforms/LoopVectorize/minmax_reduction.ll
@@ -1002,7 +1002,8 @@ for.body:                                         ; preds = %entry, %for.body
 }
 
 ; CHECK-LABEL: @fmin_intrinsic_nofast(
-; CHECK-NOT: <2 x float> @llvm.minnum.v2f32
+; CHECK: call <2 x float> @llvm.minnum.v2f32
+; CHECK: call float @llvm.vector.reduce.fmin.v2f32
 define float @fmin_intrinsic_nofast(ptr nocapture readonly %x) {
 entry:
   br label %for.body
@@ -1022,7 +1023,8 @@ for.body:                                         ; preds = %entry, %for.body
 }
 
 ; CHECK-LABEL: @fmax_intrinsic_nofast(
-; CHECK-NOT: <2 x float> @llvm.maxnum.v2f32
+; CHECK: call <2 x float> @llvm.maxnum.v2f32
+; CHECK: call float @llvm.vector.reduce.fmax.v2f32
 define float @fmax_intrinsic_nofast(ptr nocapture readonly %x) {
 entry:
   br label %for.body

@llvmbot
Member

llvmbot commented Apr 23, 2025

@llvm/pr-subscribers-llvm-transforms

@andykaylor
Contributor

Alive2 agrees that maxnum/minnum can be re-ordered (scalar version: https://alive2.llvm.org/ce/z/GVmgBX)

I'm not sure how Alive2 came to this conclusion, but I don't think it's correct. According to the LangRef definition of llvm.maxnum, "if either operand is sNaN, the result is qNaN" but "If either operand is a qNaN, returns the other non-NaN operand." So, if %a is sNaN and neither %b nor %c is NaN,

%max = call float @llvm.maxnum.f32(float %a, float %b )
%max2 = call float @llvm.maxnum.f32(float %c, float %max)

will return %c whereas

%max = call float @llvm.maxnum.f32(float %b, float %c)
%max2 = call float @llvm.maxnum.f32(float %a, float %max)

will return qNaN.

I'm not sure how the LangRef definition of llvm.maxnum squares with the general claim that "Floating-point math operations are allowed to treat all NaNs as if they were quiet NaNs."

@nikic
Contributor

nikic commented Apr 23, 2025

@andykaylor This is the result of a recent (and imho misguided) change to LangRef, see #112852. Probably alive2 implements the old semantics.

@andykaylor
Contributor

@andykaylor This is the result of a recent (and imho misguided) change to LangRef, see #112852. Probably alive2 implements the old semantics.

That change is, at the very least, troubling. We've added more rules about the handling of NaNs over time, but this is the only instance I'm aware of where sNaN and qNaN produce such different result values. @jcranmer-intel seemed to be against the change, while @arsenm was in favor, so they may want to comment here. If that change to the LangRef stands, then this PR has a problem.

I am sympathetic to the point @arsenm makes that if we're going to take the name of an IEEE-754 operation then we should do what IEEE-754 says the operation does, but the current definition still has an exception to that.

@arsenm
Contributor

arsenm commented Apr 23, 2025

We've added more rules about the handling of NaNs over time, but this is the only instance I'm aware of where sNaN and qNaN produce such different result values.

Correct, IEEE really botched this operation definition. It fundamentally has opposite behavior for signaling and quiet NaNs, and I don't see any other way around it. It is always going to be a special case and cannot abide by the "treat as-if it were quiet" handling. llvm.minimumnum is a much more sensible operation, where the weird ignored-NaN handling does work (although it still feels like a legacy concern; llvm.minimum just has completely normal NaN handling).

@stephentyrone
Contributor

I mentioned this to Florian in another channel, but:

  1. yes, fmaxnm/fminnm have always been fundamentally busted
  2. we can avoid the problem for the purposes of lowering C's fmax/fmin without no-honor-nans by multiplying by 1.0 before doing fmaxnm (the multiplication sends sNaN to qNaN, which will then be dropped when the other operand is a number).

@andykaylor
Contributor

2. we can avoid the problem for the purposes of lowering C's fmax/fmin without no-honor-nans by multiplying by 1.0 before doing fmaxnm (the multiplication sends sNaN to qNaN, which will then be dropped when the other operand is a number).

Multiply by 1.0 is likely to be optimized away. I think you'd need llvm.canonicalize instead.
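At the IR level, the idea would then be roughly the following (a hedged sketch with illustrative value names, not the actual lowering): quiet the potentially-signaling input with llvm.canonicalize so that only quiet NaNs reach the maxnum, which then drops them as intended.

; quiet a potential sNaN first; maxnum then drops the (now quiet) NaN
%v.q = call float @llvm.canonicalize.f32(float %v)
%max = call float @llvm.maxnum.f32(float %acc, float %v.q)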

@stephentyrone
Contributor

stephentyrone commented Apr 24, 2025

Sorry, I’m talking about what we need at the machine level in the example that Florian gave me. At the IR level, you’re right that’s a canonicalize.

@arsenm
Contributor

arsenm commented Apr 24, 2025

  2. we can avoid the problem for the purposes of lowering C's fmax/fmin without no-honor-nans by multiplying by 1.0 before doing fmaxnm (the multiplication sends sNaN to qNaN, which will then be dropped when the other operand is a number).

  At the IR level, you’re right that’s a canonicalize.

I don't think we should be inserting canonicalize in the IR for this. You can switch to llvm.minimumnum + nsz to get the ignored-signaling behavior plus the fuzzy signed-zero handling of C fmin. In terms of usability, we mostly shouldn't use llvm.minnum, other than that it's handy in the lowering path for the common case of targets with the old-style operations.
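For reference, that alternative looks something like this (a hedged sketch; operand names are illustrative):

; IEEE-754 2019 minimumNumber semantics: a NaN operand, signaling or quiet, is ignored
; when the other operand is a number; nsz relaxes the -0.0/+0.0 ordering, matching the
; fuzzy signed-zero handling of C fmin
%min = call nsz float @llvm.minimumnum.f32(float %a, float %b)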

@stephentyrone
Contributor

Works for me, as long as we can codegen it well (armv8 at least seems to, from a quick experiment).

@fhahn
Contributor Author

fhahn commented Apr 25, 2025

Ok, so it sounds like the new intrinsics are the way to go. I'll add support for maximumnum/minimumnum reductions and update Clang to emit maximumnum/minimumnum for fmax/fmin.

One thing to note is that GCC uses fmaxnm on AArch64 for fmax, which appears incorrect as per the discussion above: https://godbolt.org/z/W3z1G1z3c

@fhahn fhahn closed this Apr 25, 2025
@fhahn fhahn deleted the lv-vectorize-maxnum-reduction branch April 25, 2025 14:50
@arsenm
Contributor

arsenm commented Apr 25, 2025

One thing to note is that GCC uses fmaxnm on AArch64 for fmax, which appears incorrect as per the discussion above: https://godbolt.org/z/W3z1G1z3c

No, not incorrect. Or it depends on your reading of the spec and at what point in time. glibc had the sNaN-as-qNaN behavior until https://sourceware.org/bugzilla/show_bug.cgi?id=20947. The C99 wording was less clear than the C23 wording, which makes it clearer that fmin/fmax should follow the IEEE-754 2008 behavior that llvm.maxnum is now supposed to have.

@arsenm
Contributor

arsenm commented Apr 25, 2025


Or maybe it was just clarified that signaling NaNs will be quieted; I'm not sure without rereading all of that.

@davemgreen
Collaborator

Yeah - My understanding was that the behaviour of fmax in glibc is equivalent to maxnum, which is equivalent to fmaxnm. So the patch in #112852 brought the documentation in line with how it always worked for AArch64/Arm, and with what fmax should always have produced. It would certainly be nice for min/max reductions to vectorize without nnan, but generating 3 instructions for every fmin/fmax would be a shame too. (Hopefully some of them can detect no-sNaN from the generating instruction and not hit the issue, but it likely won't help everywhere.)

@fhahn
Contributor Author

fhahn commented Apr 25, 2025

Yeah - My understanding was that the behaviour of fmax in glibc is equivalent to maxnum, which is equivalent to fmaxnm. So the patch in #112852 brought the documentation in line with how it always worked for AArch64/Arm, and with what fmax should always have produced. It would certainly be nice for min/max reductions to vectorize without nnan, but generating 3 instructions for every fmin/fmax would be a shame too. (Hopefully some of them can detect no-sNaN from the generating instruction and not hit the issue, but it likely won't help everywhere.)

I guess the question is whether there's anything we could do to still match GCC's behavior? It would indeed be a shame if we had to generate more instructions than GCC ;)

For the vector-reduction case, we will need to generate one extra op to silence the NaN, as the reduction value is guaranteed not to be an sNaN if the start value isn't an sNaN.
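A rough sketch of one way to read "one extra op" (purely illustrative, not the eventual implementation): quiet the start value once outside the loop, e.g. with llvm.canonicalize, so that no sNaN can enter the reduction through the phi.

; before the loop: quiet a potentially-signaling start value once
%start.q = call float @llvm.canonicalize.f32(float %start)
; in the loop, the reduction phi then starts from the quieted value
%rdx = phi float [ %start.q, %entry ], [ %rdx.next, %loop ]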

@fhahn
Contributor Author

fhahn commented Apr 25, 2025

It would certainly be nice for min/max reductions to vectorize without nnan, but generating 3 instructions for every fmin/fmax would be a shame too.

I guess with the new intrinsic definitions, generating maxnum for fmax is technically incorrect and would have to be fixed regardless? Or, if we accept continuing to ignore sNaNs, we should also ignore them for reductions of maxnum, like GCC does.

@davemgreen
Collaborator

davemgreen commented Apr 25, 2025

I guess with the new intrinsic definitions, generating maxnum for fmax is technically incorrect and would have to be fixed regardless?

My understanding is that it is now correct. With the new definition, glibc fmax() == llvm.maxnum == AArch64 fmaxnm.

GCC might have an equally troubled past with sNaN as Clang, and treats it more like NaN under fast-math (but not quite like poison; that goes too far sometimes). They have a -fsignaling-nans option (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fsignaling-nans), which is still described as experimental. And https://gcc.gnu.org/wiki/FloatingPointMath lists what is and isn't supported under the default options. It "does not care about signalling NaNs" under the default FP model (which could be interpreted as a little different from treating them as numbers vs. NaNs in max).

I would be happy if sNaN were relegated to requiring strict-fp and were 'unsupported' otherwise, but I'm not sure people would really like that response. It would allow all architectures to be happier though, with those supporting the 2008 fmax behaviour and those supporting the newer IEEE 2019 behaviour both producing good code generation.

@davemgreen
Collaborator

For the vector-reduction case, we will need to generate one extra op to silence the NaN, as the reduction value is guaranteed not to be an sNaN if the start value isn't an sNaN.

Do you have a motivating case? Does it look like maxnum(reduct_iv, load) or something more complex like maxnum(reduct_iv, fp_op(..))? With the second, if you know it is not an sNaN, then I was wondering whether generating the reduction could still be valid.

@fhahn
Contributor Author

fhahn commented Apr 25, 2025

For the vector-reduction case, we will need to generate one extra op to silence the NaN, as the reduction value is guaranteed not to be an sNaN if the start value isn't an sNaN.

Do you have a motivating case? Does it look like maxnum(reduct_iv, load) or something more complex like maxnum(reduct_iv, fp_op(..))? With the second, if you know it is not an sNaN, then I was wondering whether generating the reduction could still be valid.

We should vectorize both; my motivating case is the first one. But even when using maximumnum, we should be able to lower maximumnum(rdx_phi, fp_op(..)) to fmaxnm without anything else. We could also do that with maxnum(reduct_iv, fp_op(..)), but it gets a bit fuzzy and relies on nothing removing the fp_op later, which would make things incorrect again.
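For reference, the two shapes being contrasted look roughly like this (a hedged sketch with illustrative names):

; shape 1: the non-reduction operand comes straight from memory and may be an sNaN
%v = load float, ptr %gep
%rdx.next = call float @llvm.maxnum.f32(float %rdx, float %v)

; shape 2: the non-reduction operand is produced by an FP op, which (per the discussion
; above) is not expected to hand an sNaN to the maxnum
%v2 = fadd float %a, %b
%rdx.next2 = call float @llvm.maxnum.f32(float %rdx, float %v2)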

@davemgreen
Collaborator

For the vector-reduction case, we will need to generate one extra op to silence the NaN, as the reduction value is guaranteed not to be an sNaN if the start value isn't an sNaN.

Do you have a motivating case? Does it look like maxnum(reduct_iv, load) or something more complex like maxnum(reduct_iv, fp_op(..))? With the second, if you know it is not an sNaN, then I was wondering whether generating the reduction could still be valid.

We should vectorize both; my motivating case is the first one. But even when using maximumnum, we should be able to lower maximumnum(rdx_phi, fp_op(..)) to fmaxnm without anything else. We could also do that with maxnum(reduct_iv, fp_op(..)), but it gets a bit fuzzy and relies on nothing removing the fp_op later, which would make things incorrect again.

Good point, yeah.

@arsenm
Contributor

arsenm commented Apr 25, 2025

For the vector-reduction case, we will need to generate one extra op to silence the NaN, as the reduction value is guaranteed not to be an sNaN if the start value isn't an sNaN.

We should be able to avoid the canonicalizes in the minimumnum codegen in most cases. In AMDGPU we have a DAG optimization to try to prune out redundant canonicalizes during selection. It's theoretically broken if an incoming FP operation that should canonicalize ends up dropping the canonicalize later. If we strengthened the CodeGen rules to disallow the canonicalize dropping we permit in the IR, we could do a better job cleaning these out in the combiner.
