failure to convert FP addition loop into multiplication with fast-math #28268


Status: Open
rotateright opened this issue May 26, 2016 · 8 comments
Labels: bugzilla (Issues migrated from bugzilla), floating-point (Floating-point math), loopoptim

Comments

@rotateright
Contributor

rotateright commented May 26, 2016

Bugzilla Link 27894
Version trunk
OS All
Blocks #34959
CC @apolukhin,@atrick,@Delena,@hfinkel,@RKSimon,@sanjoy

Extended Description

Another example in the loop vectorizer code explosion series (bug 27881, bug 27826) can be seen with:

float multiply_the_hard_way(int n) {
  float sum = 0.0f;
  for (int i=0; i<n; i++)
    sum += 7.0f;

  return sum;
}

This may look ridiculous in simplified form, but the pattern could easily exist in real code, or in slightly more realistic code like bug 27881.

This may be considered a bug before it ever gets to the vectorizer, as discussed recently here:
http://lists.llvm.org/pipermail/llvm-dev/2016-May/099724.html

I.e., if these were integers, we'd convert this into an imul.
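For comparison, here is a sketch of that integer analogue in C (the function name is made up for illustration):

```c
/* Hypothetical integer analogue: at -O2, LLVM folds this loop into a
   single imul guarded by a sign check, because integer addition is
   exactly associative -- no fast-math flag is needed. */
int multiply_the_easy_way(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += 7;
    return sum; /* behaves like: n > 0 ? n * 7 : 0 */
}
```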


$ ./clang -O2 multiply_the_hard_way.c -S -ffast-math -emit-llvm -o -

; ModuleID = 'multiply_the_hard_way.c'
source_filename = "multiply_the_hard_way.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"

; Function Attrs: norecurse nounwind readnone ssp uwtable
define float @multiply_the_hard_way(i32 %n) #0 {
entry:
  %cmp5 = icmp sgt i32 %n, 0
  br i1 %cmp5, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  %min.iters.check = icmp ult i32 %n, 8
  br i1 %min.iters.check, label %for.body.preheader15, label %min.iters.checked

for.body.preheader15:                             ; preds = %middle.block, %min.iters.checked, %for.body.preheader
  %i.07.ph = phi i32 [ 0, %min.iters.checked ], [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
  %sum.06.ph = phi float [ 0.000000e+00, %min.iters.checked ], [ 0.000000e+00, %for.body.preheader ], [ %8, %middle.block ]
  br label %for.body

min.iters.checked:                                ; preds = %for.body.preheader
  %n.vec = and i32 %n, -8
  %cmp.zero = icmp eq i32 %n.vec, 0
  br i1 %cmp.zero, label %for.body.preheader15, label %vector.body.preheader

vector.body.preheader:                            ; preds = %min.iters.checked
  %0 = add i32 %n.vec, -8
  %1 = lshr exact i32 %0, 3
  %2 = add nuw nsw i32 %1, 1
  %xtraiter = and i32 %2, 7
  %3 = icmp ult i32 %0, 56
  br i1 %3, label %middle.block.unr-lcssa, label %vector.body.preheader.new

vector.body.preheader.new:                        ; preds = %vector.body.preheader
  %unroll_iter = sub nsw i32 %2, %xtraiter
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.body.preheader.new
  %vec.phi = phi <4 x float> [ zeroinitializer, %vector.body.preheader.new ], [ %4, %vector.body ]
  %vec.phi9 = phi <4 x float> [ zeroinitializer, %vector.body.preheader.new ], [ %5, %vector.body ]
  %niter = phi i32 [ %unroll_iter, %vector.body.preheader.new ], [ %niter.nsub.7, %vector.body ]
  %4 = fadd fast <4 x float> %vec.phi, <float 5.600000e+01, float 5.600000e+01, float 5.600000e+01, float 5.600000e+01>
  %5 = fadd fast <4 x float> %vec.phi9, <float 5.600000e+01, float 5.600000e+01, float 5.600000e+01, float 5.600000e+01>
  %niter.nsub.7 = add i32 %niter, -8
  %niter.ncmp.7 = icmp eq i32 %niter.nsub.7, 0
  br i1 %niter.ncmp.7, label %middle.block.unr-lcssa.loopexit, label %vector.body, !llvm.loop !2

middle.block.unr-lcssa.loopexit:                  ; preds = %vector.body
  %.lcssa20 = phi <4 x float> [ %5, %vector.body ]
  %.lcssa19 = phi <4 x float> [ %4, %vector.body ]
  br label %middle.block.unr-lcssa

middle.block.unr-lcssa:                           ; preds = %middle.block.unr-lcssa.loopexit, %vector.body.preheader
  %.lcssa16.ph = phi <4 x float> [ undef, %vector.body.preheader ], [ %.lcssa20, %middle.block.unr-lcssa.loopexit ]
  %.lcssa.ph = phi <4 x float> [ undef, %vector.body.preheader ], [ %.lcssa19, %middle.block.unr-lcssa.loopexit ]
  %vec.phi.unr = phi <4 x float> [ zeroinitializer, %vector.body.preheader ], [ %.lcssa19, %middle.block.unr-lcssa.loopexit ]
  %vec.phi9.unr = phi <4 x float> [ zeroinitializer, %vector.body.preheader ], [ %.lcssa20, %middle.block.unr-lcssa.loopexit ]
  %lcmp.mod = icmp eq i32 %xtraiter, 0
  br i1 %lcmp.mod, label %middle.block, label %vector.body.epil.preheader

vector.body.epil.preheader:                       ; preds = %middle.block.unr-lcssa
  br label %vector.body.epil

vector.body.epil:                                 ; preds = %vector.body.epil, %vector.body.epil.preheader
  %vec.phi.epil = phi <4 x float> [ %6, %vector.body.epil ], [ %vec.phi.unr, %vector.body.epil.preheader ]
  %vec.phi9.epil = phi <4 x float> [ %7, %vector.body.epil ], [ %vec.phi9.unr, %vector.body.epil.preheader ]
  %epil.iter = phi i32 [ %epil.iter.sub, %vector.body.epil ], [ %xtraiter, %vector.body.epil.preheader ]
  %6 = fadd fast <4 x float> %vec.phi.epil, <float 7.000000e+00, float 7.000000e+00, float 7.000000e+00, float 7.000000e+00>
  %7 = fadd fast <4 x float> %vec.phi9.epil, <float 7.000000e+00, float 7.000000e+00, float 7.000000e+00, float 7.000000e+00>
  %epil.iter.sub = add i32 %epil.iter, -1
  %epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
  br i1 %epil.iter.cmp, label %middle.block.epilog-lcssa, label %vector.body.epil, !llvm.loop !5

middle.block.epilog-lcssa:                        ; preds = %vector.body.epil
  %.lcssa22 = phi <4 x float> [ %7, %vector.body.epil ]
  %.lcssa21 = phi <4 x float> [ %6, %vector.body.epil ]
  br label %middle.block

middle.block:                                     ; preds = %middle.block.unr-lcssa, %middle.block.epilog-lcssa
  %.lcssa16 = phi <4 x float> [ %.lcssa16.ph, %middle.block.unr-lcssa ], [ %.lcssa22, %middle.block.epilog-lcssa ]
  %.lcssa = phi <4 x float> [ %.lcssa.ph, %middle.block.unr-lcssa ], [ %.lcssa21, %middle.block.epilog-lcssa ]
  %bin.rdx = fadd fast <4 x float> %.lcssa16, %.lcssa
  %rdx.shuf = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx12 = fadd fast <4 x float> %bin.rdx, %rdx.shuf
  %rdx.shuf13 = shufflevector <4 x float> %bin.rdx12, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx14 = fadd fast <4 x float> %bin.rdx12, %rdx.shuf13
  %8 = extractelement <4 x float> %bin.rdx14, i32 0
  %cmp.n = icmp eq i32 %n.vec, %n
  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader15

for.cond.cleanup.loopexit:                        ; preds = %for.body
  %add.lcssa = phi float [ %add, %for.body ]
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %middle.block, %entry
  %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %8, %middle.block ], [ %add.lcssa, %for.cond.cleanup.loopexit ]
  ret float %sum.0.lcssa

for.body:                                         ; preds = %for.body.preheader15, %for.body
  %i.07 = phi i32 [ %inc, %for.body ], [ %i.07.ph, %for.body.preheader15 ]
  %sum.06 = phi float [ %add, %for.body ], [ %sum.06.ph, %for.body.preheader15 ]
  %add = fadd fast float %sum.06, 7.000000e+00
  %inc = add nuw nsw i32 %i.07, 1
  %exitcond = icmp eq i32 %inc, %n
  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !7
}

attributes #0 = { norecurse nounwind readnone ssp uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="core2" "target-features"="+cx16,+fxsr,+mmx,+sse,+sse2,+sse3,+ssse3,+x87" "unsafe-fp-math"="true" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"PIC Level", i32 2}
!1 = !{!"clang version 3.9.0 (trunk 270847)"}
!2 = distinct !{!2, !3, !4}
!3 = !{!"llvm.loop.vectorize.width", i32 1}
!4 = !{!"llvm.loop.interleave.count", i32 1}
!5 = distinct !{!5, !6}
!6 = !{!"llvm.loop.unroll.disable"}
!7 = distinct !{!7, !8, !3, !4}
!8 = !{!"llvm.loop.unroll.runtime.disable"}
@delena
Mannequin

delena mannequin commented May 26, 2016

Do you suggest to convert this sequence to fmul? Is it the right thing to do under fast-math?

@rotateright
Contributor Author

rotateright commented May 26, 2016

Do you suggest to convert this sequence to fmul? Is it the right thing to do
under fast-math?

I think so (although I don't know which pass is responsible for the transform yet).

For integers, we have to guard against a negative value for 'n':

define i32 @multiply_the_hard_way(i32 %n) {
  %cmp = icmp sgt i32 %n, 0
  %mul = mul i32 %n, 7
  %sel = select i1 %cmp, i32 %mul, i32 0
  ret i32 %sel
}

So for floats, it should be similar?

define float @multiply_the_hard_way(i32 %n) {
  %cmp = icmp sgt i32 %n, 0
  %n_as_float = sitofp i32 %n to float
  %mul = fmul float %n_as_float, 7.0
  %sel = select i1 %cmp, float %mul, float 0.0
  ret i32 %sel
}
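At the C level, the proposed folding would correspond to something like this (a hypothetical closed form, not actual compiler output; the function name is illustrative):

```c
/* Sketch of the intended fast-math folding: convert the trip count
   to float, multiply by the loop-invariant addend, and guard the
   n <= 0 case so the never-taken loop still yields 0.0f. */
float multiply_the_closed_form(int n) {
    return n > 0 ? (float)n * 7.0f : 0.0f;
}
```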

@rotateright
Contributor Author

ret i32 %sel

Typo: i32 --> float

@rotateright
Contributor Author

Note that if 'n' is over ~2^23, we wouldn't have an exact result. I haven't looked to see how the add loop would diverge from an fmul in that case. The difference could be judged too much even for fast-math...although at first glance, I would think the multiply would be the more accurate answer!
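To make the divergence concrete, here is a small strict-FP sketch (using a step of 1.0f rather than 7.0f so the absorption threshold is the familiar 2^24; no fast-math flags assumed, and the function names are made up):

```c
/* With strict IEEE float arithmetic, the running sum saturates at
   2^24: 16777216.0f + 1.0f rounds back to 16777216.0f, so every
   later iteration is a no-op. The closed-form multiply stays exact
   here because 1 << 25 is a power of two and converts to float
   without rounding. */
float sum_loop(int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += 1.0f;
    return sum;
}

float sum_mul(int n) {
    return n > 0 ? (float)n * 1.0f : 0.0f;
}
```

So for large trip counts the multiply is not just faster; it is the more accurate answer, as noted above.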

@atrick
Contributor

atrick commented Jun 2, 2016

Induction variable simplification will effectively remove this loop if SCEV expressions are formed for the floating point operations.

The question is

  • is this a desirable optimization in real code
  • is this transformation sound

It's not clear to me that fast-math means that floating point operations can be treated as scaled, infinite precision integer operations, which is effectively what you're doing. This is way beyond reassociation.

@rotateright
Contributor Author

Induction variable simplification will effectively remove this loop if SCEV
expressions are formed for the floating point operations.

The question is

  • is this a desirable optimization in real code

I think patterns like what we have in bug 27881 are common, so IMO, yes.

  • is this transformation sound

It's not clear to me that fast-math means that floating point operations can
be treated as scaled, infinite precision integer operations, which is
effectively what you're doing. This is way beyond reassociation.

Agreed. This definitely pushes the (unspecified) bounds of fast-math. I wonder if the fact that the SCEV likely produces a more accurate answer than what the program would produce unoptimized should be taken into consideration.

@hfinkel
Collaborator

hfinkel commented Jun 2, 2016

Induction variable simplification will effectively remove this loop if SCEV
expressions are formed for the floating point operations.

The question is

  • is this a desirable optimization in real code

I think patterns like what we have in bug 27881 are common, so IMO, yes.

  • is this transformation sound

It's not clear to me that fast-math means that floating point operations can
be treated as scaled, infinite precision integer operations, which is
effectively what you're doing. This is way beyond reassociation.

Agreed. This definitely pushes the (unspecified) bounds of fast-math. I
wonder if the fact that the SCEV likely produces a more accurate answer than
what the program would produce unoptimized should be taken into
consideration.

So this is tricky. First, because fast-math already does this sometimes (just because of reassociation), but, second, because there's no sequence of local transformations the user could do to make the unoptimized program produce the same result (unlike reassociation).

In the end, I'm okay with performing this transformation under fast-math. I do believe, however, it depends on how we define it. We probably should try to define it. Maybe something like this, "-ffast-math enables the compiler to perform transformations on floating-point calculations that are valid when treating the floating-point values as mathematical real numbers, but not semantics preserving when considering the floating-point values' machine representations. It also enables the compiler to perform transformations resulting in floating-point calculations computing fewer correct bits than they would otherwise."
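As a small illustration of the "fewer correct bits" clause: plain reassociation already changes results under strict IEEE float arithmetic (a hedged demo; the function names are invented, and it assumes float expressions are evaluated in single precision, i.e. FLT_EVAL_METHOD == 0):

```c
/* Reassociating (a + b) - b into a + (b - b) changes the answer:
   with a = 1.0f and b = 1e8f, the addend 1.0f is absorbed when
   added to 1e8f (ulp there is 8), so the left form loses it
   entirely while the right form keeps it. */
float left_assoc(float a, float b)  { return (a + b) - b; }
float right_assoc(float a, float b) { return a + (b - b); }
```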

@RKSimon
Collaborator

RKSimon commented Nov 27, 2021

mentioned in issue #34959

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@arsenm arsenm added the floating-point Floating-point math label Aug 15, 2023