failure to convert FP addition loop into multiplication with fast-math #28268


Status: Open
rotateright opened this issue May 26, 2016 · 8 comments
Labels: bugzilla (Issues migrated from bugzilla), floating-point (Floating-point math), loopoptim

Comments

@rotateright
Contributor

rotateright commented May 26, 2016

Bugzilla Link 27894
Version trunk
OS All
Blocks #34959
CC @apolukhin,@atrick,@Delena,@hfinkel,@RKSimon,@sanjoy

Extended Description

Another example in the loop vectorizer code explosion series (bug 27881, bug 27826) can be seen with:

float multiply_the_hard_way(int n) {
  float sum = 0.0f;
  for (int i=0; i<n; i++)
    sum += 7.0f;

  return sum;
}

This may look ridiculous in simplified form, but the pattern could easily exist in real code, or in slightly more realistic code like bug 27881.

This may be considered a bug before it ever gets to the vectorizer, as discussed recently here:
http://lists.llvm.org/pipermail/llvm-dev/2016-May/099724.html

I.e., if these were integers, we'd convert this into an imul.
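For comparison, here is a sketch of that integer analogue in C (the function name is made up for illustration):

```c
/* Hypothetical integer analogue: at -O2, LLVM folds this loop into a
   single imul guarded by a sign check, because integer addition is
   exactly associative -- no fast-math flag is needed. */
int multiply_the_easy_way(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += 7;
    return sum; /* behaves like: n > 0 ? n * 7 : 0 */
}
```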


$ ./clang -O2 multiply_the_hard_way.c -S -ffast-math -emit-llvm -o -

; ModuleID = 'multiply_the_hard_way.c'
source_filename = "multiply_the_hard_way.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"

; Function Attrs: norecurse nounwind readnone ssp uwtable
define float @multiply_the_hard_way(i32 %n) #0 {
entry:
  %cmp5 = icmp sgt i32 %n, 0
  br i1 %cmp5, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  %min.iters.check = icmp ult i32 %n, 8
  br i1 %min.iters.check, label %for.body.preheader15, label %min.iters.checked

for.body.preheader15:                             ; preds = %middle.block, %min.iters.checked, %for.body.preheader
  %i.07.ph = phi i32 [ 0, %min.iters.checked ], [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
  %sum.06.ph = phi float [ 0.000000e+00, %min.iters.checked ], [ 0.000000e+00, %for.body.preheader ], [ %8, %middle.block ]
  br label %for.body

min.iters.checked:                                ; preds = %for.body.preheader
  %n.vec = and i32 %n, -8
  %cmp.zero = icmp eq i32 %n.vec, 0
  br i1 %cmp.zero, label %for.body.preheader15, label %vector.body.preheader

vector.body.preheader:                            ; preds = %min.iters.checked
  %0 = add i32 %n.vec, -8
  %1 = lshr exact i32 %0, 3
  %2 = add nuw nsw i32 %1, 1
  %xtraiter = and i32 %2, 7
  %3 = icmp ult i32 %0, 56
  br i1 %3, label %middle.block.unr-lcssa, label %vector.body.preheader.new

vector.body.preheader.new:                        ; preds = %vector.body.preheader
  %unroll_iter = sub nsw i32 %2, %xtraiter
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.body.preheader.new
  %vec.phi = phi <4 x float> [ zeroinitializer, %vector.body.preheader.new ], [ %4, %vector.body ]
  %vec.phi9 = phi <4 x float> [ zeroinitializer, %vector.body.preheader.new ], [ %5, %vector.body ]
  %niter = phi i32 [ %unroll_iter, %vector.body.preheader.new ], [ %niter.nsub.7, %vector.body ]
  %4 = fadd fast <4 x float> %vec.phi, <float 5.600000e+01, float 5.600000e+01, float 5.600000e+01, float 5.600000e+01>
  %5 = fadd fast <4 x float> %vec.phi9, <float 5.600000e+01, float 5.600000e+01, float 5.600000e+01, float 5.600000e+01>
  %niter.nsub.7 = add i32 %niter, -8
  %niter.ncmp.7 = icmp eq i32 %niter.nsub.7, 0
  br i1 %niter.ncmp.7, label %middle.block.unr-lcssa.loopexit, label %vector.body, !llvm.loop !2

middle.block.unr-lcssa.loopexit:                  ; preds = %vector.body
  %.lcssa20 = phi <4 x float> [ %5, %vector.body ]
  %.lcssa19 = phi <4 x float> [ %4, %vector.body ]
  br label %middle.block.unr-lcssa

middle.block.unr-lcssa:                           ; preds = %middle.block.unr-lcssa.loopexit, %vector.body.preheader
  %.lcssa16.ph = phi <4 x float> [ undef, %vector.body.preheader ], [ %.lcssa20, %middle.block.unr-lcssa.loopexit ]
  %.lcssa.ph = phi <4 x float> [ undef, %vector.body.preheader ], [ %.lcssa19, %middle.block.unr-lcssa.loopexit ]
  %vec.phi.unr = phi <4 x float> [ zeroinitializer, %vector.body.preheader ], [ %.lcssa19, %middle.block.unr-lcssa.loopexit ]
  %vec.phi9.unr = phi <4 x float> [ zeroinitializer, %vector.body.preheader ], [ %.lcssa20, %middle.block.unr-lcssa.loopexit ]
  %lcmp.mod = icmp eq i32 %xtraiter, 0
  br i1 %lcmp.mod, label %middle.block, label %vector.body.epil.preheader

vector.body.epil.preheader:                       ; preds = %middle.block.unr-lcssa
  br label %vector.body.epil

vector.body.epil:                                 ; preds = %vector.body.epil, %vector.body.epil.preheader
  %vec.phi.epil = phi <4 x float> [ %6, %vector.body.epil ], [ %vec.phi.unr, %vector.body.epil.preheader ]
  %vec.phi9.epil = phi <4 x float> [ %7, %vector.body.epil ], [ %vec.phi9.unr, %vector.body.epil.preheader ]
  %epil.iter = phi i32 [ %epil.iter.sub, %vector.body.epil ], [ %xtraiter, %vector.body.epil.preheader ]
  %6 = fadd fast <4 x float> %vec.phi.epil, <float 7.000000e+00, float 7.000000e+00, float 7.000000e+00, float 7.000000e+00>
  %7 = fadd fast <4 x float> %vec.phi9.epil, <float 7.000000e+00, float 7.000000e+00, float 7.000000e+00, float 7.000000e+00>
  %epil.iter.sub = add i32 %epil.iter, -1
  %epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
  br i1 %epil.iter.cmp, label %middle.block.epilog-lcssa, label %vector.body.epil, !llvm.loop !5

middle.block.epilog-lcssa:                        ; preds = %vector.body.epil
  %.lcssa22 = phi <4 x float> [ %7, %vector.body.epil ]
  %.lcssa21 = phi <4 x float> [ %6, %vector.body.epil ]
  br label %middle.block

middle.block:                                     ; preds = %middle.block.unr-lcssa, %middle.block.epilog-lcssa
  %.lcssa16 = phi <4 x float> [ %.lcssa16.ph, %middle.block.unr-lcssa ], [ %.lcssa22, %middle.block.epilog-lcssa ]
  %.lcssa = phi <4 x float> [ %.lcssa.ph, %middle.block.unr-lcssa ], [ %.lcssa21, %middle.block.epilog-lcssa ]
  %bin.rdx = fadd fast <4 x float> %.lcssa16, %.lcssa
  %rdx.shuf = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx12 = fadd fast <4 x float> %bin.rdx, %rdx.shuf
  %rdx.shuf13 = shufflevector <4 x float> %bin.rdx12, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx14 = fadd fast <4 x float> %bin.rdx12, %rdx.shuf13
  %8 = extractelement <4 x float> %bin.rdx14, i32 0
  %cmp.n = icmp eq i32 %n.vec, %n
  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader15

for.cond.cleanup.loopexit:                        ; preds = %for.body
  %add.lcssa = phi float [ %add, %for.body ]
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %middle.block, %entry
  %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %8, %middle.block ], [ %add.lcssa, %for.cond.cleanup.loopexit ]
  ret float %sum.0.lcssa

for.body:                                         ; preds = %for.body.preheader15, %for.body
  %i.07 = phi i32 [ %inc, %for.body ], [ %i.07.ph, %for.body.preheader15 ]
  %sum.06 = phi float [ %add, %for.body ], [ %sum.06.ph, %for.body.preheader15 ]
  %add = fadd fast float %sum.06, 7.000000e+00
  %inc = add nuw nsw i32 %i.07, 1
  %exitcond = icmp eq i32 %inc, %n
  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !7
}

attributes #0 = { norecurse nounwind readnone ssp uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="core2" "target-features"="+cx16,+fxsr,+mmx,+sse,+sse2,+sse3,+ssse3,+x87" "unsafe-fp-math"="true" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"PIC Level", i32 2}
!1 = !{!"clang version 3.9.0 (trunk 270847)"}
!2 = distinct !{!2, !3, !4}
!3 = !{!"llvm.loop.vectorize.width", i32 1}
!4 = !{!"llvm.loop.interleave.count", i32 1}
!5 = distinct !{!5, !6}
!6 = !{!"llvm.loop.unroll.disable"}
!7 = distinct !{!7, !8, !3, !4}
!8 = !{!"llvm.loop.unroll.runtime.disable"}
@delena
Mannequin

delena mannequin commented May 26, 2016

Do you suggest to convert this sequence to fmul? Is it the right thing to do under fast-math?

@rotateright
Contributor Author

rotateright commented May 26, 2016

Do you suggest to convert this sequence to fmul? Is it the right thing to do
under fast-math?

I think so (although I don't know which pass is responsible for the transform yet).

For integers, we have to guard against a negative value for 'n':

define i32 @multiply_the_hard_way(i32 %n) {
  %cmp = icmp sgt i32 %n, 0
  %mul = mul i32 %n, 7
  %sel = select i1 %cmp, i32 %mul, i32 0
  ret i32 %sel
}

So for floats, it should be similar?

define float @multiply_the_hard_way(i32 %n) {
  %cmp = icmp sgt i32 %n, 0
  %n_as_float = sitofp i32 %n to float
  %mul = fmul float %n_as_float, 7.0
  %sel = select i1 %cmp, float %mul, float 0.0
  ret i32 %sel
}
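At the C level, the proposed folding would correspond to something like this (a hypothetical closed form, not actual compiler output; the function name is illustrative):

```c
/* Sketch of the intended fast-math folding: convert the trip count
   to float, multiply by the loop-invariant addend, and guard the
   n <= 0 case so the never-taken loop still yields 0.0f. */
float multiply_the_closed_form(int n) {
    return n > 0 ? (float)n * 7.0f : 0.0f;
}
```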

@rotateright
Contributor Author

ret i32 %sel

Typo: i32 --> float

@rotateright
Contributor Author

Note that if 'n' is over ~2^23, we wouldn't have an exact result. I haven't looked to see how the add loop would diverge from an fmul in that case. The difference could be judged too much even for fast-math...although at first glance, I would think the multiply would be the more accurate answer!
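To make the divergence concrete, here is a small strict-FP sketch (using a step of 1.0f rather than 7.0f so the absorption threshold is the familiar 2^24; no fast-math flags assumed, and the function names are made up):

```c
/* With strict IEEE float arithmetic, the running sum saturates at
   2^24: 16777216.0f + 1.0f rounds back to 16777216.0f, so every
   later iteration is a no-op. The closed-form multiply stays exact
   here because 1 << 25 is a power of two and converts to float
   without rounding. */
float sum_loop(int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += 1.0f;
    return sum;
}

float sum_mul(int n) {
    return n > 0 ? (float)n * 1.0f : 0.0f;
}
```

So for large trip counts the multiply is not just faster; it is the more accurate answer, as noted above.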

@atrick
Contributor

atrick commented Jun 2, 2016

Induction variable simplification will effectively remove this loop if SCEV expressions are formed for the floating point operations.

The question is

  • is this a desirable optimization in real code
  • is this transformation sound

It's not clear to me that fast-math means that floating point operations can be treated as scaled, infinite precision integer operations, which is effectively what you're doing. This is way beyond reassociation.

@rotateright
Contributor Author

Induction variable simplification will effectively remove this loop if SCEV
expressions are formed for the floating point operations.

The question is

  • is this a desirable optimization in real code

I think patterns like what we have in bug 27881 are common, so IMO, yes.

  • is this transformation sound

It's not clear to me that fast-math means that floating point operations can
be treated as scaled, infinite precision integer operations, which is
effectively what you're doing. This is way beyond reassociation.

Agreed. This definitely pushes the (unspecified) bounds of fast-math. I wonder if the fact that the SCEV likely produces a more accurate answer than what the program would produce unoptimized should be taken into consideration.

@hfinkel
Collaborator

hfinkel commented Jun 2, 2016

Induction variable simplification will effectively remove this loop if SCEV
expressions are formed for the floating point operations.

The question is

  • is this a desirable optimization in real code

I think patterns like what we have in bug 27881 are common, so IMO, yes.

  • is this transformation sound

It's not clear to me that fast-math means that floating point operations can
be treated as scaled, infinite precision integer operations, which is
effectively what you're doing. This is way beyond reassociation.

Agreed. This definitely pushes the (unspecified) bounds of fast-math. I
wonder if the fact that the SCEV likely produces a more accurate answer than
what the program would produce unoptimized should be taken into
consideration.

So this is tricky. First, because fast-math already does this sometimes (just because of reassociation), but, second, because there's no sequence of local transformations the user could do to make the unoptimized program produce the same result (unlike reassociation).

In the end, I'm okay with performing this transformation under fast-math. I do believe, however, it depends on how we define it. We probably should try to define it. Maybe something like this, "-ffast-math enables the compiler to perform transformations on floating-point calculations that are valid when treating the floating-point values as mathematical real numbers, but not semantics preserving when considering the floating-point values' machine representations. It also enables the compiler to perform transformations resulting in floating-point calculations computing fewer correct bits than they would otherwise."
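As a small illustration of the "fewer correct bits" clause: plain reassociation already changes results under strict IEEE float arithmetic (a hedged demo; the function names are invented, and it assumes float expressions are evaluated in single precision, i.e. FLT_EVAL_METHOD == 0):

```c
/* Reassociating (a + b) - b into a + (b - b) changes the answer:
   with a = 1.0f and b = 1e8f, the addend 1.0f is absorbed when
   added to 1e8f (ulp there is 8), so the left form loses it
   entirely while the right form keeps it. */
float left_assoc(float a, float b)  { return (a + b) - b; }
float right_assoc(float a, float b) { return a + (b - b); }
```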

@RKSimon
Collaborator

RKSimon commented Nov 27, 2021

mentioned in issue #34959

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@arsenm arsenm added the floating-point Floating-point math label Aug 15, 2023