[llvm-bugs] [Bug 47840] New: De-optimized vector loop with -ffast-math

Wed Oct 14 04:06:52 PDT 2020

https://bugs.llvm.org/show_bug.cgi?id=47840

            Bug ID: 47840
           Summary: De-optimized vector loop with -ffast-math
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: martin.pavlicek11 at gmail.com
                CC: llvm-bugs at lists.llvm.org

The following C++ snippet gets badly "de-optimized" when compiled with
`clang -O3` plus the `-ffast-math` flag on an x86_64 target. Runtime regresses
by about 20% at larger counts (more at small counts) compared to the same build
without the flag.
https://llvm.godbolt.org/z/nndY4f

Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
|    Method  | Count |        Mean |     Error |    StdDev |
|----------- |------ |------------:|----------:|----------:|
| NoFastMath |    10 |    19.89 ns |  0.042 ns |  0.035 ns |
| FastMath   |    10 |    31.68 ns |  0.596 ns |  0.558 ns |
| NoFastMath |   100 |   119.88 ns |  0.808 ns |  0.755 ns |
| FastMath   |   100 |   146.66 ns |  0.638 ns |  0.565 ns |
| NoFastMath |  1000 | 1,103.35 ns |  8.993 ns |  8.412 ns |
| FastMath   |  1000 | 1,331.71 ns |  8.616 ns |  8.060 ns |

 Count  : Value of `count` parameter
 Mean   : Arithmetic mean of all measurements
 Error  : Half of 99.9% confidence interval
 StdDev : Standard deviation of all measurements

The case can be reduced to the following IR. Notice the `reassoc` flag on the
4 `fadd` instructions; when you remove the flags, the generated code gets back
to normal.
```
; https://llvm.godbolt.org/z/T9Mcdf
%struct.Vec = type { float, float, float, float }

define dso_local void @Funtion(%struct.Vec* nocapture readonly %0, i32 %1,
%struct.Vec* nocapture %2)
{
  %4 = icmp sgt i32 %1, 0
  br i1 %4, label %5, label %7

5:                                                ; preds = %3
  %6 = zext i32 %1 to i64
  br label %16

7:                                                ; preds = %16, %3
  %8 = phi float [ 0.000000e+00, %3 ], [ %33, %16 ]
  %9 = phi float [ 0.000000e+00, %3 ], [ %32, %16 ]
  %10 = phi float [ 0.000000e+00, %3 ], [ %31, %16 ]
  %11 = phi float [ 0.000000e+00, %3 ], [ %30, %16 ]
  %12 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 0
  store float %11, float* %12
  %13 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 1
  store float %10, float* %13
  %14 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 2
  store float %9, float* %14
  %15 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 3
  store float %8, float* %15
  ret void

16:                                               ; preds = %16, %5
  %17 = phi i64 [ 0, %5 ], [ %34, %16 ]
  %18 = phi float [ 0.000000e+00, %5 ], [ %30, %16 ]
  %19 = phi float [ 0.000000e+00, %5 ], [ %31, %16 ]
  %20 = phi float [ 0.000000e+00, %5 ], [ %32, %16 ]
  %21 = phi float [ 0.000000e+00, %5 ], [ %33, %16 ]
  %22 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 0
  %23 = load float, float* %22
  %24 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 1
  %25 = load float, float* %24
  %26 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 2
  %27 = load float, float* %26
  %28 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 3
  %29 = load float, float* %28
  %30 = fadd reassoc float %18, %23
  %31 = fadd reassoc float %19, %25
  %32 = fadd reassoc float %20, %27
  %33 = fadd reassoc float %21, %29
  %34 = add nuw nsw i64 %17, 1
  %35 = icmp eq i64 %34, %6
  br i1 %35, label %7, label %16
}
```
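For readers who prefer source over IR, a C++ function of roughly the following
shape would lower to IR like the above. This is only a sketch reconstructed
from the IR, not the original snippet from the Compiler Explorer link: the name
`SumVecs` and the member names `x`, `y`, `z`, `w` are assumptions (the IR only
shows a struct of four floats).

```cpp
// Hypothetical reconstruction from the reduced IR; names are assumed.
struct Vec {
    float x, y, z, w;  // the IR only tells us there are four floats
};

// Sums `count` input Vecs component-wise and stores the result in *out.
// With -ffast-math each `+=` below becomes an `fadd` carrying fast-math
// flags (including `reassoc`); without it the adds are strict.
void SumVecs(const Vec* in, int count, Vec* out) {
    float x = 0.0f, y = 0.0f, z = 0.0f, w = 0.0f;
    for (int i = 0; i < count; ++i) {
        x += in[i].x;
        y += in[i].y;
        z += in[i].z;
        w += in[i].w;
    }
    out->x = x;
    out->y = y;
    out->z = z;
    out->w = w;
}
```

Compiling this once with `clang -O3` and once with `clang -O3 -ffast-math`
should show whether the two loop shapes discussed in this report appear.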

As best as I can evaluate the issue, the `reassoc` flag permits the "Loop
Vectorizer" pass to vectorize the loop early on in a per-component manner
(e.g. 4 Xs together, 4 Ys together, 4 Zs, 4 Qs per iteration). That prevents
the clean per-Vec vectorization (e.g. adding the 4 consecutive components of a
single Vec to the components of another Vec) that the "SLP Vectorizer" would
otherwise perform later down the line.
Without the `reassoc` flag (i.e. without -ffast-math) this "optimization" fails
and the "SLP Vectorizer" can proceed cleanly, resulting in fewer shuffles and
faster code.
Recompiling the original example with `-fno-vectorize` confirms this
observation.
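To run the same experiment on a single loop instead of the whole translation
unit, Clang's loop pragma can be used. The sketch below is an assumption-laden
illustration: the function and member names are hypothetical, and no timings
were taken with it. The pragma asks the loop vectorizer to skip only this
loop, and it is guarded so that other compilers ignore it.

```cpp
// Hypothetical names; only the pragma itself is a documented Clang feature.
struct Vec {
    float x, y, z, w;
};

void SumVecsNoLoopVectorize(const Vec* in, int count, Vec* out) {
    float x = 0.0f, y = 0.0f, z = 0.0f, w = 0.0f;
#if defined(__clang__)
    // Disable the loop vectorizer for the loop that follows.
#pragma clang loop vectorize(disable)
#endif
    for (int i = 0; i < count; ++i) {
        x += in[i].x;
        y += in[i].y;
        z += in[i].z;
        w += in[i].w;
    }
    out->x = x;
    out->y = y;
    out->z = z;
    out->w = w;
}
```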

You can see the results after the respective passes here:
 Loop Vectorizer with `reassoc`. Notice the `shufflevector <16 x float>
%wide.vec, <16 x float> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>` that
the rest of the pipeline fails to get rid of:
 https://llvm.godbolt.org/z/Wvfn6K

 SLP Vectorizer without `reassoc`. Notice that this time the "SLP Vectorizer"
correctly figures out that it can combine the loop body into a simple and fast
SIMD form:
 https://llvm.godbolt.org/z/ojc4e3
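The two shapes can be modeled in plain C++, with a `std::array<float, 4>`
standing in for one 4-wide SIMD register. This is a simplified sketch under
that assumption, not the code either pass actually emits, and all names in it
are hypothetical. The per-component version gathers every 4th float (the
stride-4 shuffle mask above) and changes the order of the additions, which is
why it needs `reassoc`. The per-Vec version keeps the original order.

```cpp
#include <array>
#include <cstddef>

struct Vec {
    float x, y, z, w;
};

using Lanes = std::array<float, 4>;  // stand-in for one 4 x float register

// Loop-vectorizer shape: 4 Vecs per iteration, one accumulator register per
// component. Each register is filled with a stride-4 gather from memory.
Vec SumPerComponent(const Vec* in, std::size_t count) {
    Lanes accX{}, accY{}, accZ{}, accW{};
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            accX[lane] += in[i + lane].x;
            accY[lane] += in[i + lane].y;
            accZ[lane] += in[i + lane].z;
            accW[lane] += in[i + lane].w;
        }
    }
    // Horizontal reduction of each accumulator register.
    Vec r{accX[0] + accX[1] + accX[2] + accX[3],
          accY[0] + accY[1] + accY[2] + accY[3],
          accZ[0] + accZ[1] + accZ[2] + accZ[3],
          accW[0] + accW[1] + accW[2] + accW[3]};
    // Scalar remainder for counts that are not a multiple of 4.
    for (; i < count; ++i) {
        r.x += in[i].x;
        r.y += in[i].y;
        r.z += in[i].z;
        r.w += in[i].w;
    }
    return r;
}

// SLP shape: one Vec per iteration, loaded as a single contiguous register
// and added lane-wise to one accumulator. No shuffles, no reordering.
Vec SumPerVec(const Vec* in, std::size_t count) {
    Lanes acc{};
    for (std::size_t i = 0; i < count; ++i) {
        acc[0] += in[i].x;
        acc[1] += in[i].y;
        acc[2] += in[i].z;
        acc[3] += in[i].w;
    }
    return Vec{acc[0], acc[1], acc[2], acc[3]};
}
```

With inputs that are exactly representable the two functions return the same
sums; in general the per-component version may differ in the last bits because
of the reordered additions.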

I'm no optimization pipeline expert, so this solution is probably very
suboptimal and maybe even really silly, so please don't beat me for it, but it
looks to me that adding one more "SLP Vectorizer" pass before the "Loop
Vectorizer" could catch cases like this.
Besides a certain impact on compilation time, should one expect any unwanted
runtime side effects or regressions from such a solution?
