[llvm-bugs] [Bug 47840] New: De-optimized vector loop with -ffast-math
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Oct 14 04:06:52 PDT 2020
https://bugs.llvm.org/show_bug.cgi?id=47840
Bug ID: 47840
Summary: De-optimized vector loop with -ffast-math
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: enhancement
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: martin.pavlicek11 at gmail.com
CC: llvm-bugs at lists.llvm.org
The following C++ snippet gets badly "de-optimized" when compiled with
`clang -O3` plus the `-ffast-math` flag: on an x86_64 target it runs roughly 20%
slower than the same code built without the flag (closer to 60% slower for the
smallest input measured below).
https://llvm.godbolt.org/z/nndY4f
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
| Method | Count | Mean | Error | StdDev |
|----------- |------ |------------:|----------:|----------:|
| NoFastMath | 10 | 19.89 ns | 0.042 ns | 0.035 ns |
| FastMath | 10 | 31.68 ns | 0.596 ns | 0.558 ns |
| NoFastMath | 100 | 119.88 ns | 0.808 ns | 0.755 ns |
| FastMath | 100 | 146.66 ns | 0.638 ns | 0.565 ns |
| NoFastMath | 1000 | 1,103.35 ns | 8.993 ns | 8.412 ns |
| FastMath | 1000 | 1,331.71 ns | 8.616 ns | 8.060 ns |
Count : Value of `count` parameter
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
The case can be reduced to the following IR. Notice the `reassoc` flag on the
4 `fadd` instructions; when you remove it, the generated code gets back to normal.
```
; https://llvm.godbolt.org/z/T9Mcdf
%struct.Vec = type { float, float, float, float }
define dso_local void @Funtion(%struct.Vec* nocapture readonly %0, i32 %1,
%struct.Vec* nocapture %2)
{
%4 = icmp sgt i32 %1, 0
br i1 %4, label %5, label %7
5: ; preds = %3
%6 = zext i32 %1 to i64
br label %16
7: ; preds = %16, %3
%8 = phi float [ 0.000000e+00, %3 ], [ %33, %16 ]
%9 = phi float [ 0.000000e+00, %3 ], [ %32, %16 ]
%10 = phi float [ 0.000000e+00, %3 ], [ %31, %16 ]
%11 = phi float [ 0.000000e+00, %3 ], [ %30, %16 ]
%12 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 0
store float %11, float* %12
%13 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 1
store float %10, float* %13
%14 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 2
store float %9, float* %14
%15 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 3
store float %8, float* %15
ret void
16: ; preds = %16, %5
%17 = phi i64 [ 0, %5 ], [ %34, %16 ]
%18 = phi float [ 0.000000e+00, %5 ], [ %30, %16 ]
%19 = phi float [ 0.000000e+00, %5 ], [ %31, %16 ]
%20 = phi float [ 0.000000e+00, %5 ], [ %32, %16 ]
%21 = phi float [ 0.000000e+00, %5 ], [ %33, %16 ]
%22 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 0
%23 = load float, float* %22
%24 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 1
%25 = load float, float* %24
%26 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 2
%27 = load float, float* %26
%28 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 3
%29 = load float, float* %28
%30 = fadd reassoc float %18, %23
%31 = fadd reassoc float %19, %25
%32 = fadd reassoc float %20, %27
%33 = fadd reassoc float %21, %29
%34 = add nuw nsw i64 %17, 1
%35 = icmp eq i64 %34, %6
br i1 %35, label %7, label %16
}
```
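For readers who cannot open the Godbolt links, the IR above corresponds roughly to a C++ loop like the one below. This is a reconstruction from the IR, not the original snippet: the function name `SumVecs` and the field names `x`, `y`, `z`, `q` are assumptions (the IR only shows a struct of four floats and a function spelled `Funtion`).

```cpp
// Hypothetical reconstruction of the source behind the reduced IR.
// Only the shape (four floats, pointer + count in, pointer out) comes from the IR.
struct Vec {
    float x, y, z, q;
};

// Sums `count` input Vecs component by component and stores the totals in *out.
// With count <= 0 the loop is skipped and *out receives all zeros,
// matching the phi nodes in block 7 of the IR.
void SumVecs(const Vec* in, int count, Vec* out) {
    float x = 0.0f, y = 0.0f, z = 0.0f, q = 0.0f;
    for (int i = 0; i < count; ++i) {
        x += in[i].x;  // each += becomes one `fadd` (with `reassoc` under -ffast-math)
        y += in[i].y;
        z += in[i].z;
        q += in[i].q;
    }
    out->x = x;
    out->y = y;
    out->z = z;
    out->q = q;
}
```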
As best I can tell, the `reassoc` flag permits the "Loop Vectorizer" pass to
vectorize the loop early on in a per-component manner (e.g. 4 Xs together, 4 Ys
together, 4 Zs, 4 Qs per iteration), which prevents the clean per-Vec
vectorization (e.g. adding the 4 consecutive components of a single Vec to the
components of another Vec) that the "SLP Vectorizer" would otherwise perform
later down the line.
Without the `reassoc` flag (i.e. without -ffast-math) this "optimization" fails
and the "SLP Vectorizer" can proceed cleanly, resulting in less shuffling and
faster code.
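The two vectorization shapes described above can be modelled in plain scalar C++. This is only an illustrative sketch under my own assumptions: the names `SumPerVec` and `SumPerComponent` and the field names are hypothetical, and the code mimics the order of additions, not the actual instructions either pass emits.

```cpp
#include <cstddef>

// Hypothetical layout; field names follow the X/Y/Z/Q wording used above.
struct Vec { float x, y, z, q; };

// Per-Vec shape: one 4-lane accumulator, one whole Vec added per iteration.
// Every lane is still summed in source order, so no reassociation is needed.
Vec SumPerVec(const Vec* in, std::size_t count) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < count; ++i) {
        const float lanes[4] = {in[i].x, in[i].y, in[i].z, in[i].q};
        for (int l = 0; l < 4; ++l) acc[l] += lanes[l];
    }
    return Vec{acc[0], acc[1], acc[2], acc[3]};
}

// Per-component shape: 4 Vecs per iteration, 4 partial sums per component.
// part[c][k] accumulates component c of every Vec whose index is k modulo 4.
// This changes the order of the additions, which is only legal with `reassoc`.
Vec SumPerComponent(const Vec* in, std::size_t count) {
    float part[4][4] = {};
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        for (int k = 0; k < 4; ++k) {
            // Strided reads across 4 Vecs: these model the deinterleaving shuffles.
            part[0][k] += in[i + k].x;
            part[1][k] += in[i + k].y;
            part[2][k] += in[i + k].z;
            part[3][k] += in[i + k].q;
        }
    }
    // Horizontal reduction of the partial sums after the main loop.
    float total[4];
    for (int c = 0; c < 4; ++c) {
        total[c] = (part[c][0] + part[c][1]) + (part[c][2] + part[c][3]);
    }
    // Scalar remainder for counts that are not a multiple of 4.
    for (; i < count; ++i) {
        total[0] += in[i].x;
        total[1] += in[i].y;
        total[2] += in[i].z;
        total[3] += in[i].q;
    }
    return Vec{total[0], total[1], total[2], total[3]};
}
```

Both functions return the same totals when every intermediate sum is exactly representable; with arbitrary floats the last bits may differ, which is why the second shape is only available once reassociation is allowed.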
Recompiling the original example with `-fno-vectorize` (which disables the
"Loop Vectorizer") confirms this observation.
You can see the results after the respective passes here:
Loop Vectorizer with `reassoc`. Notice the
`shufflevector <16 x float> %wide.vec, <16 x float> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>`
that the rest of the pipeline fails to get rid of:
https://llvm.godbolt.org/z/Wvfn6K
SLP Vectorizer without `reassoc`. Notice that this time the "SLP Vectorizer"
correctly figures out that it can combine the loop body into a simple and fast
SIMD form:
https://llvm.godbolt.org/z/ojc4e3
I'm no optimization pipeline expert, so this solution is probably very
suboptimal and maybe even silly (please don't beat me up for it), but it looks
to me that adding one more "SLP Vectorizer" pass before the "Loop Vectorizer"
could catch cases like this.
Besides the certain impact on compilation time, should one expect any unwanted
runtime side effects or regressions from such a solution?