<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - De-optimized vector loop with -ffast-math"

   href="https://bugs.llvm.org/show_bug.cgi?id=47840">47840</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>De-optimized vector loop with -ffast-math

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>martin.pavlicek11@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

<pre>The following C++ snippet gets badly "de-optimized" when compiled with
`clang -O3 -ffast-math`: on an x86_64 target it runs roughly 20% slower than the
same code compiled without `-ffast-math` (about 60% slower for `count` = 10, see
the table below).

<a href="https://llvm.godbolt.org/z/nndY4f">https://llvm.godbolt.org/z/nndY4f</a>

Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores

|    Method  | Count |        Mean |     Error |    StdDev |

|----------- |------ |------------:|----------:|----------:|

| NoFastMath |    10 |    19.89 ns |  0.042 ns |  0.035 ns |

| FastMath   |    10 |    31.68 ns |  0.596 ns |  0.558 ns |

| NoFastMath |   100 |   119.88 ns |  0.808 ns |  0.755 ns |

| FastMath   |   100 |   146.66 ns |  0.638 ns |  0.565 ns |

| NoFastMath |  1000 | 1,103.35 ns |  8.993 ns |  8.412 ns |

| FastMath   |  1000 | 1,331.71 ns |  8.616 ns |  8.060 ns |

 Count  : Value of `count` parameter

 Mean   : Arithmetic mean of all measurements

 Error  : Half of 99.9% confidence interval

 StdDev : Standard deviation of all measurements

The case can be reduced to the following IR. Notice the `reassoc` flag on the
four `fadd` instructions; when you remove the flag from them, performance
returns to normal.

```

; <a href="https://llvm.godbolt.org/z/T9Mcdf">https://llvm.godbolt.org/z/T9Mcdf</a>

%struct.Vec = type { float, float, float, float }

define dso_local void @Funtion(%struct.Vec* nocapture readonly %0, i32 %1, %struct.Vec* nocapture %2) {

  %4 = icmp sgt i32 %1, 0

  br i1 %4, label %5, label %7

5:                                                ; preds = %3

  %6 = zext i32 %1 to i64

  br label %16

7:                                                ; preds = %16, %3

  %8 = phi float [ 0.000000e+00, %3 ], [ %33, %16 ]

  %9 = phi float [ 0.000000e+00, %3 ], [ %32, %16 ]

  %10 = phi float [ 0.000000e+00, %3 ], [ %31, %16 ]

  %11 = phi float [ 0.000000e+00, %3 ], [ %30, %16 ]

  %12 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 0

  store float %11, float* %12

  %13 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 1

  store float %10, float* %13

  %14 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 2

  store float %9, float* %14

  %15 = getelementptr inbounds %struct.Vec, %struct.Vec* %2, i64 0, i32 3

  store float %8, float* %15

  ret void

16:                                               ; preds = %16, %5

  %17 = phi i64 [ 0, %5 ], [ %34, %16 ]

  %18 = phi float [ 0.000000e+00, %5 ], [ %30, %16 ]

  %19 = phi float [ 0.000000e+00, %5 ], [ %31, %16 ]

  %20 = phi float [ 0.000000e+00, %5 ], [ %32, %16 ]

  %21 = phi float [ 0.000000e+00, %5 ], [ %33, %16 ]

  %22 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 0

  %23 = load float, float* %22

  %24 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 1

  %25 = load float, float* %24

  %26 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 2

  %27 = load float, float* %26

  %28 = getelementptr inbounds %struct.Vec, %struct.Vec* %0, i64 %17, i32 3

  %29 = load float, float* %28

  %30 = fadd reassoc float %18, %23

  %31 = fadd reassoc float %19, %25

  %32 = fadd reassoc float %20, %27

  %33 = fadd reassoc float %21, %29

  %34 = add nuw nsw i64 %17, 1

  %35 = icmp eq i64 %34, %6

  br i1 %35, label %7, label %16

}

```
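For readers who prefer source over IR, the reduced IR above corresponds roughly
to the C++ below. This is a sketch reconstructed from the IR, not the exact code
behind the godbolt link; the function name `SumVecs`, the parameter names and
the field names `x`, `y`, `z`, `q` are assumptions.

```cpp
// Four consecutive floats, matching %struct.Vec in the IR.
struct Vec { float x, y, z, q; };

// Hypothetical name; the IR calls this function @Funtion.
// Sums `count` Vecs component-wise and stores the result in *out.
void SumVecs(const Vec* in, int count, Vec* out) {
  float x = 0.0f, y = 0.0f, z = 0.0f, q = 0.0f;
  for (int i = 0; i < count; ++i) {
    x += in[i].x;  // each of these four adds becomes one `fadd`;
    y += in[i].y;  // with -ffast-math they carry fast-math flags
    z += in[i].z;  // that include `reassoc`
    q += in[i].q;
  }
  out->x = x;
  out->y = y;
  out->z = z;
  out->q = q;
}
```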

As far as I can tell, the `reassoc` flag permits the "Loop Vectorizer" pass to
vectorize the loop early on in a per-component manner (i.e. 4 Xs together, 4 Ys
together, 4 Zs together and 4 Qs together per iteration). That prevents the
clean per-Vec vectorization (i.e. adding the 4 consecutive components of one Vec
to the 4 components of another Vec) that the "SLP Vectorizer" would otherwise
perform later in the pipeline.
Without the `reassoc` flag (i.e. without -ffast-math) the Loop Vectorizer cannot
apply this "optimization", so the "SLP Vectorizer" can proceed cleanly,
resulting in fewer shuffles and faster code.
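The two vectorization shapes can be sketched in plain scalar C++, with small
arrays standing in for 4-lane SIMD registers. This only illustrates the data
access patterns and is not what either pass literally emits; the names
`SumPerVec` and `SumPerComponent` and the field names are assumptions.

```cpp
struct Vec { float x, y, z, q; };

// Per-Vec shape (the SLP Vectorizer result): one 4-lane accumulator and one
// contiguous 4-float load per Vec. The summation order is unchanged.
void SumPerVec(const Vec* in, int count, Vec* out) {
  float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (int i = 0; i < count; ++i) {
    acc[0] += in[i].x;
    acc[1] += in[i].y;
    acc[2] += in[i].z;
    acc[3] += in[i].q;
  }
  *out = Vec{acc[0], acc[1], acc[2], acc[3]};
}

// Per-component shape (the Loop Vectorizer result): four Vecs per iteration,
// so each accumulator holds 4 partial sums of ONE component. Lane j reads
// from in[i + j], a stride-4 access that needs shuffles, and the partial
// sums are only combined after the loop. That reordering of float additions
// is what `reassoc` permits.
void SumPerComponent(const Vec* in, int count, Vec* out) {
  float xs[4] = {}, ys[4] = {}, zs[4] = {}, qs[4] = {};
  int i = 0;
  for (; i + 4 <= count; i += 4) {
    for (int j = 0; j < 4; ++j) {
      xs[j] += in[i + j].x;
      ys[j] += in[i + j].y;
      zs[j] += in[i + j].z;
      qs[j] += in[i + j].q;
    }
  }
  // Horizontal reduction of the partial sums.
  float x = (xs[0] + xs[1]) + (xs[2] + xs[3]);
  float y = (ys[0] + ys[1]) + (ys[2] + ys[3]);
  float z = (zs[0] + zs[1]) + (zs[2] + zs[3]);
  float q = (qs[0] + qs[1]) + (qs[2] + qs[3]);
  // Scalar remainder for counts that are not a multiple of 4.
  for (; i < count; ++i) {
    x += in[i].x;
    y += in[i].y;
    z += in[i].z;
    q += in[i].q;
  }
  *out = Vec{x, y, z, q};
}
```

With small integer-valued inputs both functions return identical sums. With
general floats the results can differ in the last bits, which is why the second
shape is only legal when reassociation is allowed.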

Recompiling the original example with `-fno-vectorize`, which disables the Loop
Vectorizer but leaves the SLP Vectorizer enabled, confirms this observation.
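The same check can also be made for a single loop, without changing the command
line, using Clang's `#pragma clang loop` hints. A minimal sketch, assuming the
source looks like the reduced IR (the function name `SumVecsNoLoopVec` and the
field names are hypothetical); this variant is not part of the measurements
above.

```cpp
struct Vec { float x, y, z, q; };

// Same reduction as in the report, but with the Loop Vectorizer asked to
// leave this one loop alone. The SLP Vectorizer is not affected by the hint.
void SumVecsNoLoopVec(const Vec* in, int count, Vec* out) {
  float x = 0.0f, y = 0.0f, z = 0.0f, q = 0.0f;
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#endif
  for (int i = 0; i < count; ++i) {
    x += in[i].x;
    y += in[i].y;
    z += in[i].z;
    q += in[i].q;
  }
  out->x = x;
  out->y = y;
  out->z = z;
  out->q = q;
}
```

The pragma is guarded so that other compilers simply ignore it; the hint only
changes which optimization is applied, not the computed result.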

You can see the results after the respective passes here:
 Loop Vectorizer with `reassoc`. Notice the
 `shufflevector <16 x float> %wide.vec, <16 x float> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>`
 that the rest of the pipeline fails to get rid of:

 <a href="https://llvm.godbolt.org/z/Wvfn6K">https://llvm.godbolt.org/z/Wvfn6K</a>

 SLP Vectorizer without `reassoc`. Notice that this time the "SLP Vectorizer"
 correctly figures out that it can combine the loop body into a simple and fast
 SIMD form:

 <a href="https://llvm.godbolt.org/z/ojc4e3">https://llvm.godbolt.org/z/ojc4e3</a>

I'm no optimization pipeline expert, so this suggestion is probably far from
optimal and may even be silly (please bear with me), but it looks to me that
adding one more "SLP Vectorizer" pass before the "Loop Vectorizer" could catch
cases like this.
Besides some impact on compilation time, should one expect any unwanted runtime
side effects or regressions from such a solution?</pre>

        </div>

      </p>


    </body>

</html>