<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - operating on unknown FP operands is a bad idea"

   href="http://llvm.org/bugs/show_bug.cgi?id=20358">20358</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>operating on unknown FP operands is a bad idea

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Transformation Utilities

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>spatel+llvm@rotateright.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvmbugs@cs.uiuc.edu

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>In <a class="bz_bug_link 

          bz_status_RESOLVED  bz_closed"

   title="RESOLVED FIXED - wrong code (FP exception) at -O3 on x86_64-linux-gnu (bad instcombine)"

   href="show_bug.cgi?id=20059">bug 20059</a>, I argued that this transformation in InstCombine shouldn't be

used on FP vectors:

  // If both arguments of binary operation are shuffles, which use the same

  // mask and shuffle within a single vector, it is worthwhile to move the

  // shuffle after binary operation:

  //   Op(shuffle(v1, m), shuffle(v2, m)) -> shuffle(Op(v1, v2), m)

I was worried about FP exceptions, but Hal Finkel pointed out that Clang

doesn't support messing with FP exception state, so we don't have to care about

those (<a href="http://reviews.llvm.org/D4424">http://reviews.llvm.org/D4424</a>).

We do, however, still need to think about denormals and their performance cost.

Here's a test case to illustrate my point:

$ cat splat_opt_is_bad.c

#include <xmmintrin.h>

#include <float.h>

#define ITERATIONS (200 * 1000 * 1000)

#define MY_DENORM ( 1.0e-39 )

__m128 splat_mul(__m128 a, __m128 b) {

        a = _mm_shuffle_ps(a, a, 0); // splat the 0 element of a

        b = _mm_shuffle_ps(b, b, 0); // splat the 0 element of b

        a = _mm_mul_ps(a, b);

        return a;

}

int main() {

    unsigned int i;

    float scalar;

    __m128 ones = { 1.0f, 1.0f, 1.0f, 1.0f };

    __m128 known_unknowns = { 1.0f, MY_DENORM, MY_DENORM, MY_DENORM };

    for (i=0; i<ITERATIONS; i++) {

        ones = splat_mul(ones, known_unknowns);

    }

    _mm_store_ss(&scalar, ones); // try to make sure we don't optimize away everything

    return scalar;

}
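
As a sanity check on the constant used above, here's a minimal sketch (not part

of the test case itself) that confirms 1.0e-39 falls in the denormal range once

it is narrowed to float. The helper name is_float_denormal is made up for

illustration; fpclassify, FP_SUBNORMAL, and FLT_MIN are standard C99.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: returns nonzero if x is a denormal (subnormal) float. */
static int is_float_denormal(float x) {
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Same value as MY_DENORM in the test case, narrowed to float.
 * It is below FLT_MIN (the smallest normal float) but not zero. */
static float my_denorm_as_float(void) {
    return (float)1.0e-39;
}
```

Calling is_float_denormal(my_denorm_as_float()) should report a denormal, while

1.0f and FLT_MIN should not.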

------------------------------------------

Or if you prefer LLVM IR (and this should isolate the perf difference):

$ cat splat_opt_is_bad.ll

define <4 x float> @splat_mul(<4 x float> %a, <4 x float> %b) {

; instcombine will change this to mul then shuffle

  %asplat = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32>

zeroinitializer

  %bsplat = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32>

zeroinitializer

  %mul = fmul <4 x float> %asplat, %bsplat

  ret <4 x float> %mul

}

define i32 @main() {

entry:

  br label %for.body

for.body:

  %i = phi i32 [ 0, %entry ], [ %inc, %for.body ]

  %ones = phi <4 x float> [ <float 1.000000e+00, float 1.000000e+00, float

1.000000e+00, float 1.000000e+00>, %entry ], [ %call, %for.body ]

  %call = tail call <4 x float> @splat_mul(<4 x float> %ones, <4 x float>

<float 1.000000e+00, float 0x37D5C73000000000, float 0x37D5C73000000000, float

0x37D5C73000000000>)

  %inc = add i32 %i, 1

  %exitcond = icmp eq i32 %inc, 200000000

  br i1 %exitcond, label %for.end, label %for.body

for.end:

  %vecext = extractelement <4 x float> %call, i32 0

  %conv = fptosi float %vecext to i32

  ret i32 %conv

}

---------------------------------------------

Testing on an Intel Sandy Bridge:

$ ./opt splat_opt_is_bad.ll | ./llc | ./clang -x assembler -

$ time ./a.out 

real    0m0.529s

user    0m0.490s

sys    0m0.003s

$ ./opt -instcombine splat_opt_is_bad.ll | ./llc | ./clang -x assembler -

$ time ./a.out 

real    0m9.443s

user    0m9.408s

sys    0m0.006s

----------------------------------------------

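So we're getting a ~19x slowdown (going by user time) because we're operating

on denorms when we shouldn't be: after the transform, the multiply runs on the

unshuffled vectors, so lanes 1-3 multiply denorms even though only lane 0 is

ever used.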

Different CPUs will vary on that penalty (Intel HW is notoriously bad), but

that's a very high price to potentially pay for removing a single vector

shuffle instruction.</pre>
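        <pre>To make the lane argument concrete, here's a rough model in plain C (no
intrinsics). This is only a sketch of the two orderings, not the actual
InstCombine code; the names splat0, mul4, denormal_lanes, shuffle_then_mul, and
mul_then_shuffle are all made up for illustration. Each ordering returns how
many lanes of its multiply read or produce a denormal.

```c
#include <math.h>
#include <stddef.h>

#define LANES 4

/* Splat lane 0 across all lanes (a shuffle with an all-zero mask). */
static void splat0(const float in[LANES], float out[LANES]) {
    for (size_t i = 0; i < LANES; i++)
        out[i] = in[0];
}

/* Lane-wise multiply. */
static void mul4(const float a[LANES], const float b[LANES], float out[LANES]) {
    for (size_t i = 0; i < LANES; i++)
        out[i] = a[i] * b[i];
}

/* Count lanes where the multiply reads or produces a denormal. */
static int denormal_lanes(const float a[LANES], const float b[LANES]) {
    int count = 0;
    for (size_t i = 0; i < LANES; i++) {
        float p = a[i] * b[i];
        if (fpclassify(a[i]) == FP_SUBNORMAL ||
            fpclassify(b[i]) == FP_SUBNORMAL ||
            fpclassify(p) == FP_SUBNORMAL)
            count++;
    }
    return count;
}

/* Original order: Op(shuffle(a, m), shuffle(b, m)). */
static int shuffle_then_mul(const float a[LANES], const float b[LANES],
                            float out[LANES]) {
    float as[LANES], bs[LANES];
    splat0(a, as);
    splat0(b, bs);
    mul4(as, bs, out);
    return denormal_lanes(as, bs);
}

/* Transformed order: shuffle(Op(a, b), m). */
static int mul_then_shuffle(const float a[LANES], const float b[LANES],
                            float out[LANES]) {
    float prod[LANES];
    mul4(a, b, prod);
    splat0(prod, out);
    return denormal_lanes(a, b);
}
```

With a = {1, 1, 1, 1} and b = {1, 1.0e-39, 1.0e-39, 1.0e-39} as in the test
case, both orderings produce the same splatted result, but only
mul_then_shuffle feeds denormals to the multiply (in lanes 1 through 3).</pre>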

        </div>

      </p>


    </body>

</html>