[LLVMbugs] [Bug 20358] New: operating on unknown FP operands is a bad idea

bugzilla-daemon at llvm.org bugzilla-daemon at llvm.org
Fri Jul 18 08:49:09 PDT 2014


            Bug ID: 20358
           Summary: operating on unknown FP operands is a bad idea
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: Transformation Utilities
          Assignee: unassignedbugs at nondot.org
          Reporter: spatel+llvm at rotateright.com
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

In bug 20059, I argued that this transformation in InstCombine shouldn't be
used on FP vectors:

  // If both arguments of binary operation are shuffles, which use the same
  // mask and shuffle within a single vector, it is worthwhile to move the
  // shuffle after binary operation:
  //   Op(shuffle(v1, m), shuffle(v2, m)) -> shuffle(Op(v1, v2), m)

I was worried about FP exceptions, but Hal Finkel pointed out that Clang
doesn't support messing with FP exception state, so we don't have to care about
those (http://reviews.llvm.org/D4424).

We do, however, still need to think about denormals and their performance. 

Here's a test case to illustrate my point:

$ cat splat_opt_is_bad.c
#include <xmmintrin.h>
#include <float.h>

#define ITERATIONS (200 * 1000 * 1000)
#define MY_DENORM ( 1.0e-39 )

__m128 splat_mul(__m128 a, __m128 b) {
        a = _mm_shuffle_ps(a, a, 0); // splat the 0 element of a
        b = _mm_shuffle_ps(b, b, 0); // splat the 0 element of b
        a = _mm_mul_ps(a, b);
        return a;
}

int main() {
    unsigned int i;
    float scalar;

    __m128 ones = { 1.0f, 1.0f, 1.0f, 1.0f };
    __m128 known_unknowns = { 1.0f, MY_DENORM, MY_DENORM, MY_DENORM };

    for (i=0; i<ITERATIONS; i++) {
        ones = splat_mul(ones, known_unknowns);
    }

    _mm_store_ss(&scalar, ones); // try to make sure we don't optimize away
    return scalar;
}
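
As a side check on the constant used above: 1.0e-39 is smaller than FLT_MIN (about 1.18e-38), so as a float it is a denormal. A minimal sketch in portable C, where the helper name is_denormal is made up for illustration and is not part of the test case:

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper (not from the bug report): returns 1 if x is a
 * denormal (subnormal) float, 0 otherwise. */
static int is_denormal(float x) {
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Same magnitude as MY_DENORM in the test case, written as a float. */
static const float my_denorm = 1.0e-39f;
```

With default FP settings (no flush-to-zero), is_denormal(my_denorm) should be 1 and is_denormal(1.0f) should be 0. Multiplying my_denorm by 1.0f gives a denormal again, which is the kind of operation the loop above ends up doing in the unused lanes.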

Or if you prefer LLVM IR (and this should isolate the perf difference):

$ cat splat_opt_is_bad.ll
define <4 x float> @splat_mul(<4 x float> %a, <4 x float> %b) {
; instcombine will change this to mul then shuffle
  %asplat = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> zeroinitializer
  %bsplat = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> zeroinitializer
  %mul = fmul <4 x float> %asplat, %bsplat
  ret <4 x float> %mul
}

define i32 @main() {
entry:
  br label %for.body

for.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %ones = phi <4 x float> [ <float 1.000000e+00, float 1.000000e+00, float
1.000000e+00, float 1.000000e+00>, %entry ], [ %call, %for.body ]
  %call = tail call <4 x float> @splat_mul(<4 x float> %ones, <4 x float>
<float 1.000000e+00, float 0x37D5C73000000000, float 0x37D5C73000000000, float
0x37D5C73000000000>)
  %inc = add i32 %i, 1
  %exitcond = icmp eq i32 %inc, 200000000
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  %vecext = extractelement <4 x float> %call, i32 0
  %conv = fptosi float %vecext to i32
  ret i32 %conv
}

Testing on an Intel Sandy Bridge:

$ ./opt splat_opt_is_bad.ll | ./llc | ./clang -x assembler -
$ time ./a.out 

real    0m0.529s
user    0m0.490s
sys    0m0.003s

$ ./opt -instcombine splat_opt_is_bad.ll | ./llc | ./clang -x assembler -
$ time ./a.out 

real    0m9.443s
user    0m9.408s
sys    0m0.006s


So we're getting a ~19x slowdown because we're operating on denorms when we
shouldn't be. 

Different CPUs will vary on that penalty (Intel HW is notoriously bad), but
that's a very high price to potentially pay for removing a single vector
shuffle instruction.
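
To make the difference between the two orderings concrete without intrinsics, here is a rough scalar model in plain C. It is only a sketch: the function names and the "count lanes with a denormal operand" metric are assumptions for illustration, and it does not measure the hardware penalty itself.

```c
#include <math.h>

#define LANES 4

/* Count lanes in which either multiplicand is a denormal. */
static int denormal_operand_lanes(const float a[LANES], const float b[LANES]) {
    int n = 0;
    for (int i = 0; i < LANES; i++)
        if (fpclassify(a[i]) == FP_SUBNORMAL || fpclassify(b[i]) == FP_SUBNORMAL)
            n++;
    return n;
}

/* Splat element 0 across all lanes (a shuffle with an all-zero mask). */
static void splat0(const float in[LANES], float out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = in[0];
}

/* Original order: Op(shuffle(v1, m), shuffle(v2, m)).
 * Returns the number of lanes that multiplied a denormal operand. */
static int shuffle_then_mul(const float a[LANES], const float b[LANES],
                            float out[LANES]) {
    float as[LANES], bs[LANES];
    splat0(a, as);
    splat0(b, bs);
    for (int i = 0; i < LANES; i++)
        out[i] = as[i] * bs[i];
    return denormal_operand_lanes(as, bs);
}

/* Transformed order: shuffle(Op(v1, v2), m).
 * Same result vector, but every original lane gets multiplied first. */
static int mul_then_shuffle(const float a[LANES], const float b[LANES],
                            float out[LANES]) {
    float prod[LANES];
    for (int i = 0; i < LANES; i++)
        prod[i] = a[i] * b[i];
    splat0(prod, out);
    return denormal_operand_lanes(a, b);
}
```

With a = { 1, 1, 1, 1 } and b = { 1, 1.0e-39f, 1.0e-39f, 1.0e-39f } as in the test case, both functions produce a vector of all ones, but the first never touches a denormal while the second multiplies denormal operands in three of the four lanes, and those products are then thrown away by the shuffle.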
