[LLVMbugs] [Bug 20358] New: operating on unknown FP operands is a bad idea
bugzilla-daemon at llvm.org
Fri Jul 18 08:49:09 PDT 2014
http://llvm.org/bugs/show_bug.cgi?id=20358
Bug ID: 20358
Summary: operating on unknown FP operands is a bad idea
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Transformation Utilities
Assignee: unassignedbugs at nondot.org
Reporter: spatel+llvm at rotateright.com
CC: llvmbugs at cs.uiuc.edu
Classification: Unclassified
In bug 20059, I argued that this transformation in InstCombine shouldn't be
used on FP vectors:
// If both arguments of binary operation are shuffles, which use the same
// mask and shuffle within a single vector, it is worthwhile to move the
// shuffle after binary operation:
// Op(shuffle(v1, m), shuffle(v2, m)) -> shuffle(Op(v1, v2), m)
I was worried about FP exceptions, but Hal Finkel pointed out that Clang
doesn't support messing with FP exception state, so we don't have to care about
those (http://reviews.llvm.org/D4424).
We do, however, still need to think about denormals and their performance.
Here's a test case to illustrate my point:
$ cat splat_opt_is_bad.c
#include <xmmintrin.h>
#include <float.h>
#define ITERATIONS (200 * 1000 * 1000)
#define MY_DENORM ( 1.0e-39 )
__m128 splat_mul(__m128 a, __m128 b) {
    a = _mm_shuffle_ps(a, a, 0); // splat the 0 element of a
    b = _mm_shuffle_ps(b, b, 0); // splat the 0 element of b
    a = _mm_mul_ps(a, b);
    return a;
}
int main() {
    unsigned int i;
    float scalar;
    __m128 ones = { 1.0f, 1.0f, 1.0f, 1.0f };
    __m128 known_unknowns = { 1.0f, MY_DENORM, MY_DENORM, MY_DENORM };
    for (i=0; i<ITERATIONS; i++) {
        ones = splat_mul(ones, known_unknowns);
    }
    _mm_store_ss(&scalar, ones); // try to make sure we don't optimize away everything
    return scalar;
}
------------------------------------------
Or if you prefer LLVM IR (and this should isolate the perf difference):
$ cat splat_opt_is_bad.ll
define <4 x float> @splat_mul(<4 x float> %a, <4 x float> %b) {
  ; instcombine will change this to mul then shuffle
  %asplat = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> zeroinitializer
  %bsplat = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> zeroinitializer
  %mul = fmul <4 x float> %asplat, %bsplat
  ret <4 x float> %mul
}
define i32 @main() {
entry:
  br label %for.body
for.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %ones = phi <4 x float> [ <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %entry ], [ %call, %for.body ]
  %call = tail call <4 x float> @splat_mul(<4 x float> %ones, <4 x float> <float 1.000000e+00, float 0x37D5C73000000000, float 0x37D5C73000000000, float 0x37D5C73000000000>)
  %inc = add i32 %i, 1
  %exitcond = icmp eq i32 %inc, 200000000
  br i1 %exitcond, label %for.end, label %for.body
for.end:
  %vecext = extractelement <4 x float> %call, i32 0
  %conv = fptosi float %vecext to i32
  ret i32 %conv
}
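A note on the hex constant in the IR, in case it looks unrelated to MY_DENORM: LLVM writes float constants in hex using the bit pattern of the equivalent double, so 0x37D5C73000000000 should be the same 1.0e-39 value as in the C version. Here is a small C sketch to check that. The helper names are made up for illustration, and it assumes double is IEEE-754 binary64:

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (assumes IEEE-754 binary64). */
static double double_from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* The constant from the IR, narrowed to the float that the fmul operates on. */
static float ir_constant_as_float(void) {
    return (float)double_from_bits(UINT64_C(0x37D5C73000000000));
}

/* A nonzero magnitude below FLT_MIN is a denormal in single precision. */
static int below_flt_min(float x) {
    float m = x < 0.0f ? -x : x;
    return m > 0.0f && m < FLT_MIN;
}
```

Comparing ir_constant_as_float() against (float)1.0e-39 and passing it to below_flt_min() is enough to confirm that the IR and the C test feed the same denormal into lanes 1-3.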
---------------------------------------------
Testing on an Intel Sandy Bridge:
$ ./opt splat_opt_is_bad.ll | ./llc | ./clang -x assembler -
$ time ./a.out
real 0m0.529s
user 0m0.490s
sys 0m0.003s
$ ./opt -instcombine splat_opt_is_bad.ll | ./llc | ./clang -x assembler -
$ time ./a.out
real 0m9.443s
user 0m9.408s
sys 0m0.006s
----------------------------------------------
So we're getting a roughly 18x slowdown in wall-clock time (about 19x in user
time) because the transformed code multiplies denorms in lanes that the
original code never touched.
Different CPUs will vary on that penalty (Intel HW is notoriously bad), but
that's a very high price to potentially pay for removing a single vector
shuffle instruction.
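To make the difference between the two orderings concrete, here is a scalar model in plain C. It is only a sketch: vec4, splat0, mul_count and the two wrapper functions are hypothetical stand-ins for the vector type, the zero-mask shuffle and the fmul, and the counter only records how many lanes involve a denormal. It does not measure any hardware penalty.

```c
#include <float.h>

/* Hypothetical stand-in for <4 x float>. */
typedef struct { float lane[4]; } vec4;

/* A nonzero magnitude below FLT_MIN is a denormal in single precision. */
static int is_denormal(float x) {
    float m = x < 0.0f ? -x : x;
    return m > 0.0f && m < FLT_MIN;
}

/* Shuffle with an all-zero mask: broadcast lane 0 to every lane. */
static vec4 splat0(vec4 v) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = v.lane[0];
    return r;
}

/* Lane-wise multiply; bumps *denorm_lanes once for every lane whose
   inputs or result are denormal. */
static vec4 mul_count(vec4 a, vec4 b, int *denorm_lanes) {
    vec4 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] * b.lane[i];
        if (is_denormal(a.lane[i]) || is_denormal(b.lane[i]) ||
            is_denormal(r.lane[i]))
            ++*denorm_lanes;
    }
    return r;
}

/* Original form: Op(shuffle(v1, m), shuffle(v2, m)) */
static vec4 shuffle_then_mul(vec4 a, vec4 b, int *denorm_lanes) {
    return mul_count(splat0(a), splat0(b), denorm_lanes);
}

/* Transformed form: shuffle(Op(v1, v2), m) */
static vec4 mul_then_shuffle(vec4 a, vec4 b, int *denorm_lanes) {
    return splat0(mul_count(a, b, denorm_lanes));
}
```

With the inputs from the test case (ones in every lane of a, and { 1.0f, denorm, denorm, denorm } in b), both functions return the same vector, but only the transformed form runs the multiply on the three denormal lanes. The shuffle afterwards throws those lanes away, which is too late to avoid whatever the hardware charges for them.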