<html>
<head>
<base href="http://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - operating on unknown FP operands is a bad idea"
href="http://llvm.org/bugs/show_bug.cgi?id=20358">20358</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>operating on unknown FP operands is a bad idea
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Transformation Utilities
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>spatel+llvm@rotateright.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvmbugs@cs.uiuc.edu
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>In <a class="bz_bug_link
bz_status_RESOLVED bz_closed"
title="RESOLVED FIXED - wrong code (FP exception) at -O3 on x86_64-linux-gnu (bad instcombine)"
href="show_bug.cgi?id=20059">bug 20059</a>, I argued that this transformation in InstCombine shouldn't be
used on FP vectors:
// If both arguments of binary operation are shuffles, which use the same
// mask and shuffle within a single vector, it is worthwhile to move the
// shuffle after binary operation:
// Op(shuffle(v1, m), shuffle(v2, m)) -> shuffle(Op(v1, v2), m)
I was worried about FP exceptions, but Hal Finkel pointed out that Clang
doesn't support messing with FP exception state, so we don't have to care about
those (<a href="http://reviews.llvm.org/D4424">http://reviews.llvm.org/D4424</a>).
We do, however, still need to think about denormals and their performance.
Here's a test case to illustrate my point:
$ cat splat_opt_is_bad.c
#include <xmmintrin.h>
#include <float.h>
#define ITERATIONS (200 * 1000 * 1000)
#define MY_DENORM ( 1.0e-39 )
__m128 splat_mul(__m128 a, __m128 b) {
  a = _mm_shuffle_ps(a, a, 0); // splat the 0 element of a
  b = _mm_shuffle_ps(b, b, 0); // splat the 0 element of b
  a = _mm_mul_ps(a, b);
  return a;
}
int main() {
  unsigned int i;
  float scalar;
  __m128 ones = { 1.0f, 1.0f, 1.0f, 1.0f };
  __m128 known_unknowns = { 1.0f, MY_DENORM, MY_DENORM, MY_DENORM };
  for (i=0; i<ITERATIONS; i++) {
    ones = splat_mul(ones, known_unknowns);
  }
  _mm_store_ss(&scalar, ones); // try to make sure we don't optimize away everything
  return scalar;
}
------------------------------------------
Or if you prefer LLVM IR (and this should isolate the perf difference):
$ cat splat_opt_is_bad.ll
define <4 x float> @splat_mul(<4 x float> %a, <4 x float> %b) {
  ; instcombine will change this to mul then shuffle
  %asplat = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> zeroinitializer
  %bsplat = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> zeroinitializer
  %mul = fmul <4 x float> %asplat, %bsplat
  ret <4 x float> %mul
}
define i32 @main() {
entry:
  br label %for.body
for.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %ones = phi <4 x float> [ <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %entry ], [ %call, %for.body ]
  %call = tail call <4 x float> @splat_mul(<4 x float> %ones, <4 x float> <float 1.000000e+00, float 0x37D5C73000000000, float 0x37D5C73000000000, float 0x37D5C73000000000>)
  %inc = add i32 %i, 1
  %exitcond = icmp eq i32 %inc, 200000000
  br i1 %exitcond, label %for.end, label %for.body
for.end:
  %vecext = extractelement <4 x float> %call, i32 0
  %conv = fptosi float %vecext to i32
  ret i32 %conv
}
---------------------------------------------
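For anyone checking the constants: here is a small sketch (not part of the test
case; the helper names are made up for illustration) confirming that MY_DENORM
really is a denormal once it is narrowed to float, and that it is the same
value as the 0x37D5C73000000000 constant in the IR. It assumes IEEE-754
float/double and relies on LLVM printing float constants as the bits of the
equivalent double.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

#define MY_DENORM ( 1.0e-39 )

/* Hypothetical helper: nonzero if x is a denormal (subnormal) float,
   i.e. nonzero but smaller in magnitude than FLT_MIN. */
int is_float_denorm(float x) {
  float m = x < 0.0f ? -x : x;
  return m > 0.0f && m < FLT_MIN;
}

/* Hypothetical helper: the bits of x widened to double. Widening a float
   to double is exact, so this is the hex form the IR uses for x. */
uint64_t ir_hex_of_float(float x) {
  double d = (double)x;
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  return bits;
}
```

With these, is_float_denorm((float)MY_DENORM) is nonzero, is_float_denorm(1.0f)
is zero, and ir_hex_of_float((float)MY_DENORM) is 0x37D5C73000000000.
---------------------------------------------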
Testing on an Intel Sandy Bridge:
$ ./opt splat_opt_is_bad.ll | ./llc | ./clang -x assembler -
$ time ./a.out
real 0m0.529s
user 0m0.490s
sys 0m0.003s
$ ./opt -instcombine splat_opt_is_bad.ll | ./llc | ./clang -x assembler -
$ time ./a.out
real 0m9.443s
user 0m9.408s
sys 0m0.006s
----------------------------------------------
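To see where the extra work comes from, here is a rough scalar model of the two
forms (a sketch only: plain arrays stand in for the vectors, and all of the
function names are made up for illustration). It counts the lanes in which the
multiply sees a denormal input or produces a denormal result, assuming the
default FP environment (no flush-to-zero).

```c
#include <float.h>

#define LANES 4
#define MY_DENORM ( 1.0e-39 )

/* Nonzero if x is a denormal (subnormal) float. */
int is_denorm(float x) {
  float m = x < 0.0f ? -x : x;
  return m > 0.0f && m < FLT_MIN;
}

/* Lane-wise multiply; returns how many lanes touched a denormal
   (as either input or as the result). */
int mul_lanes(const float *a, const float *b, float *out) {
  int hits = 0;
  for (int i = 0; i < LANES; i++) {
    out[i] = a[i] * b[i];
    if (is_denorm(a[i]) || is_denorm(b[i]) || is_denorm(out[i]))
      hits++;
  }
  return hits;
}

/* Broadcast lane 0 of v to every lane of out. */
void splat0(const float *v, float *out) {
  for (int i = 0; i < LANES; i++)
    out[i] = v[0];
}

/* Original form: Op(shuffle(v1, m), shuffle(v2, m)) */
int shuffle_then_mul(const float *a, const float *b, float *out) {
  float as[LANES], bs[LANES];
  splat0(a, as);
  splat0(b, bs);
  return mul_lanes(as, bs, out);
}

/* Transformed form: shuffle(Op(v1, v2), m) */
int mul_then_shuffle(const float *a, const float *b, float *out) {
  float prod[LANES];
  int hits = mul_lanes(a, b, prod);
  splat0(prod, out);
  return hits;
}

/* Denormal-touching lanes per call, using the operands from the test case. */
int denorm_lanes(int transformed) {
  const float ones[LANES] = { 1.0f, 1.0f, 1.0f, 1.0f };
  const float known_unknowns[LANES] =
    { 1.0f, (float)MY_DENORM, (float)MY_DENORM, (float)MY_DENORM };
  float out[LANES];
  return transformed ? mul_then_shuffle(ones, known_unknowns, out)
                     : shuffle_then_mul(ones, known_unknowns, out);
}
```

With the operands from the test case, denorm_lanes(0) is 0 and denorm_lanes(1)
is 3: both forms produce the same vector, but only the transformed one ever
multiplies a denormal.
----------------------------------------------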
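So we're getting a ~19x slowdown (user time) because, after the transform, the
multiply runs on the raw inputs, including the three lanes that hold denorms
and that the original code never multiplied.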
Different CPUs will vary on that penalty (Intel HW is notoriously bad), but
that's a very high price to potentially pay for removing a single vector
shuffle instruction.</pre>
</div>
</p>
</body>
</html>