<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [X86] float/double -> unsigned long conversion slow when inputs are predictable"
href="https://llvm.org/bugs/show_bug.cgi?id=31602">31602</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[X86] float/double -> unsigned long conversion slow when inputs are predictable
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>mkuper@google.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>SSE and AVX (up until AVX512) don't have instructions that convert from FP
(either float or double) to unsigned long. So these conversions have to be
emulated using FP -> signed long conversions.
GCC lowers this:
unsigned long foo(double x) {
    return x;
}
as:
foo(double):
        movsd   .LC0(%rip), %xmm1
        ucomisd %xmm1, %xmm0
        jnb     .L2
        cvttsd2siq      %xmm0, %rax
        ret
.L2:
        subsd   %xmm1, %xmm0
        movabsq $-9223372036854775808, %rdx
        cvttsd2siq      %xmm0, %rax
        xorq    %rdx, %rax
        ret
.LC0:
        .long   0
        .long   1138753536
That is: check whether the value fits in the signed range, and if not, subtract
2^63 to force it into range, convert, and correct the result by flipping the
top bit.
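The same logic can be sketched in C. This is only an illustration of the
branchy lowering, not GCC's implementation: the function name fptoui_branchy
and the constant two63 are made up here, and it assumes an LP64 target where
long and unsigned long are 64 bits. Negative and NaN inputs are left undefined,
as they are for the original cast.

```c
/* Hypothetical helper: double -> unsigned long using only signed conversions. */
unsigned long fptoui_branchy(double x) {
    const double two63 = 0x1p63; /* 2^63, the first value outside the signed range */
    if (x >= two63) {
        /* Out of signed range: shift down by 2^63, convert, flip the top bit back. */
        return (unsigned long)(long)(x - two63) ^ 0x8000000000000000UL;
    }
    /* Already in signed range: a single signed conversion is enough. */
    return (unsigned long)(long)x;
}
```

When every input is below 2^63, only the last return executes, which is the
well-predicted fast path.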
What we do, on the other hand, is:
.LCPI0_0:
        .quad   4890909195324358656     # double 9.2233720368547758E+18
foo(double):
        movsd   .LCPI0_0(%rip), %xmm1
        movapd  %xmm0, %xmm2
        subsd   %xmm1, %xmm2
        cvttsd2si       %xmm2, %rax
        movabsq $-9223372036854775808, %rcx # imm = 0x8000000000000000
        xorq    %rax, %rcx
        cvttsd2si       %xmm0, %rax
        ucomisd %xmm1, %xmm0
        cmovaeq %rcx, %rax
        retq
This is basically an if-converted version of the GCC code: both conversions
always execute, and a cmov selects between the two results.
Since cvttsd2si has a fairly long latency, the GCC version is much faster when
the branch is well-predicted, and slower when it's not.
But it seems like in most cases this branch should be well-predicted - e.g. if
all inputs are "small", and actually fit into the signed range.
A few additional notes:
1) Our current version is problematic in the presence of FP exceptions, see
PR17686.
2) I tried playing around with selecting on the input instead of the output,
but that doesn't really improve the situation, since we then need to adjust the
sign bit of the output of one of the converts.
There are two options here - (1) adjusting and selecting again between the
original and the adjusted version, or (2) fudging the adjustment so that it's a
nop for the right convert. ICC generates code which is basically (2). This
avoids the problem in PR17686, but both options appear to be even slower than
what we have now.</pre>
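<pre>One possible reading of option (2), sketched in C. This is an illustration
only, not ICC's actual output: the function name fptoui_input_select and the
locals big, bias and fix are made up here, and an LP64 target is assumed. The
select happens on the input, and the sign-bit correction is chosen so that it
is a no-op when the input already fits in the signed range.

```c
/* Hypothetical helper: select on the input, then do one signed conversion. */
unsigned long fptoui_input_select(double x) {
    const double two63 = 0x1p63;          /* 2^63 */
    int big = x >= two63;                 /* does x fall outside the signed range? */
    double bias = big ? two63 : 0.0;      /* amount subtracted before converting */
    unsigned long fix = big ? 0x8000000000000000UL : 0UL; /* xor mask, 0 means no-op */
    /* Only one conversion runs, and its input is always in the signed range. */
    return (unsigned long)(long)(x - bias) ^ fix;
}
```

Because the single conversion never sees an out-of-range value, it should not
raise a spurious invalid-operation exception, but the compare, select, subtract
and convert form one serial dependency chain.</pre>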
</div>
</p>
</body>
</html>