<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - A zexted setcc generates a setcc + movzbl instead of xor + setcc"
href="https://llvm.org/bugs/show_bug.cgi?id=28146">28146</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>A zexted setcc generates a setcc + movzbl instead of xor + setcc
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>mkuper@google.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>Consider:
#include &lt;stdio.h&gt;
int main() {
  unsigned x = 0;
  unsigned y = 0;
#pragma nounroll
  for (unsigned i = 0; i &lt; 1000000000; ++i) {
    y += x ^ 13;
    x += ((i + 100) >= 1000) * 3;
  }
  return y;
}
We generate:
        .text
        .globl main
        .p2align 4, 0x90
        .type main,@function
main:
        .cfi_startproc
        xorl %eax, %eax
        movl $100, %ecx
        xorl %edi, %edi
        .p2align 4, 0x90
.LBB0_1:
        movl %edi, %esi
        xorl $13, %esi
        addl %esi, %eax
        cmpl $999, %ecx
        seta %dl                  # <===
        movzbl %dl, %edx          # <===
        leal (%rdx,%rdx,2), %edx
        addl %edx, %edi
        incl %ecx
        cmpl $1000000100, %ecx
        jne .LBB0_1
        retq
.Lfunc_end0:
        .size main, .Lfunc_end0-main
        .cfi_endproc
Instead of:
        .text
        .globl main
        .p2align 4, 0x90
        .type main,@function
main:
        .cfi_startproc
        xorl %eax, %eax
        movl $100, %ecx
        xorl %edi, %edi
        .p2align 4, 0x90
.LBB0_1:
        movl %edi, %esi
        xorl $13, %esi
        addl %esi, %eax
        xorl %edx, %edx           # <===
        cmpl $999, %ecx
        seta %dl                  # <===
        leal (%rdx,%rdx,2), %edx
        addl %edx, %edi
        incl %ecx
        cmpl $1000000100, %ecx
        jne .LBB0_1
        retq
.Lfunc_end0:
        .size main, .Lfunc_end0-main
        .cfi_endproc
The xor encodes smaller than the movzbl (2 bytes versus 3 for these operands),
which in itself is a good reason to generate the former. However, there is a
more surprising performance issue: even though both versions ought to avoid
partial register stalls, the xor idiom turns out to be measurably faster
(about 16% less run time in the measurement below).
On a Haswell machine:
$ bin/clang -O2 ~/llvm/temp/setcc.s -o ~/llvm/temp/setcc.exe && time ~/llvm/temp/setcc.exe
real 0m1.045s
user 0m1.043s
sys 0m0.001s
$ bin/clang -O2 ~/llvm/temp/setcc-faster.s -o ~/llvm/temp/setcc.exe && time ~/llvm/temp/setcc.exe
real 0m0.876s
user 0m0.874s
sys 0m0.002s
Could someone at Intel confirm that this is expected? IACA doesn't show
significant stalling for the slower version, but the stalls exist in practice
(about 15% for the slower version, and this can be increased significantly by
making the dependency chain longer).</pre>
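<pre>For reference, the pattern at issue can be isolated from the loop. The
following is a minimal sketch, not part of the original reproducer; the
function name step is hypothetical. It contains only the comparison whose
boolean result is zero-extended and multiplied by 3, i.e. the computation that
the seta + movzbl (or xor + seta) and leal sequence implements in the listings
above.

```c
/* Hypothetical helper, not from the original report: isolates the
 * zero-extended comparison that the backend lowers to setcc. */
unsigned step(unsigned i) {
  /* The comparison yields 0 or 1; widening that result to 32 bits is the
   * zext that is currently emitted as seta + movzbl. */
  unsigned flag = (i + 100) >= 1000;
  return flag * 3;
}
```

step(899) yields 0 and step(900) yields 3. Compiling a function like this with
clang -O2 -S may be a quicker way to see which sequence is picked, although the
exact output will depend on the compiler revision and surrounding code.</pre>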
</div>
</p>
</body>
</html>