<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - unnecessary 8-bit partial-register usage creates false dependencies."
href="https://bugs.llvm.org/show_bug.cgi?id=34707">34707</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>unnecessary 8-bit partial-register usage creates false dependencies.
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>peter@cordes.ca
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>unsigned long bzhi_l(unsigned long x, unsigned c) {
    return x & ((1UL << c) - 1);
}
// <a href="https://godbolt.org/g/sBEyfd">https://godbolt.org/g/sBEyfd</a>
clang 6.0.0 (trunk 313965) -xc -O3 -m32 -march=haswell (or -march=znver1):
movb 8(%esp), %al
bzhil %eax, 4(%esp), %eax
retl
This is technically correct (because BZHI only looks at the low 8 bits of
src2), but horrible. There is *no* advantage to using an 8-bit load here
instead of a 32-bit load: the code size is the same, but the byte load creates
a false dependency on the old value of eax.
(znver1 definitely doesn't rename partial registers, and Intel Haswell/Skylake
don't rename low8 registers separately from the full register, unlike
Sandybridge or Core2/Nehalem; see
<a href="https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to">https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to</a>).
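
To make the "technically correct" part concrete, here is a minimal C sketch
(a hand-written model, not compiler output) of how 32-bit BZHI treats its index
operand. The names bzhi32_model, index_after_byte_load and
index_after_dword_load are made up for illustration, and unsigned is assumed to
be 32 bits wide, as on x86:

```c
/* Model of 32-bit BZHI: only bits [7:0] of the index register are read. */
static unsigned bzhi32_model(unsigned src, unsigned index_reg)
{
    unsigned n = index_reg & 0xFFu;   /* bits 31:8 of the index are ignored */
    if (n >= 32)
        return src;                   /* index past the top bit: nothing cleared */
    return src & ((1u << n) - 1);     /* clear bits 31..n */
}

/* movb 8(%esp), %al: new low byte, upper 24 bits come from the old EAX. */
static unsigned index_after_byte_load(unsigned old_eax, unsigned c)
{
    return (old_eax & 0xFFFFFF00u) | (c & 0xFFu);
}

/* movl 8(%esp), %eax: the whole register is overwritten. */
static unsigned index_after_dword_load(unsigned old_eax, unsigned c)
{
    (void)old_eax;                    /* result does not depend on old EAX */
    return c;
}
```

Both loads hand BZHI the same low byte, so the result is identical. Only the
byte load makes the new EAX a function of old_eax, which is the false
dependency.
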
On Haswell and Skylake, movb 8(%esp), %al runs at 1 per cycle, as a
micro-fused ALU+load uop that has to merge into the old value of eax. With an
occasional dep-breaking xor %eax,%eax, it bottlenecks on 2 loads per clock
instead.
Clang seems to be very eager to only move 8 bits instead of the full register.
Clang 3.9 fixed this for reg-reg moves (e.g. unsigned shift(unsigned x,
unsigned c) { return x<<c; } without BMI2), but we're still getting 8-bit
loads. On Intel CPUs, MOVZX loads are cheaper than narrow MOV loads because
they avoid the ALU uop that merges into the destination, at the cost of one
extra code byte. AMD CPUs may use an ALU port for MOVZX, but Intel handles it
purely in the load ports.
But anyway, when loading from a 32-bit memory location, it makes no sense to
load only the low 8 bits, unless we have reason to expect that it was written
with separate byte stores and we want to avoid a store-forwarding stall.</pre>
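<pre>The one case where a narrow load is justified can be sketched in C. This is a
hypothetical illustration only: union packed4 and the three helpers are
invented names, unsigned is assumed to be 32 bits wide, and the byte order is
assumed to be little-endian, as on x86. C semantics cannot show the stall
itself, only the access pattern that typically causes it:

```c
/* Hypothetical 4-byte object that gets filled one byte at a time. */
union packed4 {
    unsigned char bytes[4];
    unsigned      word;      /* assumes 32-bit unsigned, as on x86 */
};

/* Four separate byte stores: no single store covers all four bytes. */
static void write_bytewise(volatile union packed4 *p, unsigned v)
{
    for (int i = 0; i < 4; i++)
        p->bytes[i] = (unsigned char)(v >> (8 * i));
}

/* Narrow reload: matches exactly one earlier byte store. */
static unsigned low8_narrow(const volatile union packed4 *p)
{
    return p->bytes[0];
}

/* Wide reload: spans all four byte stores, the pattern that typically
   cannot be store-forwarded. */
static unsigned low8_wide(const volatile union packed4 *p)
{
    return p->word & 0xFFu;  /* low byte is bytes[0] on little-endian x86 */
}
```

Both reloads return the same value. When the memory was last written by a
single 32-bit store (as for a stack argument), the wide reload has no such
penalty, which is the situation in this report.</pre>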
</div>
</p>
</body>
</html>