<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift"
href="https://bugs.llvm.org/show_bug.cgi?id=35047">35047</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift
</td>
</tr>
<tr>
<th>Product</th>
<td>new-bugs
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Keywords</th>
<td>performance
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>new bugs
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>peter@cordes.ca
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>unsigned load_le32(unsigned char *data) {
unsigned le32 = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) |
(data[3]<<24);
return le32;
}
// <a href="https://godbolt.org/g/X8i1pr">https://godbolt.org/g/X8i1pr</a>
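
To pin down the intended semantics (a small check of my own, with values I picked): the result is the little-endian interpretation of the 4 bytes regardless of host endianness.

#include <assert.h>

int main(void) {
    unsigned char buf[4] = {0x01, 0x02, 0x03, 0x04};
    assert(load_le32(buf) == 0x04030201u);  // LE value on any host
    return 0;
}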

clang 6.0.0 (trunk 316311) -O3 -march=haswell -mno-avx:

        movl    (%rdi), %eax
        retq

-O3 -march=haswell (with AVX2):

.LCPI0_0:
        .quad   16                      # 0x10
        .quad   24                      # 0x18
load_le32:                              # @load_le32
        movzbl  (%rdi), %eax
        movzbl  1(%rdi), %ecx
        shll    $8, %ecx
        vpmovzxbq 2(%rdi), %xmm0        # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        orl     %eax, %ecx
        vpsllvq .LCPI0_0(%rip), %xmm0, %xmm0
        vmovd   %xmm0, %edx
        vpextrd $2, %xmm0, %eax
        orl     %edx, %eax
        orl     %ecx, %eax
        retq

So if vpsllvq is available, clang uses it and doesn't notice that it could have
coalesced the loads into one; -fno-vectorize doesn't block this. (And if the
shift counts hadn't lined up this way, it would be vectorized quite poorly:
VPMOVZXBD would have worked, then 4 shifts, and then a horizontal reduction
with OR, using the same pattern as a horizontal sum, e.g. vpunpckhqdq / vpor /
vmovq / rorx $32, %rax, %rdx / or %edx, %eax.)
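
A minimal intrinsics sketch of that shape (my own illustration, not code from
the report; the function name and the memcpy byte-gather are my choices, and
it needs -mavx2 for _mm_sllv_epi32):

#include <immintrin.h>
#include <string.h>

unsigned load_le32_vec(const unsigned char *data) {
    unsigned raw;
    memcpy(&raw, data, 4);                               // the 4 source bytes
    __m128i b  = _mm_cvtsi32_si128((int)raw);            // bytes in lane 0
    __m128i dw = _mm_cvtepu8_epi32(b);                   // PMOVZXBD: 4 bytes -> 4 dwords
    __m128i sh = _mm_sllv_epi32(dw, _mm_setr_epi32(0, 8, 16, 24)); // VPSLLVD
    sh = _mm_or_si128(sh, _mm_unpackhi_epi64(sh, sh));   // OR high qword into low
    sh = _mm_or_si128(sh, _mm_shuffle_epi32(sh, 0x55));  // OR lane 1 into lane 0
    return (unsigned)_mm_cvtsi128_si32(sh);
}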
(And BTW, for Haswell and later, movb 1(%rdi), %al merges into RAX without
stalling at all. It's a single micro-fused load+merge uop, so it's better than
a separate movzx load + OR instruction. See
<a href="https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to">https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to</a>)
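
A possible workaround in the source (a sketch of mine, not from the report):
express the unaligned load with memcpy, which clang and gcc fold into a single
32-bit load. Note this yields the native-endian value, so it matches load_le32
only on little-endian hosts such as x86:

#include <string.h>

unsigned load_native32(const unsigned char *data) {
    unsigned v;
    memcpy(&v, data, sizeof(v));  // compiles to one movl; avoids alignment/aliasing UB
    return v;                     // native byte order: add a byte-swap on big-endian
}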

For comparison, clang 4.0.1 doesn't merge the loads at all.</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>