<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [SSE, AVX]: doesn't use packuswb with 2 different registers"
href="https://bugs.llvm.org/show_bug.cgi?id=34773">34773</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[SSE, AVX]: doesn't use packuswb with 2 different registers
</td>
</tr>
<tr>
<th>Product</th>
<td>new-bugs
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>new bugs
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>peter@cordes.ca
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes) {
    uint8_t *end_dst = dst + bytes;
    do {
        *dst++ = *src++ >> 8;
    } while (dst < end_dst);
}

<a href="https://godbolt.org/g/RoiFnW">https://godbolt.org/g/RoiFnW</a>
clang 6.0.0 (trunk 314277) -O3
.LBB0_4: # =>This Inner Loop Header: Depth=1
movdqu (%rsi,%rcx,2), %xmm0
movdqu 16(%rsi,%rcx,2), %xmm1
psrlw $8, %xmm0
psrlw $8, %xmm1
packuswb %xmm0, %xmm0
packuswb %xmm1, %xmm1
punpcklqdq %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[0]
movdqu %xmm0, (%rdi,%rcx)
(repeated again with +32 / +16 offsets)
addq $32, %rcx
addq $2, %rax
jne .LBB0_4
Those three shuffles can (and should) be a single packuswb %xmm1, %xmm0.

We can see from the -fno-unroll-loops output that it thinks the base case is an
8-byte store, and that it's only using the shift+pack as a stand-in for SSSE3
pshufb (which it does use when available, e.g. with -mssse3 -mtune=skylake):
        # no shifts
        pshufb  %xmm0, %xmm1
        pshufb  %xmm0, %xmm2
        punpcklqdq      %xmm2, %xmm1    # xmm1 = xmm1[0],xmm2[0]

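For reference, a minimal intrinsics sketch (function name is mine) of what each
16 bytes of output should boil down to: two shifts feeding one packuswb with
two *different* source registers, and no extra shuffles.

    #include <immintrin.h>
    #include <stdint.h>

    static inline void pack_high8_ideal16(uint8_t *dst, const uint16_t *src) {
        __m128i a = _mm_loadu_si128((const __m128i *)src);        // words 0..7
        __m128i b = _mm_loadu_si128((const __m128i *)(src + 8));  // words 8..15
        a = _mm_srli_epi16(a, 8);       // high byte of each word, zero-extended
        b = _mm_srli_epi16(b, 8);
        _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(a, b)); // one packuswb
    }
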
Either way, clang's extra shuffles suck everywhere, but suck the most on
Haswell and later, which only have 1 shuffle port. Skylake even has 2-per-clock
shift throughput; if not for the front-end bottleneck, it could execute 2
shifts + 1 shuffle per clock. (There's no load+shift instruction in AVX2, only
in AVX512. And BTW clang auto-vectorizes well for AVX512, with
vpsrlw $8, (%rsi,%rcx,2), %zmm0 / vpmovwb %zmm0, (%rdi,%rcx), which is probably
optimal on Skylake-AVX512.)

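For comparison, a rough intrinsics equivalent of that AVX512BW inner loop
(function name is mine; clang folds the load into vpsrlw and uses the store
form of vpmovwb, which these intrinsics don't spell out explicitly):

    #include <immintrin.h>
    #include <stdint.h>

    // Needs -mavx512bw: 64 bytes of uint16_t in, 32 bytes out per step.
    static inline void pack_high8_avx512_step(uint8_t *dst, const uint16_t *src) {
        __m512i v = _mm512_loadu_si512((const void *)src);
        v = _mm512_srli_epi16(v, 8);                  // high byte of each word
        _mm256_storeu_si256((__m256i *)dst, _mm512_cvtepi16_epi8(v)); // vpmovwb
    }
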
I think the optimal strategy without AVX512 is to replace one of the shifts
with an AND + an unaligned load offset by one byte, so the high bytes are
already sitting in the low byte of each word. Especially with AVX, that lets
the load+ALU fold into one instruction. (See
<a href="https://stackoverflow.com/a/46477080/224132">https://stackoverflow.com/a/46477080/224132</a>
for details. With src 32B-aligned, the offset load will never cache-line
split, and this should be good on all CPUs back to Nehalem or K10, compiled
with or without -mavx.)

    // uint8_t *dst, *src;
    // offset load (bytes 1..16): each word's low byte is a source high byte
    __m128i v0_offset = _mm_loadu_si128((__m128i*)(src + 1));
    __m128i v1        = _mm_loadu_si128((__m128i*)(src + 16));
    __m128i v0 = _mm_and_si128(v0_offset, _mm_set1_epi16(0x00FF)); // words 0..7
    v1 = _mm_srli_epi16(v1, 8);                                    // words 8..15
    __m128i pack = _mm_packus_epi16(v0, v1);
    _mm_storeu_si128((__m128i*)dst, pack);

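Wrapped in a loop, that might look like the following sketch (names are mine;
assumes bytes is a multiple of 16 and src is 32B-aligned so the offset load
never splits a cache line):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    void pack_high8_alignhack(uint8_t *__restrict__ dst,
                              const uint8_t *__restrict__ src, size_t bytes) {
        const __m128i lowbyte = _mm_set1_epi16(0x00FF);
        uint8_t *end_dst = dst + bytes;
        do {
            __m128i v0 = _mm_and_si128(
                _mm_loadu_si128((const __m128i *)(src + 1)), lowbyte);
            __m128i v1 = _mm_srli_epi16(
                _mm_loadu_si128((const __m128i *)(src + 16)), 8);
            _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(v0, v1));
            dst += 16;
            src += 32;   // 32 source bytes (16 uint16_t) -> 16 output bytes
        } while (dst < end_dst);
    }
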
The same idea works for AVX2 256b vectors with one extra VPERMQ at the end to
fix up the in-lane vpackuswb behaviour (like gcc does when auto-vectorizing).

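A sketch of that 256b shape (names are mine; this is the plain two-shift
version rather than the AND trick, roughly what gcc's auto-vectorized loop
does per 64 source bytes):

    #include <immintrin.h>
    #include <stdint.h>

    static inline void pack_high8_avx2_step(uint8_t *dst, const uint16_t *src) {
        __m256i a = _mm256_srli_epi16(
            _mm256_loadu_si256((const __m256i *)src), 8);        // words 0..15
        __m256i b = _mm256_srli_epi16(
            _mm256_loadu_si256((const __m256i *)(src + 16)), 8); // words 16..31
        __m256i packed = _mm256_packus_epi16(a, b);              // in-lane pack
        packed = _mm256_permute4x64_epi64(packed,
                                          _MM_SHUFFLE(3, 1, 2, 0)); // fix qword order
        _mm256_storeu_si256((__m256i *)dst, packed);
    }
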
clang's AVX2 auto-vectorization, by contrast, is *horrible* here. With
-fno-unroll-loops:

        vmovdqu (%rsi,%rax,2), %ymm1
        vpsrlw  $8, %ymm1, %ymm1
        vextracti128    $1, %ymm1, %xmm2
        vpshufb %xmm0, %xmm2, %xmm2
        vpshufb %xmm0, %xmm1, %xmm1
        vpunpcklqdq     %xmm2, %xmm1, %xmm1    # xmm1 = xmm1[0],xmm2[0]
        vmovdqu %xmm1, (%rdi,%rax)

With default loop unrolling, it's basically that but with vinserti128 to feed
256b stores. Because bottlenecking even harder on the shuffle port seems like
a great idea (with -march=skylake)...

(If this isn't a symptom of the same bug as the SSE2 silliness, I guess this
part should be reported separately.)</pre>
</div>
</p>
</body>
</html>