[llvm-bugs] [Bug 34773] New: [SSE, AVX]: doesn't use packuswb with 2 different registers
via llvm-bugs
llvm-bugs at lists.llvm.org
Thu Sep 28 14:49:51 PDT 2017
https://bugs.llvm.org/show_bug.cgi?id=34773
Bug ID: 34773
Summary: [SSE, AVX]: doesn't use packuswb with 2 different
registers
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: peter at cordes.ca
CC: llvm-bugs at lists.llvm.org
#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes) {
    uint8_t *end_dst = dst + bytes;
    do {
        *dst++ = *src++ >> 8;   // keep the high byte of each 16-bit element
    } while (dst < end_dst);
}
https://godbolt.org/g/RoiFnW
clang 6.0.0 (trunk 314277) -O3
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        movdqu     (%rsi,%rcx,2), %xmm0
        movdqu     16(%rsi,%rcx,2), %xmm1
        psrlw      $8, %xmm0
        psrlw      $8, %xmm1
        packuswb   %xmm0, %xmm0
        packuswb   %xmm1, %xmm1
        punpcklqdq %xmm1, %xmm0         # xmm0 = xmm0[0],xmm1[0]
        movdqu     %xmm0, (%rdi,%rcx)
        (repeated again with +32 / +16 offsets)
        addq       $32, %rcx
        addq       $2, %rax
        jne        .LBB0_4
Those three shuffles can (and should) be a single packuswb %xmm1, %xmm0.
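Per 16 output bytes, the loop body only needs something like this (a minimal intrinsics
sketch, assuming SSE2; the helper name is just for illustration):

#include <immintrin.h>
#include <stdint.h>
void pack_high8_step(uint8_t *dst, const uint16_t *src) {
    __m128i v0 = _mm_srli_epi16(_mm_loadu_si128((const __m128i*)src), 8);        // high bytes of elements 0..7
    __m128i v1 = _mm_srli_epi16(_mm_loadu_si128((const __m128i*)(src + 8)), 8);  // high bytes of elements 8..15
    _mm_storeu_si128((__m128i*)dst, _mm_packus_epi16(v0, v1));   // one packuswb %xmm1, %xmm0
}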
We can see from the -fno-unroll-loops output that it thinks the base case is an
8-byte store, and that it's only using the shift+pack as a stand-in for SSSE3
pshufb (which it uses when available, e.g. -mssse3 -mtune=skylake):
        # no shifts
        pshufb     %xmm0, %xmm1
        pshufb     %xmm0, %xmm2
        punpcklqdq %xmm2, %xmm1         # xmm1 = xmm1[0],xmm2[0]
This sucks everywhere, but sucks the most on Haswell and later, which only have
1 shuffle port. Skylake even has 2-per-clock shift throughput; if not for the
front-end bottleneck, it could execute 2 shifts + 1 shuffle per clock. (There's
no load+shift instruction in AVX2, only in AVX512.) And BTW, clang auto-vectorizes
well for AVX512, with vpsrlw $8, (%rsi,%rcx,2), %zmm0 / vpmovwb %zmm0, (%rdi,%rcx),
which is probably optimal on Skylake-AVX512.
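In intrinsics, that AVX512 pattern is just the following (a minimal sketch, assuming
AVX512BW and uint8_t pointers into the raw data; the helper name is just for
illustration; clang's actual output folds the load into vpsrlw and uses the
memory-destination form of vpmovwb):

#include <immintrin.h>
#include <stdint.h>
void pack_high8_avx512_step(uint8_t *dst, const uint8_t *src) {
    __m512i v = _mm512_loadu_si512((const void*)src);             // 32 uint16_t elements
    v = _mm512_srli_epi16(v, 8);                                  // high byte of each word -> low byte
    _mm256_storeu_si256((__m256i*)dst, _mm512_cvtepi16_epi8(v));  // vpmovwb: narrow words to bytes
}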
I think the optimal strategy without AVX512 is to replace one of the shifts with an AND
plus an unaligned load offset by one byte. Especially with AVX, that lets the load+ALU
fold into one instruction (see https://stackoverflow.com/a/46477080/224132 for details).
With src 32B-aligned, the offset load will never cache-line split, and it should be good
on all CPUs back to Nehalem or K10, compiled with or without -mavx:
// uint8_t *dst, *src;   one 16-byte output per iteration, src 32B-aligned
__m128i v0_offset = _mm_loadu_si128((__m128i*)(src + 1));       // elements 0..7, loaded 1 byte late so the
                                                                // high bytes land in the low byte of each word
__m128i v1 = _mm_loadu_si128((__m128i*)(src + 16));             // elements 8..15
__m128i v0 = _mm_and_si128(v0_offset, _mm_set1_epi16(0x00FF));  // keep the high byte of each element
v1 = _mm_srli_epi16(v1, 8);                                     // high byte -> low byte of each word
__m128i pack = _mm_packus_epi16(v0, v1);                        // one packuswb produces all 16 result bytes
_mm_storeu_si128((__m128i*)dst, pack);
This works for AVX2 256b vectors too, with one extra VPERMQ at the end to fix up
the in-lane vpackuswb behaviour (like gcc emits when auto-vectorizing).
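Something like this (a minimal sketch of that 256b version of the AND + offset-load
trick, assuming AVX2 and uint8_t pointers; the helper name is just for illustration):

#include <immintrin.h>
#include <stdint.h>
void pack_high8_avx2_step(uint8_t *dst, const uint8_t *src) {
    __m256i v0 = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(src + 1)),
                                  _mm256_set1_epi16(0x00FF));      // high bytes of elements 0..15
    __m256i v1 = _mm256_srli_epi16(_mm256_loadu_si256((const __m256i*)(src + 32)), 8);  // elements 16..31
    __m256i pack = _mm256_packus_epi16(v0, v1);                    // in-lane pack: 64-bit chunks out of order
    pack = _mm256_permute4x64_epi64(pack, _MM_SHUFFLE(3,1,2,0));   // one vpermq fixes the order
    _mm256_storeu_si256((__m256i*)dst, pack);
}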
clang's AVX2 auto-vectorization is *horrible* here. With -fno-unroll-loops:
        vmovdqu      (%rsi,%rax,2), %ymm1
        vpsrlw       $8, %ymm1, %ymm1
        vextracti128 $1, %ymm1, %xmm2
        vpshufb      %xmm0, %xmm2, %xmm2
        vpshufb      %xmm0, %xmm1, %xmm1
        vpunpcklqdq  %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
        vmovdqu      %xmm1, (%rdi,%rax)
With default loop unrolling, it's basically that but with vinserti128 to feed
256b stores. Because bottlenecking even harder on the shuffle port seems like
a great idea (with -march=skylake)...
(If this isn't a symptom of the same bug as the SSE2 silliness, I guess this
part should be reported separately.)