[llvm-bugs] [Bug 40720] New: [AVX] extracting high 128 and zero-extending to 256 pessimized from vextracti128 to vperm2f128, even with -march=znver1
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Feb 13 09:10:22 PST 2019
https://bugs.llvm.org/show_bug.cgi?id=40720
Bug ID: 40720
Summary: [AVX] extracting high 128 and zero-extending to 256
pessimized from vextracti128 to vperm2f128, even with
-march=znver1
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Keywords: performance
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: peter at cordes.ca
CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org
vextracti128 is always at least as fast as vperm2i128 on all CPUs, and *much*
better on AMD CPUs that handle 256-bit vectors as two 128-bit halves. If we
can use vextracti128 or f128 in place of any other lane-crossing shuffle, we
should. (Of course, only ever with imm=1; with imm=0, use vmovaps xmm,xmm
instead. I can only assume Intel was already planning for the extension to
512-bit vectors, because the imm8 is otherwise pointless. Or maybe it keeps
decoding simpler by matching vinserti128.)
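(For reference, a sketch of the imm=0 case; this example and the name
low128_zext are mine, not part of the report.  Zero-extending the *low* lane
should compile to a single vmovaps xmm0, xmm0, never an extract or perm2 with
imm=0:)

__m256i low128_zext(__m256i v) {
    // keep the low 128 bits, guarantee the upper 128 bits are zero
    return _mm256_zextsi128_si256(_mm256_castsi256_si128(v));
}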
vextracti128 xmm1, ymm0, 1 zero-extends the high lane of ymm0 into ymm1:
writing an xmm register with a VEX-encoded instruction *always* zero-extends
to the full MAXVL (256 or 512 or whatever the CPU supports).
#include <immintrin.h>

__m256i high128_zext_vextracti128(__m256i v) {
    return _mm256_zextsi128_si256(_mm256_extracti128_si256(v,1));
}
clang 9.0.0 (trunk 353904) -O3 -march=znver1 compiles this to:

    vperm2f128 ymm0, ymm0, ymm0, 129   # ymm0 = ymm0[2,3],zero,zero
https://godbolt.org/z/jMdCp9 (this and other ways of writing it, on
gcc/clang/icc/msvc). MSVC is the only compiler that compiles this efficiently,
exactly as written, without adding a `vmovaps xmm0,xmm0` (as ICC does) or
transforming it into vperm2f128 (as clang does). And GCC doesn't support the
_mm256_zext* cast intrinsics at all :(
Setting the high bit of a nibble in the imm8 zeros the corresponding lane for
vperm2f128 / vperm2i128, so this is valid, but MUCH worse on Ryzen: 8 uops /
3c latency / 3c throughput vs. 1 uop / 1c / 0.33c for
vextracti128 xmm0, ymm0, 1.  On mainstream Intel they perform identically, but
vperm2 is slower than vextract on KNL: an extra cycle of latency and of
reciprocal throughput.
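(For comparison, a sketch of mine, not in the report: asking for the same
zeroing behaviour explicitly through the vperm2i128 intrinsic.  imm8 = 0x81
(129) selects the high lane of the first source into the low lane via
bits [1:0] = 01, and bit 7 zeros the high lane of the result:)

__m256i high128_zext_vperm2(__m256i v) {
    // same result as the zext version above, but 8 uops on Ryzen
    return _mm256_permute2x128_si256(v, v, 0x81);
}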
AFAIK, there's no bypass-delay latency on any current CPUs for using f128
shuffles between integer instructions, but it's an odd choice to use f128
instead of i128 when the source used AVX2 integer intrinsics.
---
The only way I could get clang to emit the extract was with an unsafe cast
that leaves the upper 128 bits undefined instead of zeroed.  This will work in
practice at least 99% of the time, but it leaves the door open for some
optimization where another vector with the same low 128 bits but a non-zero
high 128 is also needed, and this cast CSEs to that.

    return _mm256_castsi128_si256(_mm256_extracti128_si256(v,1));

compiles to:

    vextractf128 xmm0, ymm0, 1
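(To illustrate that CSE hazard, a hypothetical example of mine, not from the
report: here `dup` has the same low 128 bits as `zext` but a non-zero upper
lane, so the compiler could legally reuse its register for the cast result:)

__m256i cse_hazard(__m256i v) {
    __m128i hi128 = _mm256_extracti128_si256(v, 1);
    __m256i dup   = _mm256_broadcastsi128_si256(hi128);  // v's high lane in both lanes
    __m256i zext  = _mm256_castsi128_si256(hi128);       // upper lane undefined, not guaranteed zero
    // If the compiler reuses dup's register for zext, the high lane of this
    // sum becomes 2x v's high lane instead of just v's high lane.
    return _mm256_add_epi64(zext, dup);
}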