[llvm-bugs] [Bug 40720] New: [AVX] extracting high 128 and zero-extending to 256 pessimized from vextracti128 to vperm2f128, even with -march=znver1
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Feb 13 09:10:22 PST 2019
https://bugs.llvm.org/show_bug.cgi?id=40720
Bug ID: 40720
Summary: [AVX] extracting high 128 and zero-extending to 256
pessimized from vextracti128 to vperm2f128, even with
-march=znver1
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Keywords: performance
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: peter at cordes.ca
CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org
vextracti128 is always at least as fast as vperm2i128 on all CPUs, and *much*
better on AMD CPUs that handle 256-bit vectors as two 128-bit halves. If we
can use vextracti128 or f128 in place of any other lane-crossing shuffle, we
should. (Of course, only ever with imm=1; with imm=0, use vmovaps xmm,xmm
instead. I can only assume Intel was already planning for the extension to
512-bit vectors, because the imm8 is otherwise pointless. Or maybe it keeps
decoding simpler by matching vinserti128.)
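(For reference, a sketch of the imm=0 case; this example and the name
low128_zext are mine, not part of the report.  Zero-extending the *low* lane
should compile to a single vmovaps xmm0, xmm0, never an extract or perm2 with
imm=0:)

__m256i low128_zext(__m256i v) {
    // keep the low 128 bits, guarantee the upper 128 bits are zero
    return _mm256_zextsi128_si256(_mm256_castsi256_si128(v));
}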
vextracti128 xmm1, ymm0, 1 zero-extends the high lane of ymm0 into ymm1:
writing an xmm register with a VEX-encoded instruction *always* zero-extends
to the full MAXVL (256 or 512 or whatever the CPU supports).
#include <immintrin.h>

__m256i high128_zext_vextracti128(__m256i v) {
    return _mm256_zextsi128_si256(_mm256_extracti128_si256(v,1));
}
clang 9.0.0 (trunk 353904) -O3 -march=znver1 compiles this to:

    vperm2f128 ymm0, ymm0, ymm0, 129   # ymm0 = ymm0[2,3],zero,zero
https://godbolt.org/z/jMdCp9 (this and other ways of writing it, on
gcc/clang/icc/msvc). MSVC is the only compiler that compiles this efficiently,
exactly as written, without adding a `vmovaps xmm0,xmm0` (as ICC does) or
transforming it into vperm2f128 (as clang does). And GCC doesn't support the
_mm256_zext* cast intrinsics at all :(
Setting the high bit of a nibble in the imm8 zeros the corresponding lane for
vperm2f128 / vperm2i128, so this is valid, but MUCH worse on Ryzen: 8 uops /
3c latency / 3c throughput vs. 1 uop / 1c / 0.33c for
vextracti128 xmm0, ymm0, 1.  On mainstream Intel they perform identically, but
vperm2 is slower than vextract on KNL: an extra cycle of latency and of
reciprocal throughput.
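(For comparison, a sketch of mine, not in the report: asking for the same
zeroing behaviour explicitly through the vperm2i128 intrinsic.  imm8 = 0x81
(129) selects the high lane of the first source into the low lane via
bits [1:0] = 01, and bit 7 zeros the high lane of the result:)

__m256i high128_zext_vperm2(__m256i v) {
    // same result as the zext version above, but 8 uops on Ryzen
    return _mm256_permute2x128_si256(v, v, 0x81);
}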
AFAIK, there's no bypass-delay latency on any current CPUs for using f128
shuffles between integer instructions, but it's an odd choice to use f128
instead of i128 when the source used AVX2 integer intrinsics.
---
The only way I could get clang to emit the extract was with an unsafe cast
that leaves the upper 128 bits undefined instead of zeroed.  This will work in
practice at least 99% of the time, but it leaves the door open for some
optimization where another vector with the same low 128 bits but a non-zero
high 128 is also needed, and this cast CSEs to that.

    return _mm256_castsi128_si256(_mm256_extracti128_si256(v,1));

compiles to:

    vextractf128 xmm0, ymm0, 1
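(To illustrate that CSE hazard, a hypothetical example of mine, not from the
report: here `dup` has the same low 128 bits as `zext` but a non-zero upper
lane, so the compiler could legally reuse its register for the cast result:)

__m256i cse_hazard(__m256i v) {
    __m128i hi128 = _mm256_extracti128_si256(v, 1);
    __m256i dup   = _mm256_broadcastsi128_si256(hi128);  // v's high lane in both lanes
    __m256i zext  = _mm256_castsi128_si256(hi128);       // upper lane undefined, not guaranteed zero
    // If the compiler reuses dup's register for zext, the high lane of this
    // sum becomes 2x v's high lane instead of just v's high lane.
    return _mm256_add_epi64(zext, dup);
}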