[llvm-bugs] [Bug 32862] New: AVX512 _mm512_setzero_ps could save a byte by using a VEX-encoded vxorps xmm instead of EVEX

via llvm-bugs llvm-bugs at lists.llvm.org
Sun Apr 30 16:50:32 PDT 2017


https://bugs.llvm.org/show_bug.cgi?id=32862

            Bug ID: 32862
           Summary: AVX512 _mm512_setzero_ps could save a byte by using a
                    VEX-encoded vxorps xmm instead of EVEX
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

VEX-encoded vxorps %xmm0, %xmm0, %xmm0 does the same thing as EVEX-encoded
vxorps %zmm0, %zmm0, %zmm0, zeroing the full-width vector and breaking
dependencies on the old value of the architectural register.

The VEX version is also shorter: with the 2-byte VEX prefix, vxorps
%xmm0,%xmm0,%xmm0 assembles to 4 bytes, vs. 6 bytes for the EVEX
version (the EVEX prefix is always 4 bytes).

#include <immintrin.h>
__m512 zerovec(){ return _mm512_setzero_ps(); }

This currently compiles to (https://godbolt.org/g/i5LA9Y):
gcc 8 and clang 5.0.0 (trunk r301766):  vxorps  %zmm0, %zmm0, %zmm0
ICC 17:                                 vpxord  %zmm0, %zmm0, %zmm0
MSVC:                                   vxorps  xmm0, xmm0, xmm0
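
For reference, the machine-code encodings (per my reading of the Intel
SDM; worth double-checking with a disassembler):

    c5 f8 57 c0          vxorps %xmm0,%xmm0,%xmm0    # VEX.128: 4 bytes
    c5 fc 57 c0          vxorps %ymm0,%ymm0,%ymm0    # VEX.256: 4 bytes
    62 f1 7c 48 57 c0    vxorps %zmm0,%zmm0,%zmm0    # EVEX.512: 6 bytes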

Always using 128b zeroing instructions wouldn't hurt for AVX/AVX2
targets either, since any VEX-encoded 128b instruction zeroes
everything above bit 127 of the destination register (see the example
below).
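
e.g. the 256b version of the function above (a minimal sketch; the
function name is mine):

    #include <immintrin.h>
    __m256 zerovec256(void) { return _mm256_setzero_ps(); }
    /* could safely compile to:  vxorps %xmm0,%xmm0,%xmm0
     * the VEX.128 op implicitly zeroes ymm0's (and zmm0's) upper bits. */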

I'm not sure whether CPUs like Jaguar or Bulldozer-family (which crack
256b instructions into two 128b ops) handle xor-zeroing specially and
only need one internal operation for vxorps %ymm,%ymm,%ymm zeroing.  If
they don't, using the 128b version would save execution throughput.
(e.g. maybe they crack instructions at decode, but
independence-detection happens later?  That seems unlikely, because the
idiom probably has to decode to a special zeroing micro-op in the first
place.)  A throughput test like the sketch below could settle it.
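
Something like this could measure it (a rough sketch; assumes GCC/Clang
inline asm on x86-64, and that rdtsc reference cycles are good enough
for a throughput ratio):

    /* Rough zeroing-idiom throughput test: 4 independent vxorps per
     * iteration so loop overhead doesn't dominate.  Change ymm to xmm
     * and rebuild to compare.  Build with: cc -O2 -mavx vxorps_tp.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                 /* __rdtsc() */

    int main(void) {
        const long iters = 100000000;
        uint64_t t0 = __rdtsc();
        for (long i = 0; i < iters; i++)
            asm volatile ("vxorps %%ymm0, %%ymm0, %%ymm0\n\t"
                          "vxorps %%ymm1, %%ymm1, %%ymm1\n\t"
                          "vxorps %%ymm2, %%ymm2, %%ymm2\n\t"
                          "vxorps %%ymm3, %%ymm3, %%ymm3"
                          ::: "xmm0", "xmm1", "xmm2", "xmm3");
        uint64_t t1 = __rdtsc();
        printf("%.3f ref cycles per vxorps\n",
               (double)(t1 - t0) / (iters * 4));
        return 0;
    }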

One possible downside to always using the 128b version:  vxorps
%ymm0,%ymm0,%ymm0 warms up the 256b execution units on Intel CPUs like
Skylake, but vxorps %xmm0,%xmm0,%xmm0 doesn't.  As Agner Fog describes
(in http://agner.org/optimize/microarchitecture.pdf), a program can run
a single 256b AVX instruction at least ~56,000 clock cycles ahead of a
critical 256b loop to start the warm-up process, so the full-width
execution units are up to speed by the time the loop starts.

IDK if any existing code uses something like this function to achieve
that:
    #include <immintrin.h>
    __attribute__((noinline)) __m256 warmup_avx256(void) {
        return _mm256_setzero_ps();
    }

If  vxorps %xmm0,%xmm0,%xmm0  is faster on Bulldozer or Jaguar, we
should probably make sure to always use it.  People will just have to
use something else for their warm-up function, like maybe all-ones (see
the sketch below).
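
For example (an untested sketch; assumes the compiler materializes
all-ones with vpcmpeqd %ymm,%ymm,%ymm, the usual idiom in GCC/Clang):

    #include <immintrin.h>
    __attribute__((noinline)) __m256i warmup_avx256_ones(void) {
        /* vpcmpeqd %ymm0,%ymm0,%ymm0 is not a zeroing idiom, so a
         * compiler has no reason to narrow it to 128b. */
        return _mm256_set1_epi32(-1);
    }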

OTOH, during the warm-up period itself, vxorps %xmm0,%xmm0,%xmm0 may be
faster (e.g. at the start of a 256b AVX function, while the 256b
execution units are still asleep).

----

AFAIK, there are no problems with mixing VEX and EVEX vector instructions on
any existing AVX512 hardware (KNL and skylake-avx512).

In asm syntax, I'm not sure there's even a way to request the EVEX
encoding of vaddps %ymm, %ymm, %ymm with no masking or broadcast-load
source operand.  You could easily have that as part of a horizontal add
of a zmm vector, and there aren't EVEX versions of instructions like
vhaddps at all, so some VEX/EVEX mixing is unavoidable anyway.
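
FWIW, sufficiently new GNU binutils may have {vex}/{evex}
pseudo-prefixes for exactly this; a sketch assuming an assembler that
supports them:

    {evex} vaddps %ymm1, %ymm2, %ymm0   # request the 6-byte EVEX encoding
    {vex}  vaddps %ymm1, %ymm2, %ymm0   # request VEX (the default here)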
