[llvm-bugs] [Bug 32368] New: Miscompilation of _mm512_andnot_ps when folding a load (it's not commutative)

Tue Mar 21 22:19:35 PDT 2017

https://bugs.llvm.org/show_bug.cgi?id=32368

            Bug ID: 32368
           Summary: Miscompilation of _mm512_andnot_ps when folding a load
                    (it's not commutative)
           Product: new-bugs
           Version: 3.9
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

clang3.9.1 seems to think vandnps is commutative in some cases.  It's not: it
does AND(NOT(op1), op2) (for both the intrinsic and the intel-syntax asm
instruction).

I can't reproduce this with clang4.0.0, but I'm not sure if clang4.0 sidesteps
it on purpose or by chance.  In any case, wrong-code from the 3.9 branch is a
problem until the fix is backported, so reporting it in case nobody realized
that was needed.  Sorry if this is already in progress.

#include <immintrin.h>
__m512 foo(__m512 a) {
  //__m512 c = set1ps(scalar_c);
  __m512 c =_mm512_castsi512_ps(_mm512_set1_epi32(0x00123));

  __m512 t1 = _mm512_andnot_ps(c, a);
  __m512 t2 = t1 * _mm512_set1_ps(2.0f);
  __m512 t3 = _mm512_and_ps(c, t2);
  return t3;
}

clang++ -O3 -march=skylake-avx512 clangbug.cpp -o- -S -masm=intel -fverbose-asm

.LCPI0_0:
        .long   291                     # float 4.07777853E-43

        vandnps zmm0, zmm0, dword ptr [rip + .LCPI0_0]{1to16}  ## bug here
        vaddps  zmm0, zmm0, zmm0
        vpandd  zmm0, zmm0, dword ptr [rip + .LCPI0_0]{1to16}

Notice that the constant from memory is being used as the second operand to
vandnps here, when it should be the first operand.

---

clang++ 4.0.0 makes an inverted copy of the constant so it can use vandps both
times.  (https://godbolt.org/g/yAJb5g)

.LCPI0_0:
        .long   4294967004              # 0xfffffedc
.LCPI0_1:
        .long   291                     # 0x123
foo(float __vector(16), float):                          # @foo(float
__vector(16), float)
        vandps  zmm0, zmm0, dword ptr [rip + .LCPI0_0]{1to16}
        vaddps  zmm0, zmm0, zmm0
        vandps  zmm0, zmm0, dword ptr [rip + .LCPI0_1]{1to16}
        ret

But it fails to use VANDNPS when it already has the constant in a register, so
it has to waste a broadcast-load outside the loop, and also waste a register. 
(Code for this test-case not included).

----

Also, clang3.9.1's use of vpandd is weird here, since it adds an extra cycle of
latency (bypass delay) on Skylake Xeon vs using vandps.  Knight's Landing has
no penalty for mixing integer boolean ops between FP operations, but no benefit
either.

According to Agner Fog, Skylake still has the bypass-delay penalty for this use
of an integer boolean following an FP add/mul/fma (at least with AVX/AVX2). 
There is no throughput benefit for using integer booleans when tuning for
Skylake, because FP booleans can also run on any of its three vector execution
ports.  (This was a change from Broadwell)

Even for pre-Skylake Intel CPUs (where FP-booleans can only run on port 5), I'd
suggest only using integer booleans when the C/C++ source uses integer
intrinsics, since either choice can be appropriate depending on the use-case. 
Unless you want to model the data-flow / CPU pipeline and use integer booleans
when there are a lot of boolean ops and port5 would be a bottleneck...

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20170322/c1026ce5/attachment.html>