[llvm-bugs] [Bug 51908] New: wrong codegen due to _mm_mpsadbw_epu8 intrinsic incorrectly marked as commutative

via llvm-bugs llvm-bugs at lists.llvm.org
Sun Sep 19 12:47:05 PDT 2021


https://bugs.llvm.org/show_bug.cgi?id=51908

            Bug ID: 51908
           Summary: wrong codegen due to _mm_mpsadbw_epu8 intrinsic
                    incorrectly marked as commutative
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: benjsith at gmail.com
                CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
                    llvm-dev at redking.me.uk, pengfei.wang at intel.com,
                    spatel+llvm at rotateright.com

I came across a case where using the Intel SSE4.1 intrinsic _mm_mpsadbw_epu8
appears to lead to a miscompilation when optimization (-O1) is enabled.

I tried to come up with a minimal repro, as follows:

#include <smmintrin.h>

__m128i do_stuff(const __m128i* iVals) {
        const __m128i I0 = _mm_load_si128(&iVals[0]);
        const __m128i I1 = _mm_load_si128(&iVals[1]);
        const __m128i I2 = _mm_load_si128(&iVals[2]);

        const __m128i A = _mm_mpsadbw_epu8(I0, I2, 0);
        const __m128i B = _mm_add_epi8(I2, I1);
        const __m128i C = _mm_add_epi8(B, A);
        return C;
}

This function will run fine when compiled with -O0, but when using -O1 it gives
incorrect results. The -O1 assembly output is as follows:

do_stuff(long long __vector(2) const*):
        vmovdqa xmm0, xmmword ptr [rdi + 32]
        vmpsadbw        xmm1, xmm0, xmmword ptr [rdi], 0
        vpaddb  xmm0, xmm0, xmmword ptr [rdi + 16]
        vpaddb  xmm0, xmm0, xmm1
        ret

This is mostly correct; however, the vmpsadbw instruction has had its operand
order flipped. It is equivalent to having called
_mm_mpsadbw_epu8(I2, I0, 0)
instead. I believe this is because the intrinsic is marked as commutative in
the LLVM code. In llvm/include/llvm/IR/IntrinsicsX86.td, lines 791-796:

// Vector sum of absolute differences
let TargetPrefix = "x86" in {  // All intrinsics start with "llvm.x86.".
  def int_x86_sse41_mpsadbw         : GCCBuiltin<"__builtin_ia32_mpsadbw128">,
          Intrinsic<[llvm_v8i16_ty], [llvm_v16i8_ty, llvm_v16i8_ty,llvm_i8_ty],
                    [IntrNoMem, Commutative, ImmArg<ArgIndex<2>>]>;
}
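
Presumably the fix is simply to drop the Commutative flag from that
definition. A sketch of what the corrected entry might look like (my guess,
not a reviewed patch):

```tablegen
// Vector sum of absolute differences
let TargetPrefix = "x86" in {  // All intrinsics start with "llvm.x86.".
  def int_x86_sse41_mpsadbw         : GCCBuiltin<"__builtin_ia32_mpsadbw128">,
          Intrinsic<[llvm_v8i16_ty], [llvm_v16i8_ty, llvm_v16i8_ty,llvm_i8_ty],
                    [IntrNoMem, ImmArg<ArgIndex<2>>]>;  // Commutative removed
}
```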

This intrinsic is not commutative, however. The byte-wise absolute
differences are computed using different indexing for the first and second
arguments (a sliding 4-byte window over the first operand versus a fixed
4-byte block from the second), so swapping the operands generally produces
different results.

I first noticed this on Clang 12.0 for Windows, however I tested it on Godbolt
using the trunk Clang compiler and it still repros there. Here is a link to the
Godbolt code: https://godbolt.org/z/zs576oh39

Cheers, and let me know if you need anything else (or if I've gotten
something wrong: this is the first bug I've filed).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
