[llvm-bugs] [Bug 52275] New: [SLP] slp-vectorizer incorrectly optimizes avx intrinsics code involving _mm_insert_epi8
via llvm-bugs
llvm-bugs at lists.llvm.org
Sat Oct 23 11:03:06 PDT 2021
https://bugs.llvm.org/show_bug.cgi?id=52275
Bug ID: 52275
Summary: [SLP] slp-vectorizer incorrectly optimizes avx intrinsics code involving _mm_insert_epi8
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Scalar Optimizations
Assignee: unassignedbugs at nondot.org
Reporter: benjsith at gmail.com
CC: llvm-bugs at lists.llvm.org
I encountered some code that appears to be incorrectly optimized by the
SLPVectorizer pass. The following C code is a minimal repro:
#include <immintrin.h>

__m128i do_stuff(__m128i I0, const int* IVals) {
    int Int0 = IVals[0];
    int Int1 = IVals[1];
    __m128i A = _mm_insert_epi8(I0, Int0, 0);
    __m128i B = _mm_insert_epi8(A, Int1, 1);
    __m128i C = _mm_add_epi8(A, B);
    return C;
}
Here is a Godbolt showing it compiled with -O1 vs -O2:
https://godbolt.org/z/Mqc5x3oxh
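For what it's worth, here is a small harness that makes the miscompile observable at
run time. It is not part of the original repro: the input values, file names, and the
build line are my own assumptions. With I0 zeroed and IVals = {3, 5}, byte 1 of the
result should be 5 (A + B), but the bad vectorization yields 10 (B + B).

/* Hypothetical harness (my addition, not from the report).
 * Assumed build line: clang -O2 -msse4.1 harness.c repro.c */
#include <immintrin.h>
#include <stdio.h>

__m128i do_stuff(__m128i I0, const int* IVals);  /* the repro function above */

int main(void) {
    int vals[2] = {3, 5};
    unsigned char out[16];

    __m128i r = do_stuff(_mm_setzero_si128(), vals);
    _mm_storeu_si128((__m128i*)out, r);

    /* Expected: byte0=6 byte1=5.  With the SLP miscompile: byte0=6 byte1=10. */
    printf("byte0=%u byte1=%u\n", (unsigned)out[0], (unsigned)out[1]);
    return 0;
}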
The corresponding LLVM IR for that function is as follows:
define dso_local <2 x i64> @do_stuff(<16 x i8> %I0, i32* nocapture readonly %iVals) local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* %iVals, align 4
  %arrayidx2 = getelementptr inbounds i32, i32* %iVals, i64 1
  %1 = load i32, i32* %arrayidx2, align 4
  %conv = trunc i32 %0 to i8
  %2 = insertelement <16 x i8> %I0, i8 %conv, i64 0
  %conv1 = trunc i32 %1 to i8
  %3 = insertelement <16 x i8> %2, i8 %conv1, i64 1
  %add.i = add <16 x i8> %3, %2
  %4 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %4
}
which, when run through 'opt -passes=slp-vectorizer', produces:
  %arrayidx2 = getelementptr inbounds i32, i32* %iVals, i64 1
  %0 = bitcast i32* %iVals to <2 x i32>*
  %1 = load <2 x i32>, <2 x i32>* %0, align 4
  %2 = trunc <2 x i32> %1 to <2 x i8>
  %3 = shufflevector <2 x i8> %2, <2 x i8> poison, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %4 = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %add.i = add <16 x i8> %4, %4
  %5 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %5
The problematic line is "%add.i = add <16 x i8> %4, %4", which adds the result of the
second insert to itself, rather than adding the result of the second insert to the
result of the first. This appears to happen because the slp-vectorizer pass decides the
result of the first insert does not need to be kept live after vectorization, so both
operands of the add get rewritten to the single shufflevector result. But I'm not
entirely sure why that happens.
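For comparison, a correct vectorized form would have to keep a separate value in which
only the first lane of %I0 has been replaced. A rough sketch of what that could look
like (hand-written IR for illustration, not actual opt output) is:

define dso_local <2 x i64> @do_stuff(<16 x i8> %I0, i32* nocapture readonly %iVals) local_unnamed_addr {
entry:
  %0 = bitcast i32* %iVals to <2 x i32>*
  %1 = load <2 x i32>, <2 x i32>* %0, align 4
  %2 = trunc <2 x i32> %1 to <2 x i8>
  %3 = shufflevector <2 x i8> %2, <2 x i8> poison, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  ; %A replaces only lane 0 of %I0 (the first insert); %B replaces lanes 0 and 1
  %A = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %B = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  ; the add must use both %A and %B, not %B twice
  %add.i = add <16 x i8> %B, %A
  %4 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %4
}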
I noticed that this behaviour was present in 13.0 but not in 12.0.1, so I tried
bisecting to find the commit that caused it. However, the bisection landed on
49d3a367c0376a95b9518e90426cdd6d5508e64a, which only adjusted the cost metrics for
trunc instructions (and, I think, merely made the compiler decide this vectorization
was profitable). I didn't test further to isolate the commit where the bug was
actually introduced.
I tested this on latest trunk (710596a1e15188171edd5c6fffe6b7fe483ca594) and
confirmed it was still present. I observed it on both Windows and Linux.
This was not code I wrote by hand; it was found by a fuzzer I made for testing
intrinsics compilation.