<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [SLP] slp-vectorizer incorrectly optimizes avx intrinsics code involving _mm_insert_epi8"

   href="https://bugs.llvm.org/show_bug.cgi?id=52275">52275</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[SLP] slp-vectorizer incorrectly optimizes avx intrinsics code involving _mm_insert_epi8

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Scalar Optimizations

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>benjsith@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>I encountered some code that appears to be incorrectly optimized by the
SLPVectorizer pass. The following C code is a minimal repro:

__m128i do_stuff(__m128i I0, const int* IVals) {
        int Int0 = IVals[0];
        int Int1 = IVals[1];
        __m128i A = _mm_insert_epi8(I0, Int0, 0);
        __m128i B = _mm_insert_epi8(A, Int1, 1);
        __m128i C = _mm_add_epi8(A, B);
        return C;
}

Here is a Godbolt showing it compiled with -O1 vs -O2:
<a href="https://godbolt.org/z/Mqc5x3oxh">https://godbolt.org/z/Mqc5x3oxh</a>

The corresponding LLVM IR for that function is as follows:

define dso_local <2 x i64> @do_stuff(<16 x i8> %I0, i32* nocapture readonly %iVals) local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* %iVals, align 4
  %arrayidx2 = getelementptr inbounds i32, i32* %iVals, i64 1
  %1 = load i32, i32* %arrayidx2, align 4
  %conv = trunc i32 %0 to i8
  %2 = insertelement <16 x i8> %I0, i8 %conv, i64 0
  %conv1 = trunc i32 %1 to i8
  %3 = insertelement <16 x i8> %2, i8 %conv1, i64 1
  %add.i = add <16 x i8> %3, %2
  %4 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %4
}

Running this IR through 'opt -passes=slp-vectorizer' produces:

  %arrayidx2 = getelementptr inbounds i32, i32* %iVals, i64 1
  %0 = bitcast i32* %iVals to <2 x i32>*
  %1 = load <2 x i32>, <2 x i32>* %0, align 4
  %2 = trunc <2 x i32> %1 to <2 x i8>
  %3 = shufflevector <2 x i8> %2, <2 x i8> poison, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %4 = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %add.i = add <16 x i8> %4, %4
  %5 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %5

The problematic line is "%add.i = add <16 x i8> %4, %4", which adds two copies
of the result of the second insert, rather than adding the result of the first
insert to the result of the second. The slp-vectorizer pass appears to treat
the result of the first insert as unused after vectorization, so both operands
of the add get replaced with the shufflevector result. I'm not entirely sure
why that's happening.
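
To make the difference concrete, here is a rough scalar model of the two
computations in plain C (no intrinsics, so it does not depend on the compiler
under test). It is only a sketch: the helper names insert_lane,
expected_result, miscompiled_result and differing_lane_mask are hypothetical
and not part of the repro, and "miscompiled" simply mirrors the
"add %4, %4" line above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Scalar stand-in for _mm_insert_epi8: copy src and overwrite one lane with
   the low 8 bits of value. */
static void insert_lane(const uint8_t src[LANES], int value, int lane,
                        uint8_t dst[LANES]) {
    assert(lane >= 0 && lane < LANES);
    memcpy(dst, src, LANES);
    dst[lane] = (uint8_t)value;
}

/* What the source asks for: C = A + B, lane-wise with 8-bit wraparound. */
void expected_result(const uint8_t i0[LANES], int int0, int int1,
                     uint8_t out[LANES]) {
    uint8_t a[LANES], b[LANES];
    insert_lane(i0, int0, 0, a);   /* A: lane 0 replaced */
    insert_lane(a, int1, 1, b);    /* B: lanes 0 and 1 replaced */
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint8_t)(a[i] + b[i]);
}

/* What the vectorized IR computes: %4 + %4, i.e. B + B. */
void miscompiled_result(const uint8_t i0[LANES], int int0, int int1,
                        uint8_t out[LANES]) {
    uint8_t a[LANES], b[LANES];
    insert_lane(i0, int0, 0, a);
    insert_lane(a, int1, 1, b);
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint8_t)(b[i] + b[i]);
}

/* Bit i of the return value is set when lane i differs between the two. */
uint16_t differing_lane_mask(const uint8_t i0[LANES], int int0, int int1) {
    uint8_t want[LANES], got[LANES];
    uint16_t mask = 0;
    expected_result(i0, int0, int1, want);
    miscompiled_result(i0, int0, int1, got);
    for (int i = 0; i < LANES; ++i)
        if (want[i] != got[i])
            mask |= (uint16_t)(1u << i);
    return mask;
}
```

With I0 lanes set to 0..15, Int0 = 100 and Int1 = 7, the mask is 0x0002: only
lane 1 differs (expected 1 + 7 = 8, miscompiled 7 + 7 = 14). The two results
agree whenever the low byte of Int1 happens to equal the original lane 1 of
I0, so test inputs should avoid that case.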

I noticed that this behaviour is present in 13.0 but not in 12.0.1, so I tried
bisecting to find the commit that caused it. However, the bisect landed on
49d3a367c0376a95b9518e90426cdd6d5508e64a, which only adjusted the cost metrics
for trunc instructions (and, I think, made the compiler decide the
optimization was worth doing). I didn't test further to isolate the commit
that actually introduced the bug.

I tested this on latest trunk (710596a1e15188171edd5c6fffe6b7fe483ca594) and
confirmed the bug is still present. I observed it on both Windows and Linux.

This was not code I wrote by hand; it was found by a fuzzer I made to test
intrinsics compilation.</pre>

        </div>

      </p>


    </body>

</html>