[llvm-bugs] [Bug 52275] New: [SLP] slp-vectorizer incorrectly optimizes avx intrinsics code involving _mm_insert_epi8
via llvm-bugs
llvm-bugs at lists.llvm.org
Sat Oct 23 11:03:06 PDT 2021
https://bugs.llvm.org/show_bug.cgi?id=52275
Bug ID: 52275
Summary: [SLP] slp-vectorizer incorrectly optimizes avx intrinsics code involving _mm_insert_epi8
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Scalar Optimizations
Assignee: unassignedbugs at nondot.org
Reporter: benjsith at gmail.com
CC: llvm-bugs at lists.llvm.org
I encountered some code that appears to be incorrectly optimized by the
SLPVectorizer pass. The following C code is a minimal repro:
#include <immintrin.h>

__m128i do_stuff(__m128i I0, const int* IVals) {
    int Int0 = IVals[0];
    int Int1 = IVals[1];
    __m128i A = _mm_insert_epi8(I0, Int0, 0);
    __m128i B = _mm_insert_epi8(A, Int1, 1);
    __m128i C = _mm_add_epi8(A, B);
    return C;
}
Here is a Godbolt showing it compiled with -O1 vs -O2:
https://godbolt.org/z/Mqc5x3oxh
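For what it's worth, here is a small harness that makes the miscompile observable at
run time. It is not part of the original repro: the input values, file names, and the
build line are my own assumptions. With I0 zeroed and IVals = {3, 5}, byte 1 of the
result should be 5 (A + B), but the bad vectorization yields 10 (B + B).

/* Hypothetical harness (my addition, not from the report).
 * Assumed build line: clang -O2 -msse4.1 harness.c repro.c */
#include <immintrin.h>
#include <stdio.h>

__m128i do_stuff(__m128i I0, const int* IVals);  /* the repro function above */

int main(void) {
    int vals[2] = {3, 5};
    unsigned char out[16];

    __m128i r = do_stuff(_mm_setzero_si128(), vals);
    _mm_storeu_si128((__m128i*)out, r);

    /* Expected: byte0=6 byte1=5.  With the SLP miscompile: byte0=6 byte1=10. */
    printf("byte0=%u byte1=%u\n", (unsigned)out[0], (unsigned)out[1]);
    return 0;
}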
The corresponding LLVM IR for that function is as follows:
define dso_local <2 x i64> @do_stuff(<16 x i8> %I0, i32* nocapture readonly %iVals) local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* %iVals, align 4
  %arrayidx2 = getelementptr inbounds i32, i32* %iVals, i64 1
  %1 = load i32, i32* %arrayidx2, align 4
  %conv = trunc i32 %0 to i8
  %2 = insertelement <16 x i8> %I0, i8 %conv, i64 0
  %conv1 = trunc i32 %1 to i8
  %3 = insertelement <16 x i8> %2, i8 %conv1, i64 1
  %add.i = add <16 x i8> %3, %2
  %4 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %4
}
which, when run through 'opt -passes=slp-vectorizer', produces:
  %arrayidx2 = getelementptr inbounds i32, i32* %iVals, i64 1
  %0 = bitcast i32* %iVals to <2 x i32>*
  %1 = load <2 x i32>, <2 x i32>* %0, align 4
  %2 = trunc <2 x i32> %1 to <2 x i8>
  %3 = shufflevector <2 x i8> %2, <2 x i8> poison, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %4 = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %add.i = add <16 x i8> %4, %4
  %5 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %5
The problematic line is "%add.i = add <16 x i8> %4, %4", which adds the result of the
second insert to itself, rather than adding the result of the second insert to the
result of the first. This appears to happen because the slp-vectorizer pass decides the
result of the first insert does not need to be kept live after vectorization, so both
operands of the add get rewritten to the single shufflevector result. But I'm not
entirely sure why that happens.
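For comparison, a correct vectorized form would have to keep a separate value in which
only the first lane of %I0 has been replaced. A rough sketch of what that could look
like (hand-written IR for illustration, not actual opt output) is:

define dso_local <2 x i64> @do_stuff(<16 x i8> %I0, i32* nocapture readonly %iVals) local_unnamed_addr {
entry:
  %0 = bitcast i32* %iVals to <2 x i32>*
  %1 = load <2 x i32>, <2 x i32>* %0, align 4
  %2 = trunc <2 x i32> %1 to <2 x i8>
  %3 = shufflevector <2 x i8> %2, <2 x i8> poison, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  ; %A replaces only lane 0 of %I0 (the first insert); %B replaces lanes 0 and 1
  %A = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %B = shufflevector <16 x i8> %I0, <16 x i8> %3, <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  ; the add must use both %A and %B, not %B twice
  %add.i = add <16 x i8> %B, %A
  %4 = bitcast <16 x i8> %add.i to <2 x i64>
  ret <2 x i64> %4
}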
I noticed that this behaviour was present in 13.0 but not in 12.0.1, so I tried
bisecting to find the commit that caused it. However, the bisection landed on
49d3a367c0376a95b9518e90426cdd6d5508e64a, which only adjusted the cost metrics for
trunc instructions (and, I think, merely made the compiler decide this vectorization
was profitable). I didn't test further to isolate the commit where the bug was
actually introduced.
I tested this on latest trunk (710596a1e15188171edd5c6fffe6b7fe483ca594) and
confirmed it was still present. I observed it on both Windows and Linux.
This was not code I wrote by hand; it was found by a fuzzer I made for testing
intrinsics compilation.