[PATCH] D50840: [InstCombine] Extend collectShuffleElements to support extract/zext/insert patterns
Joey Gouly via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Aug 16 08:27:09 PDT 2018
joey added a comment.
Here is the (reduced) motivating example:
__kernel void foo(__global uchar4 *p1, __global ushort2 *p2)
{
  uchar4 t0 = p1[0];
  uchar4 t1 = p1[1];
  ushort2 t00 = (ushort2)((ushort)t0.x, (ushort)t0.y);
  ushort2 t10 = (ushort2)((ushort)t1.x, (ushort)t1.y);
  *p2 += (t00 * t10);
}
I haven't worked with the SLPVectorizer before, so I would need some guidance in making the change there. Alternatively, someone could take over the change, if that's easier.
I experimented with the following patch, which comments out the gather check for tiny trees:
diff --git a/lib/Transforms/Vectorize/SLPVectorizer.cpp b/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 32df6d58157..76103732adc 100644
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -2431,8 +2431,10 @@ bool BoUpSLP::isFullyVectorizableTinyTree() {
     return true;
   // Gathering cost would be too much for tiny trees.
+/*
   if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
     return false;
+*/
I then ran it on this test:
define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %elt0e = extractelement <8 x i16> %in, i32 3
  %elt1e = extractelement <8 x i16> %in, i32 1
  %elt2e = extractelement <8 x i16> %in, i32 0
  %elt3e = extractelement <8 x i16> %in, i32 3
  %elt0 = zext i16 %elt0e to i32
  %elt1 = zext i16 %elt1e to i32
  %elt2 = zext i16 %elt2e to i32
  %elt3 = zext i16 %elt3e to i32
  %vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
  %vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
  %vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
  %vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3
  ret <4 x i32> %vec.3
}
The SLPVectorizer produces:
define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %elt0e = extractelement <8 x i16> %in, i32 3
  %elt1e = extractelement <8 x i16> %in, i32 1
  %elt2e = extractelement <8 x i16> %in, i32 0
  %1 = insertelement <4 x i16> undef, i16 %elt0e, i32 0
  %2 = insertelement <4 x i16> %1, i16 %elt1e, i32 1
  %3 = insertelement <4 x i16> %2, i16 %elt2e, i32 2
  %4 = insertelement <4 x i16> %3, i16 %elt0e, i32 3
  %5 = zext <4 x i16> %4 to <4 x i32>
  %6 = extractelement <4 x i32> %5, i32 0
  %vec.0 = insertelement <4 x i32> undef, i32 %6, i32 0
  %7 = extractelement <4 x i32> %5, i32 1
  %vec.1 = insertelement <4 x i32> %vec.0, i32 %7, i32 1
  %8 = extractelement <4 x i32> %5, i32 2
  %vec.2 = insertelement <4 x i32> %vec.1, i32 %8, i32 2
  %9 = extractelement <4 x i32> %5, i32 3
  %vec.3 = insertelement <4 x i32> %vec.2, i32 %9, i32 3
  ret <4 x i32> %vec.3
}
Then InstCombine can clean that up into:
define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %1 = shufflevector <8 x i16> %in, <8 x i16> undef, <4 x i32> <i32 3, i32 1, i32 0, i32 3>
  %2 = zext <4 x i16> %1 to <4 x i32>
  ret <4 x i32> %2
}
So it looks like the SLPVectorizer can already handle this pattern, with some tweaks.
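To make the equivalence concrete, here is a small standalone C++ model of the two forms of @test3. It is only an illustration, not LLVM code: the type aliases and the helper names (scalarForm, shuffle, zext, vectorForm, formsAgree) are made up for this sketch, and the vectors are modelled as std::array. It shows that the per-lane extract/zext/insert sequence and the shufflevector followed by a single vector zext compute the same result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the IR vector types used in @test3.
using V8i16 = std::array<std::uint16_t, 8>;
using V4i16 = std::array<std::uint16_t, 4>;
using V4i32 = std::array<std::uint32_t, 4>;

// Scalar form: per lane, extractelement, zext i16 -> i32, insertelement.
// The source lanes 3, 1, 0, 3 are taken from the original test.
V4i32 scalarForm(const V8i16 &in) {
  const std::size_t lanes[4] = {3, 1, 0, 3};
  V4i32 out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = static_cast<std::uint32_t>(in[lanes[i]]); // zext of one element
  return out;
}

// Model of a single-source shufflevector: out[i] = in[mask[i]].
V4i16 shuffle(const V8i16 &in, const std::array<std::size_t, 4> &mask) {
  V4i16 out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = in[mask[i]];
  return out;
}

// Model of zext <4 x i16> to <4 x i32>: widen every lane without sign extension.
V4i32 zext(const V4i16 &v) {
  V4i32 out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = static_cast<std::uint32_t>(v[i]);
  return out;
}

// Vector form: the shuffle + zext pair that InstCombine ends up with.
V4i32 vectorForm(const V8i16 &in) {
  return zext(shuffle(in, {3, 1, 0, 3}));
}

// Convenience check that both forms agree on a given input.
bool formsAgree(const V8i16 &in) {
  return scalarForm(in) == vectorForm(in);
}
```

Calling formsAgree on any V8i16 input should return true, since both paths read the same source lanes and widen them the same way.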
Repository:
rL LLVM
https://reviews.llvm.org/D50840