[PATCH] D26905: [SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.

Michael Kuperstein via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 1 10:56:26 PST 2016


mkuper added inline comments.


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1217
 
+      if (VL.size() > 2) {
+        SmallVector<Value *, 8> list;
----------------
mssimpso wrote:
> mkuper wrote:
> > ashahid wrote:
> > > mkuper wrote:
> > > > mkuper wrote:
> > > > > Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?
> > > > Why not for VL.size() == 2?
> > > A jumbled VL with VL.size() == 2 is essentially just a reversed VL. Considering the trade-off between the compile time of an extra buildTree() for VL.size() == 2 and the additional runtime of a shufflevector, I opted for the extra compile time over the extra runtime.
> > The trade-off here is more one of code complexity - is the gain in compile time worth keeping all the additional logic for both the "fully unsorted" case and the "reversed" case?
> Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.
I don't think you're wrong; on the contrary, I was advocating removing the code I added (for reversing loads) and replacing it completely with something like this. But I didn't realize that would introduce an extra shuffle.
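
To make the trade-off concrete, here is a minimal sketch (hypothetical IR, not taken from the patch or its tests) of four consecutive loads used in jumbled order, and the wide load plus shufflevector this patch would emit for them:

  ; Scalar form: consecutive i32 loads from %in, used in the order 1, 3, 2, 0.
  %p0 = getelementptr inbounds i32, i32* %in, i64 0
  %p1 = getelementptr inbounds i32, i32* %in, i64 1
  %p2 = getelementptr inbounds i32, i32* %in, i64 2
  %p3 = getelementptr inbounds i32, i32* %in, i64 3
  %l1 = load i32, i32* %p1, align 4
  %l3 = load i32, i32* %p3, align 4
  %l2 = load i32, i32* %p2, align 4
  %l0 = load i32, i32* %p0, align 4

  ; Vectorized form: one wide load, plus the extra shuffle under discussion to
  ; restore the jumbled use order.
  %vp = bitcast i32* %in to <4 x i32>*
  %wide = load <4 x i32>, <4 x i32>* %vp, align 4
  %jumbled = shufflevector <4 x i32> %wide, <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>

For VL.size() == 2 the mask would degenerate to <i32 1, i32 0>, which is exactly the case the existing ReverseConsecutive path handles with an extra buildTree() instead of a shuffle.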


================
Comment at: test/Transforms/SLPVectorizer/X86/jumbled-load.ll:51
   %mul.4 = mul i32 %load.1, %load.6
   %gep.7 = getelementptr inbounds i32, i32* %out, i64 0
   store i32 %mul.1, i32* %gep.7, align 4
----------------
What happens if the stores are also out of order?
(IIRC, we should already have code to deal with that; I just want to make sure it meshes correctly with the stores being out of order.)
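
For concreteness, a hypothetical variant of this test with the stores also issued out of order (a sketch only - the %mul.*/%gep.* names mirror the existing test, but this block is not part of the patch) might look like:

  %gep.7 = getelementptr inbounds i32, i32* %out, i64 0
  %gep.8 = getelementptr inbounds i32, i32* %out, i64 1
  %gep.9 = getelementptr inbounds i32, i32* %out, i64 2
  %gep.10 = getelementptr inbounds i32, i32* %out, i64 3
  ; Stores to consecutive slots, issued in a non-consecutive order.
  store i32 %mul.2, i32* %gep.8, align 4
  store i32 %mul.4, i32* %gep.10, align 4
  store i32 %mul.1, i32* %gep.7, align 4
  store i32 %mul.3, i32* %gep.9, align 4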


================
Comment at: test/Transforms/SLPVectorizer/X86/reduction_loads.ll:20
 ; CHECK-NEXT:    [[TMP1:%.*]] = load <8 x i32>, <8 x i32>* [[TMP0]], align 4
-; CHECK-NEXT:    [[TMP2:%.*]] = mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, [[TMP1]]
+; CHECK-NEXT:    [[TMP2:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[TMP3:%.*]] = mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, [[TMP2]]
----------------
RKSimon wrote:
> mkuper wrote:
> > RKSimon wrote:
> > > What can be done to avoid this regression?
> > Ohh, right, wanted to ask about this as well.
> > My guess is that this wasn't actually a regression, but that we moved the shuffle from the store side to the load side. Is that right?
> If the update_test_checks script has done its job and generated checks for all of the IR, then this is an additional shuffle; I can't see an equivalent shuffle or set of extracts in the codegen on the left.
Argh, I didn't even look at the new version of the test; my assumption came from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test and didn't remember that it doesn't have a shuffle...)
We really should not be regressing this.
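
One more observation (a reading of the test, not something stated above): if the shuffled vector only feeds the horizontal add reduction, then a full permutation of the lanes cannot change the reduced sum, so the new shufflevector is pure overhead, e.g.:

  ; Hypothetical reduced example: %v holds the eight loaded values.
  %rev = shufflevector <8 x i32> %v, <8 x i32> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  ; Summing all lanes of %rev (before or after the multiply by the splat of 42)
  ; produces the same scalar as summing the lanes of %v, so a reduction consumer
  ; makes this shuffle redundant.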


https://reviews.llvm.org/D26905




