[PATCH] D26905: [SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.

Thu Dec 1 10:33:14 PST 2016

mssimpso added inline comments.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:466
+      assert(VL.size() == Scalars.size() && "Invalid size");
+      for (unsigned i = 0; i < VL.size(); i++) {
+        bool Found = false;
----------------
mkuper wrote:
> ashahid wrote:
> > mkuper wrote:
> > > RKSimon wrote:
> > > > This might be simplified with a for range loop and use of llvm:none_of / any_of?
> > > Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(nlogn) instead of O(n^2)
> > > (I'm not sure sort based on what, though - as well as the actual gain, since I guess VL.size() is small in practice.)
> > Pre-sorting would require two calls for sort and then compare, IMO, for the given small VL.size it would not make much difference. However I am open to other views.
> To be honest, I'm not sure - so I'd appreciate another opinion about pre-sorting.
> Matt/Hal/Simon?
I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1217

+      if (VL.size() > 2) {
+        SmallVector<Value *, 8> list;
----------------
mkuper wrote:
> ashahid wrote:
> > mkuper wrote:
> > > mkuper wrote:
> > > > Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?
> > > Why not for VL.size() == 2?
> > A jumbled VL of VL.size() == 2 is essentially a case of reversed VL. Considering the tradeoff between compile time of extra buildTree() for VL.size==2  vs additional runtime for shufflevector, I opted for extra compile time over extra runtime.
> The trade off here is more one of code complexity - is the gain in compile time worth having all the additional logic present for both the "fully unsorted" case and the "reversed" case.
Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2565
+      // must be followed by a 'shuffle' with the required jumbled mask.
+      if (E->NeedToShuffle && (VL.size() == E->Scalars.size())) {
+        SmallVector<Constant *, 8> Mask;
----------------
I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle?

https://reviews.llvm.org/D26905