[PATCH] D43776: [SLP] Fix PR36481: vectorize reassociated instructions.

Fri Mar 9 14:44:09 PST 2018

Ayal added inline comments.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:679
+/// saves the mask for actual memory accesses in program order in
+/// \p SortedIndices as <2,0,1,3>
+bool sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
----------------
ABataev wrote:
> Ayal wrote:
> > The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references.
> > 
> > Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess():
> > 1. Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min.
> > 2. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once.
> > 
> > Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO.
> > 
> > Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.
> The documentation above does not say anything about consecutive access. It just states, that the pointers are sorted and that's it. I did it on purpose, so later we could reuse this function for masked loads|stores.
> Masked loads are not supported yet, that's why in the SLPVectorizer I added an additional check for the consecutive access.
By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses.

Masked loads and stores (will) also require that all pointers be within range of a single vector, right?

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1113
+    const auto *Diff =
+        dyn_cast<SCEVConstant>(SE.getMinusSCEV(SE.getSCEV(Ptr), Scev0));
+    // The pointers may not have a constant offset from each other, or SCEV
----------------
ABataev wrote:
> Ayal wrote:
> > Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.
> Tried, it does not work in many cases.
Then that should be fixed, worth opening a PR?

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1229
+  SmallVector<OrdersType, 4> OpOrders;
+  unsigned DirectOrderNum = 0;

----------------
ABataev wrote:
> Ayal wrote:
> > Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., `NumOpsWantToKeepOriginalOrder`
> > 
> > Can add that `NumOpsWantToKeepOriginalOrder` holds the count for the identity permutation, instead of holding this count inside `NumOpsWantsToKeepOrder` along with all other permutations.
> I don't want to add the new entry for operations, that do not require reordering. I'd better the code in another way.
Understood; suggested before to simply **document** that one could alternatively hold the count for the identity permutation inside `NumOpsWantsToKeepOrder` map, but we're instead keeping it outside using `NumOpsWantToKeepOriginalOrder` counter.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1581
     case Instruction::ExtractElement: {
-      bool Reuse = canReuseExtract(VL, VL0);
+      OrdersType BestOrder;
+      bool Reuse = canReuseExtract(VL, VL0, BestOrder);
----------------
ABataev wrote:
> Ayal wrote:
> > BestOrder >> CurrentOrder?
> I think it does not matter because the current order is the best order.
The current order may not be the best order, the first time the tree is built, right?

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2029
+    }
+    Optional<unsigned> Idx = matchExtractIndex(Inst, I, Inst->getOpcode());
+    if (Idx.hasValue()) {
----------------
ABataev wrote:
> Ayal wrote:
> > Can simplify by checking if BestOrder is the identify permutation at the end, as done at the end of sortPtrAccesses(); using `getExtractIndex(Inst)` which returns Idx even if it's equal to I. Better rename BestOrder here too. 
> Check the boolean flag is faster than to perform N comparisons. That's why I'd prefer to leave it as is.
Right; it could simplify the code though, and N is usually very small.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1621
+        newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies,
+                     I->getFirst());
+        return;
----------------
Very good, thanks.
Can `CurrentOrder` be used instead of `I->getFirst()`? 
Can `++NumOpsWantToKeepOrder[CurrentOrder]` work? After all the trouble...

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:3027
+  const unsigned E = Indices.size();
+  Mask.resize(Indices.size());
+  for (unsigned I = 0; I < E; ++I)
----------------
May as well use `E` when resizing `Mask`.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:3358
+        for (unsigned I = 0, End = E->Scalars.size(); I < End; ++I)
+          Mask[E->ReorderIndices[I]] = I;
+        V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()),
----------------
Another inversePermutation?

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:5693
+        SmallVector<Value *, 4> ReorderedOps;
+        ReorderedOps.reserve(VL.size());
+        for (const unsigned Idx : *Order)
----------------
Fold by feeding the constructor with the size.

Repository:
  rL LLVM

https://reviews.llvm.org/D43776