[PATCH] D43776: [SLP] Fix PR36481: vectorize reassociated instructions.

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 9 09:52:49 PST 2018


ABataev marked 12 inline comments as done.
ABataev added a comment.

In https://reviews.llvm.org/D43776#1031044, @Ayal wrote:

> This patch addresses the following TODO, plus handles extracts:
>
>   // Check if the loads are consecutive, reversed, or neither.
>   // TODO: What we really want is to sort the loads, but for now, check
>   // the two likely directions.
>
>
> At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?


Yes, but this must be in a different patch, not this one.



================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:679
+/// saves the mask for actual memory accesses in program order in
+/// \p SortedIndices as <2,0,1,3>
+bool sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
----------------
Ayal wrote:
> The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references.
> 
> Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess():
> 1. Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min.
> 2. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once.
> 
> Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO.
> 
> Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.
The documentation above does not say anything about consecutive accesses; it only states that the pointers are sorted. I did this on purpose, so that later we can reuse this function for masked loads/stores.
Masked loads are not supported yet, which is why I added an additional check for consecutive accesses in the SLPVectorizer.
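
The pigeonhole scheme Ayal sketches above can be illustrated with a self-contained sketch, using plain integer byte offsets in place of constant SCEV differences (the function name and signature here are hypothetical, not the LAA API):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Given the byte offset of each access and the access size, compute a
// permutation Order with Order[i] = position of access i in address order.
// Returns std::nullopt if the accesses are not consecutive under any
// permutation (gaps, replicated accesses, or misaligned offsets).
std::optional<std::vector<unsigned>>
permutedConsecutiveOrder(const std::vector<int64_t> &Offsets, int64_t Size) {
  if (Offsets.empty() || Size <= 0)
    return std::nullopt;
  // 1. Scan all accesses to find the minimal offset.
  int64_t Min = *std::min_element(Offsets.begin(), Offsets.end());
  // 2. Pigeonhole each access into the slot given by its distance from
  //    the minimum; bail out if a slot is filled twice or lies outside
  //    the vector (i.e., there is a gap).
  std::vector<unsigned> Order(Offsets.size());
  std::vector<bool> Seen(Offsets.size(), false);
  for (unsigned I = 0, E = Offsets.size(); I != E; ++I) {
    int64_t Diff = Offsets[I] - Min;
    if (Diff % Size != 0)
      return std::nullopt;
    int64_t Slot = Diff / Size;
    if (Slot >= static_cast<int64_t>(E) || Seen[Slot])
      return std::nullopt; // gap beyond the vector, or replicated access
    Seen[Slot] = true;
    Order[I] = static_cast<unsigned>(Slot);
  }
  return Order;
}
```

For offsets {8, 0, 4, 12} with Size 4 this yields the mask <2,0,1,3> from the documentation quoted above.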


================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1113
+    const auto *Diff =
+        dyn_cast<SCEVConstant>(SE.getMinusSCEV(SE.getSCEV(Ptr), Scev0));
+    // The pointers may not have a constant offset from each other, or SCEV
----------------
Ayal wrote:
> Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.
I tried it, but it does not work in many cases.


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1229
+  SmallVector<OrdersType, 4> OpOrders;
+  unsigned DirectOrderNum = 0;
 
----------------
Ayal wrote:
> Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., `NumOpsWantToKeepOriginalOrder`
> 
> Can add that `NumOpsWantToKeepOriginalOrder` holds the count for the identity permutation, instead of holding this count inside `NumOpsWantsToKeepOrder` along with all other permutations.
I don't want to add a new entry for operations that do not require reordering. I'd better rework the code in another way.


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1581
     case Instruction::ExtractElement: {
-      bool Reuse = canReuseExtract(VL, VL0);
+      OrdersType BestOrder;
+      bool Reuse = canReuseExtract(VL, VL0, BestOrder);
----------------
Ayal wrote:
> BestOrder >> CurrentOrder?
I think it does not matter because the current order is the best order.


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1593
+          auto I = NumOpsWantToKeepOrder.find(BestOrder);
+          if (I == NumOpsWantToKeepOrder.end()) {
+            OpOrders.emplace_back(BestOrder);
----------------
Ayal wrote:
> This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like `++NumOpsWantToKeepOrder[BestOrder]`.
I need to use an iterator; I will rework the code.
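
Ayal's `++NumOpsWantToKeepOrder[BestOrder]` suggestion relies on `operator[]` value-initializing the counter on first use, so the insert-or-increment collapses to one line. A sketch with std::map standing in for LLVM's map types (names hypothetical):

```cpp
#include <algorithm>
#include <map>
#include <vector>

using OrdersType = std::vector<unsigned>;

// Sketch: count how many operations request each load order, then return
// the most frequently requested one (empty if nothing was requested).
OrdersType mostPopularOrder(const std::vector<OrdersType> &Requested) {
  std::map<OrdersType, unsigned> NumOpsWantToKeepOrder;
  for (const OrdersType &O : Requested)
    ++NumOpsWantToKeepOrder[O]; // operator[] default-constructs count to 0
  auto Best = std::max_element(
      NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
      [](const auto &A, const auto &B) { return A.second < B.second; });
  return Best == NumOpsWantToKeepOrder.end() ? OrdersType() : Best->first;
}
```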


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1659
+        // Check that the sorted loads are consecutive.
+        if (Diff && Diff->getAPInt().getZExtValue() == (VL.size() - 1) * Size) {
+          if (BestOrder.empty()) {
----------------
Ayal wrote:
> Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.
As I said above, I did it this way for future support of masked loads/stores.
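
The check in the hunk above works because N distinct offsets that are multiples of Size and span exactly (N - 1) * Size must fill every slot in between. A self-contained sketch of that reasoning (hypothetical name, plain integer offsets):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: distinct, Size-aligned offsets whose max - min difference equals
// (N - 1) * Size are necessarily consecutive under some permutation.
bool spanIsConsecutive(std::vector<int64_t> Offsets, int64_t Size) {
  if (Offsets.empty() || Size <= 0)
    return false;
  std::sort(Offsets.begin(), Offsets.end());
  // Replicated accesses would make the span test unsound, so reject them.
  if (std::adjacent_find(Offsets.begin(), Offsets.end()) != Offsets.end())
    return false;
  for (int64_t O : Offsets)
    if ((O - Offsets.front()) % Size != 0)
      return false;
  return Offsets.back() - Offsets.front() ==
         static_cast<int64_t>(Offsets.size() - 1) * Size;
}
```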


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2029
+    }
+    Optional<unsigned> Idx = matchExtractIndex(Inst, I, Inst->getOpcode());
+    if (Idx.hasValue()) {
----------------
Ayal wrote:
> Can simplify by checking if BestOrder is the identify permutation at the end, as done at the end of sortPtrAccesses(); using `getExtractIndex(Inst)` which returns Idx even if it's equal to I. Better rename BestOrder here too. 
Checking the boolean flag is faster than performing N comparisons. That's why I'd prefer to leave it as is.
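
The trade-off being discussed, sketched with hypothetical names: a post-hoc identity check costs N comparisons, while a flag maintained during construction makes the final query O(1).

```cpp
#include <vector>

// Post-hoc check: N comparisons over the finished order, as done at the
// end of sortPtrAccesses().
bool isIdentityOrder(const std::vector<unsigned> &Order) {
  for (unsigned I = 0, E = Order.size(); I != E; ++I)
    if (Order[I] != I)
      return false;
  return true;
}

// Flag-based alternative: keep the bit up to date while the order is
// built, so querying it at the end is a single load.
struct OrderBuilder {
  std::vector<unsigned> Order;
  bool IsIdentity = true;
  void append(unsigned Idx) {
    IsIdentity &= (Idx == Order.size());
    Order.push_back(Idx);
  }
};
```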


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:3086
+            Builder.SetInsertPoint(VL0);
           Builder.SetInsertPoint(VL0);
           V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
----------------
Ayal wrote:
> InsertPoint may be set twice to VL0?
> 
Missed it, thanks.


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:3298
+      if (IsReorder)
+        VL0 = cast<Instruction>(E->Scalars[E->ReorderIndices.front()]);
       setInsertPointAfterBundle(E->Scalars, VL0);
----------------
Ayal wrote:
> Should we first inverse the permutation and then take its front()?
> 
> Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already.
There are already several tests that check that this code is correct:
SLPVectorizer/X86/PR32086.ll, SLPVectorizer/X86/jumbled-load.ll, SLPVectorizer/X86/store-jumbled.ll, etc.
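
Ayal's point about inverting first matters because a permutation and its inverse generally have different front elements. A minimal sketch (hypothetical helper name):

```cpp
#include <vector>

// Invert a permutation P: if P maps position i to P[i], the inverse maps
// P[i] back to i, i.e. Inv[P[i]] == i for all i.
std::vector<unsigned> invertPermutation(const std::vector<unsigned> &P) {
  std::vector<unsigned> Inv(P.size());
  for (unsigned I = 0, E = P.size(); I != E; ++I)
    Inv[P[I]] = I;
  return Inv;
}
```

For P = <2,0,1,3>, the inverse is <1,2,0,3>; P.front() is 2 while the inverse's front() is 1, so which direction ReorderIndices encodes does change which scalar is picked as VL0.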


================
Comment at: test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll:24
 ; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP12]], i32 3
 ; CHECK-NEXT:    store <4 x i32> [[TMP13]], <4 x i32>* [[SINK:%.*]]
 ; CHECK-NEXT:    ret void
----------------
Ayal wrote:
> This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost.
Yes, InstCombiner will squash all these instructions into one shuffle. And yes, the cost model is aware of these operations and ignores their cost (canReuseExtract() is intended for this).


Repository:
  rL LLVM

https://reviews.llvm.org/D43776




