[PATCH] D43776: [SLP] Fix PR36481: vectorize reassociated instructions.

Fri Mar 23 16:58:02 PDT 2018

Ayal added a comment.

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:679
+/// saves the mask for actual memory accesses in program order in
+/// \p SortedIndices as <2,0,1,3>
+bool sortPtrAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
----------------
ABataev wrote:
> Ayal wrote:
> > ABataev wrote:
> > > Ayal wrote:
> > > > The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references.
> > > > 
> > > > Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess():
> > > > 1. Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min.
> > > > 2. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once.
> > > > 
> > > > Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO.
> > > > 
> > > > Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.
> > > The documentation above does not say anything about consecutive access. It just states, that the pointers are sorted and that's it. I did it on purpose, so later we could reuse this function for masked loads|stores.
> > > Masked loads are not supported yet, that's why in the SLPVectorizer I added an additional check for the consecutive access.
> > By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses.
> > 
> > Masked loads and stores (will) also require that all pointers be within range of a single vector, right?
> 1. I see.
> 2. Right
Check for zero gaps, i.e., replicated references?
These are currently not supported, when checking `canReuseExtract()`.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1113
+    const auto *Diff =
+        dyn_cast<SCEVConstant>(SE.getMinusSCEV(SE.getSCEV(Ptr), Scev0));
+    // The pointers may not have a constant offset from each other, or SCEV
----------------
ABataev wrote:
> Ayal wrote:
> > ABataev wrote:
> > > Ayal wrote:
> > > > Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.
> > > Tried, it does not work in many cases.
> > Then that should be fixed, worth opening a PR?
> I'm not sure that this is a bug, it is a feature :) . computeConstantDifference() is used to get the difference between the pointer returned by the `GetUnderlyingObject()` and  kind of a `GEP %ptr, 0, n`, where `n` is constant. But this function does not work if the first load element is from `GEP %ptr, 0, m`, not from `%ptr`, and others are from `GEP %ptr, 0, m+1`, `GEP %ptr, 0, m+2` etc.
ok, right; `computeConstantDifference()` is lightweight. But it would be good to refactor a more time consuming variant which checks if `getMinusSCEV` returns a constant, and first compares UnderlyingObjects; and also AddressSpaces, following LoopVectorize.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1659
+        // Check that the sorted loads are consecutive.
+        if (Diff && Diff->getAPInt().getZExtValue() == (VL.size() - 1) * Size) {
+          if (BestOrder.empty()) {
----------------
ABataev wrote:
> Ayal wrote:
> > Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.
> Just like I said, I did it for future support of masked loads|stores.
OK, so `sortPtrAccesses()` can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an `isPermutedConsecutiveAccess()` method which would call `sortPtrAccesses()`.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1621
+        newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies,
+                     I->getFirst());
+        return;
----------------
ABataev wrote:
> Ayal wrote:
> > Very good, thanks.
> > Can `CurrentOrder` be used instead of `I->getFirst()`? 
> > Can `++NumOpsWantToKeepOrder[CurrentOrder]` work? After all the trouble...
> No, we cannot use `CurrentOrder`. I tried to reduce the memory usage and instead of copying the `CurrentOrder` in the `newTreEntry()` function I keep the ArrayRef for this order. But `CurrentOrder` is the local variable and it will be destroyed when we exit out of its declaration scope. And, thus, we will keep the reference to incorrect memory. Instead, I need to use the reference that will exist until the end of the vectorization process (as the key of the map)
OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that `newTreeEntry()` should be given the object stored inside `NumOpsWantToKeepOrder` rather than the equivalent temporary `CurrentOrder`.

So `++NumOpsWantToKeepOrder[CurrentOrder]` can work, but then `newTreeEntry()` will need to be given `NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()`?

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:458
   assert(Opcode == Instruction::ExtractElement ||
          Opcode == Instruction::ExtractValue);
   if (Opcode == Instruction::ExtractElement) {
----------------
Add a message to the assert.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1250
+  /// operations in this order. It stores only those orders that require
+  /// reordering, if reordering is not required it is counted using \a
+  /// DirectOrderNum.
----------------
"\a" >> "\p"

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1254
+  /// Number of bundles that do not require reordering.
+  unsigned DirectOrderNum = 0;

----------------
"DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment?

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1617
+        DEBUG(dbgs()
+              << "SLP: Reusing or shuffling of reordered extract sequence.\n");
+        // Insert new order with initial value 0, if it does not exist,
----------------
May be helpful to also dump CurrentOrder into dbgs().

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1651
+      SmallVector<Value *, 4> PointerOps;
+      PointerOps.reserve(VL.size());
+      for (Value *V : VL) {
----------------
Fold the size into the constructor.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2041
+  unsigned E = VL.size();
+  // Assign initial value to of all items to E + 1 so we can check if the
+  // extract instruction index was used already.
----------------
> // Assign initial value to of all items to E + 1 so we can check if the

// Assign to all items the initial value `E + 1` so we can check if the

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2044
+  // Also, later we can check that all the indices are used and we have a
+  // consecutive access in the extract instructions (if at least one element of
+  // the BestOrder is still E + 1, we don't have consecutive extract
----------------
> (if at least one element of ... ).

", by checking that no element of `CurrentOrder` still has value `E + 1`."
Note that there is no such check later.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2074
+    }
   }

----------------
It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (`I < E`) clear CurrentOrder and return false.

Repository:
  rL LLVM

https://reviews.llvm.org/D43776