[PATCH] D30732: LoopVectorizer: let target limit memory intensive loops

Wed Mar 8 03:55:11 PST 2017

jonpa created this revision.
Herald added a subscriber: mzolotukhin.

On SystemZ it is imperative that loop unrolling does not produce more stores in the resulting loop than the processor can handle. Beyond that point the store tags run out and the loop slows down severely. To avoid this during loop unrolling, the SystemZ backend counts the number of stores and, based on that sum, computes a limit on the number of iterations to produce.

This problem should be handled during loop vectorization as well. The loop vectorizer may decide to vectorize a loop while scalarizing a particular (store) instruction, which increases the number of stores. It can also perform unrolling ("interleaving"), which likewise increases the number of stores.
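
To make the growth concrete, here is a rough sketch of the arithmetic (not code from the patch). The helper name and its parameters are hypothetical. It assumes that a widened store remains a single store, that a scalarized store becomes VF stores, and that interleaving by IC replicates the loop body IC times.

```cpp
// Hypothetical helper, not part of the patch: estimates how many store
// instructions the loop body contains after vectorization and interleaving.
unsigned estimateStoresAfterVectorization(unsigned WidenedStores,
                                          unsigned ScalarizedStores,
                                          unsigned VF, unsigned IC) {
  // A widened store remains one (vector) store, while a scalarized store
  // is expanded into VF scalar stores.
  unsigned PerCopy = WidenedStores + ScalarizedStores * VF;
  // Interleaving (unrolling) by IC replicates the whole body IC times.
  return PerCopy * IC;
}
```

For example, two widened stores plus one scalarized store at VF = 4 and IC = 2 would give (2 + 4) * 2 stores under these assumptions.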

In order to handle the case of scalarization, the widening decision must be available via a call to getWideningDecision(). Therefore this check could either be implemented in LoopVectorize.cpp, or the LoopVectorizationCostModel class must somehow be factored out of the file so that the target can get the InstWidening result for each store. I have begun with the simpler task of implementing this directly in LoopVectorizer, in the hope that this does not prove to be too crude to accept.

- checkVectorizationFactorForMem() must be called after expectedCost(), so that the widening decisions for each VF are available.

- Since getWideningDecision() is parameterized with VF, checkVectorizationFactorForMem() is called with each VF considered.

- limitUnrollForMem() computes the max unroll factor in a similar fashion by counting stores. I felt I had to avoid the name limitInterleaveFactorForMem, because 'interleaving' is already used (in my opinion in a confusing way) for both memory-interleaving and unrolling.
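
The two checks described above can be sketched outside of LLVM roughly as follows. This is a simplified illustration, not the patch itself: StoreInfo, countStores, vfIsOkForMem and limitInterleaveForMem are hypothetical names, the per-store cost and the scalarization flag stand in for what the patch obtains from TTI and getWideningDecision(), and a limit of 0 is assumed to mean "no limit", as in the patch.

```cpp
#include <climits>
#include <vector>

// Hypothetical stand-in for one store in the loop (not an LLVM type).
struct StoreInfo {
  unsigned Cost;   // assumed per-store count reported by the target
  bool Scalarized; // assumed widening decision for the VF under test
};

// Sum the stores, multiplying scalarized ones by VF.
unsigned countStores(const std::vector<StoreInfo> &Stores, unsigned VF) {
  unsigned NumStores = 0;
  for (const StoreInfo &S : Stores)
    NumStores += (VF > 1 && S.Scalarized) ? S.Cost * VF : S.Cost;
  return NumStores;
}

// VF is acceptable if there is no limit or the sum does not exceed it.
bool vfIsOkForMem(const std::vector<StoreInfo> &Stores, unsigned VF,
                  unsigned MaxNumStores) {
  if (MaxNumStores == 0)
    return true;
  return countStores(Stores, VF) <= MaxNumStores;
}

// Largest interleave count that keeps the store count within the limit,
// but never less than 1.
unsigned limitInterleaveForMem(const std::vector<StoreInfo> &Stores,
                               unsigned VF, unsigned MaxNumStores) {
  if (MaxNumStores == 0)
    return UINT_MAX;
  unsigned NumStores = countStores(Stores, VF);
  if (NumStores == 0)
    return UINT_MAX;
  unsigned Max = MaxNumStores / NumStores;
  return Max > 0 ? Max : 1;
}
```

The clamp to 1 in the last function mirrors the patch: even when a single copy of the loop body already exceeds the limit, an interleave count of 0 would be meaningless.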


https://reviews.llvm.org/D30732

Files:
  lib/Transforms/Vectorize/LoopVectorize.cpp


Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================

--- lib/Transforms/Vectorize/LoopVectorize.cpp
+++ lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2164,6 +2164,48 @@
 
   DecisionList WideningDecisions;
 
+  unsigned countResultingNumStores(unsigned VF) {
+    unsigned NumStores = 0;
+    for (BasicBlock *BB : TheLoop->blocks()) {
+      for (Instruction &I : *BB) {
+        if (isa<StoreInst>(&I)) {
+          Type *MemAccessTy = I.getOperand(0)->getType();
+          unsigned N = TTI.getMemoryOpCost(Instruction::Store, MemAccessTy, 0, 0);
+          if (VF > 1 &&
+              getWideningDecision(&I, VF) == LoopVectorizationCostModel::CM_Scalarize)
+            N *= VF;
+          NumStores += N;
+        }
+      }
+    }
+    return NumStores;
+  }
+
+  // Do an extra check to see if VF is ok, in the context of memory
+  // accesses. If the target has specified a limit for the number of stores
+  // in the resulting loop, the stores will be counted (and multiplied by VF
+  // in case of scalarization), and then true will be returned only if the
+  // sum does not exceed the limit.
+  bool checkVectorizationFactorForMem(unsigned VF) {
+    unsigned MaxNumStores = TTI.getMaxNumStoresInResultingLoop();
+    if (!MaxNumStores)
+      return true;
+    return (countResultingNumStores(VF) <= MaxNumStores);
+  }
+
+  // Similar to above, except that this involves the interleaving factor
+  // (unrolling) of the loop after VF has been decided on. If the target
+  // specifies a limit for the number of stores, a limit for the interleave
+  // factor is returned.
+  unsigned limitUnrollForMem(unsigned VF) {
+    unsigned MaxNumStores = TTI.getMaxNumStoresInResultingLoop();
+    if (!MaxNumStores)
+      return UINT_MAX;
+    unsigned NumStores = countResultingNumStores(VF);
+    unsigned Max = (NumStores ? (MaxNumStores / NumStores) : UINT_MAX);
+    return (Max > 0 ? Max : 1);
+  }
+
 public:
   /// The loop that we evaluate.
   Loop *TheLoop;
@@ -6312,6 +6354,11 @@
     // we need to divide the cost of the vector loops by the width of
     // the vector elements.
     VectorizationCostTy C = expectedCost(i);
+
+    // Target may put a limit on memory intensive loops.
+    if (!checkVectorizationFactorForMem(i))
+      break;
+
     float VectorCost = C.first / (float)i;
     DEBUG(dbgs() << "LV: Vector loop of width " << i
                  << " costs: " << (int)VectorCost << ".\n");
@@ -6460,6 +6507,11 @@
   // Clamp the interleave ranges to reasonable counts.
   unsigned MaxInterleaveCount = TTI.getMaxInterleaveFactor(VF);
 
+  // Target may put a limit on memory intensive loops.
+  unsigned Lim = limitUnrollForMem(VF);
+  if (Lim < MaxInterleaveCount)
+    MaxInterleaveCount = Lim;
+
   // Check if the user has overridden the max.
   if (VF == 1) {
     if (ForceTargetMaxScalarInterleaveFactor.getNumOccurrences() > 0)

