[PATCH] D32871: [LV] Using VPlan to model the vectorized code and drive its transformation

Mon Jun 12 02:35:52 PDT 2017

Ayal marked 2 inline comments as done.
Ayal added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2265
+    while (!VPlans.empty()) {
+      VPlan *Plan = VPlans.back();
+      VPlans.pop_back();
----------------
mkuper wrote:
> Ayal wrote:
> > rengolin wrote:
> > > SmallVectorImpl should destroy its own range, you don't need to do it yourself.
> > Leaving ~LoopVectorizationPlanner() {} does not seem to deallocate the VPlan objects created by buildVPlan(). Is there another way to prevent them from leaking?
> A vector of unique_ptr, maybe?
> Anyway, I'm fine with this as is.
Originally D28975 did use "shared_ptr<VPlan>". However using regular pointers to VPlan seems simpler, just need to destroy them correctly at the end.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1953
   bool isProfitableToScalarize(Instruction *I, unsigned VF) const {
+    // Unroller also calls this method, but does not collectInstsToScalarize.
+    if (VF == 1)
----------------
mkuper wrote:
> Why?
Good catch; isProfitableToScalarize() should be called only for VF > 1, as it compares scalarizing to vectorizing-for-VF. All original callers to isProfitableToScalarize() first check if VF > 1 before calling it. This patch adds a new caller inside tryToWiden(), which also avoids calling it with VF == 1.

Will adding an "assert(VF > 1);" instead of this "if (VF ==1) return true;".

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2324
+  /// returned value holds for the entire \p Range.
+  bool testVFRange(const std::function<bool(unsigned)> &Predicate,
+                   VFRange &Range);
----------------
mkuper wrote:
> This seems like a somewhat odd API.
> Why do we only support cutting from the top, and not from the bottom or the middle? Do we expect to only ever prune from the top? Otherwise, I'd expect the range to be represented as a set, rather than an interval, and use a filter over that set. The current state may be fine for simplicity's sake, but i'd like to understand this better.
> 
> Regardless, please rename the method. It's really surprising that a bool test...() modifies one of its arguments.
Indeed a more general implementation of building VPlans could represent arbitrary sets of VF's rather than "Ranges" or intervals of VF's, which this patch uses. VPlans themselves are not confined to represent only ranges.

This implementation builds VPlans for the full {1,2,4,8,...,MaxVF} range of feasible VF's by repeatedly building a VPlan starting from a given VF up until the maximum VF possible. Each vectorization decision can potentially reduce this maximum. The two extremes we could end up with are: one VPlan for all VF's, and one VPlan for each VF. Decisions hopefully exhibit this form of continuity, but they certainly don't have to.

Will see if above should be added as a comment to buildVPlans().

Some alternative names we came up with:
```
testAndClampVFRange()
prefixTestVFRange()
VFRangePrefixTest()
testStartAndSetEndVF()
```
any one of these looks ok? Better suggestions are welcome.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4687
   case InductionDescriptor::IK_FpInduction:
-    return widenIntOrFpInduction(P);
+    llvm_unreachable("Integer/fp induction is handled elsewhere.");
   case InductionDescriptor::IK_PtrInduction: {
----------------
mkuper wrote:
> Why the split between how we handle int and ptr inductions? To minimize the patch?
Yes, we're keeping ILV's current split between its widenIntOrFpInduction() and how it handles ptr inductions and other phi instructions. The former was last addressed and merged by r296145; the latter, appearing below, is simpler and remains inline.

In any case, addressing this split is independent of introducing VPlan.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4735

 void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) {
   switch (I.getOpcode()) {
----------------
mkuper wrote:
> Perhaps this should now be named widenInstruction?
Yup.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7634
+
+  for (auto *VPlanIter = VPlans.begin(); VPlanIter != VPlans.end();) {
+    VPlan *Plan = *VPlanIter;
----------------
mkuper wrote:
> I suppose this (and the rest of this code) is constructed this way because we intend to support more than one VPlan per VF/UF  pair soon?
Currently each VPlan covers a range of VF's and arbitrary UF's, where every feasible VF is covered by a single VPlan.

In the end a single VPlan should remain, which is the one that gets executed. As soon as possible all other VPlans are discharged, as done here.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7750

+namespace {
+/// VPWidenRecipe is a recipe for producing a copy of vector type for each
----------------
mkuper wrote:
> Maybe break all of the recipe stuff out into a separate file? Two files, even? (header/implementation)
Yes, this large LoopVectorizer.cpp file should be broken into multiple smaller files. ILV in its current form hinders doing so with these recipes. Follow-up patches should help facilitate this desired change.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:8250
+
+    for (Instruction &I : *BB) {
+      Instruction *Instr = &I;
----------------
mkuper wrote:
> Please add a comment somewhere explaining that the order in which we try the recipes matters. (e.g. tryToInterleave needs to come first.)
> I'm not sure how we should be handling recipe priority in general, if at all, but that's not for this review.
Sure. This ordering reflects the existing logic in LV.

Note that in the future, some recipes such as Interleave Groups may be constructed as a VPlan-based transformation, instead of jointly with other recipes, relieving this specific ordering concern here.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:8282
+          VPBB->appendRecipe(Recipe);
+        LastWidenRecipe = cast<VPWidenRecipe>(Recipe);
+        continue;
----------------
mkuper wrote:
> Shouldn't we be resetting LastWidenRecipe somewhere, if we encountered an instruction that uses a different recipe?
We could, as an optimization, because VPWidenRecipe's appendInstruction(Instr) currently succeeds only if Instr is its 'next' ingredient.

But it's not necessary - we can simply call appendInstruction(Instr) for the LastWidenRecipe even if other instructions have gone by (and have it fail if so). This way more relaxed forms of appendInstruction() can be supported.

OTOH, doing so would clutter each 'continue'.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7805
+  VPWidenIntOrFpInductionRecipe(PHINode *IV, TruncInst *Trunc = nullptr)
+      : VPRecipeBase(VPWidenIntOrFpInductionSC), IV(IV), Trunc(Trunc) {}
+
----------------
rengolin wrote:
> shouldn't you have some asserts here to make sure the PHI node is what you need it to be?
> 
> Same for other VPlans.
The asserts are there when execute() is called, e.g., when it calls ILV's widenIntOrFpInduction().

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:8017
+void LoopVectorizationPlanner::buildVPlans(unsigned MinVF, unsigned MaxVF) {
+  while (MinVF < MaxVF + 1) {
+    VFRange Range = {MinVF, MaxVF + 1};
----------------
rengolin wrote:
> I know you can re-use MinVF, but it would be more readable if you created a new variable VF to use inside the loop.
OK, sure.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:8022
+    std::string PlanName;
+    raw_string_ostream RSO(PlanName);
+    unsigned VF = MinVF;
----------------
rengolin wrote:
> This is confusing. Is this RSO *just* for the plan name?
> 
> It'd seem more appropriate to have the name be a property of the plans, not some outside property.
Yes, this is to concatenate the VF's into the plan name, which is indeed a property of the plan. Will move from buildVPlans() to the end of buildVPlan().

================
Comment at: lib/Transforms/Vectorize/VPlan.h:222
+
+  /// \return the closest ancestor starting from "this", which has successors.
+  /// Returns the root ancestor if all ancestors have no successors.
----------------
mkuper wrote:
> Ayal wrote:
> > mkuper wrote:
> > > The terminology here is somewhat confusing, since it's not immediately clear what the difference between an "ancestor" and a "predecessor" is. Maybe "enclosing region" instead of ancestor?
> > Added the following definition to clarify the terminology:
> >   /// An Ancestor of a block B is any block containing B, including B itself.
> > So it's either an "enclosing region" or the block itself, which in turn may be a region or a basic-block.
> I would strongly prefer less confusing terminology (I think ancestor very strongly evokes "direct or indirect predecessor"), but if I'm the only one getting confused by this, feel free to keep it.
ok, "Ancestors" and "Predecessors" are admittedly ambiguous terms in English... will use "getEnclosingBlockWithPredecessors()" and "getEnclosingBlockWithSuccecessors()" instead, after defining an Enclosing Block of block B to be any block that contains B, including itself.

================
Comment at: test/Transforms/LoopVectorize/if-pred-non-void.ll:222
 ; CHECK:   %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK:   %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK:   %[[T02:.+]] = add nsw i32 %[[T01]], %x
-; CHECK:   %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]]
+; CHECK:   %[[T01:.+]] = add nsw i32 %[[T00]], %x
+; CHECK:   %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0
----------------
mkuper wrote:
> The test changes look benign, but I'm curious about why we have them.
These test changes are due to the order in which predicated-and-scalarized basic-blocks are created and filled.

In current trunk, when the “udiv” is generated, the “extract” feeding its nominator is also generated and placed before it, courtesy of getScalarValue(). Later the “udiv” is placed in a basic-block of its own w/o its operands. Finally sinkScalarOperands() kicks in, sinking both its operands to appear next to it.

In this patch, when the “udiv” is generated, the “extract” feeding its nominator is also generated and placed before it, courtesy of getOrCreateScalar(). But having already created a basic-block for the “udiv”, this “extract” is placed there as well, complying with LV's approach of placing extracts next to their uses. When finally sinkScalarOperands() kicks in, it has only the other operand to sink (the denominator).

This discrepancy in the sinking order swaps the final placement of the two operands.

https://reviews.llvm.org/D32871