[PATCH] D28975: [LV] Introducing VPlan to model the vectorized code and drive its transformation

Wed Feb 1 07:23:58 PST 2017

Ayal added inline comments.

================
Comment at: docs/VectorizationPlan.rst:10
+
+- be a *lightweight* NFC patch;
+- show key aspects of VPlan's Hierarchical CFG concept;
----------------
mkuper wrote:
> As @chandlerc pointed out to me, this isn't actually NFC. It's expected not to change the output of the vectorizer, but it's a huge change in how the vectorizer *functions*. 
Agreed, patch is more than "a pure refactoring/cleanup". We meant to say "no change in output intended".

================
Comment at: docs/VectorizationPlan.rst:27
+
+  * aligning Cost step with Transformation step,
+  * representing entire code being transformed,
----------------
mkuper wrote:
> By this do you mean that making a cost decision will "record" the transformation we expect to make?
That could be one way to think of it. The aim is to record each decision once, and have both cost estimate and transformation stem from the same recording.

================
Comment at: docs/VectorizationPlan.rst:36
+
+Out of scope for initial patch:
+
----------------
mkuper wrote:
> This sounds reasonable to me.
> I assume this is in sync with what Elena is doing in, say, D27919, right?
Right; Legal should be refactored as well to reflect its true Legality nature, by adjacent patches.

================
Comment at: docs/VectorizationPlan.rst:60
+
+2. A LoopVectorizationPlanner can construct, optimize and discard one or more
+   VPlans, providing different ways to vectorize the loop or the loop nest.
----------------
mkuper wrote:
> How is "can construct, optimize and discard one or more VPlans" different from just "generates VPlans"?
> (I'm trying to understand if this adds information I'm missing. :-) )
They are the same if "generates VPlans" is understood as an iterative process, which includes taking already generated VPlans and modifying them to produce new candidates for consideration.

================
Comment at: docs/VectorizationPlan.rst:111
+
+1. Legal Step: check if loop can be legally vectorized; encode contraints and
+   artifacts if so. Initiate Vectorization Plan showing how the loop can be
----------------
mkuper wrote:
> contraints -> constraints.
> Also,  I understand what constraints are in this context (e.g. dependence distance), but what do you mean by artifacts?
Right, a constraint can be, e.g., an upper bound on the VF.
Artifacts are pieces of information collected by Legal which are useful for the vectorization process. An artifact can be, e.g., the set of reduction phi's; Legal has to identify them to make sure they are all vectorizable.
The distinction may be somewhat vague; e.g., Legal has to make sure runtime guards can be constructed for possible memory dependencies.

Another way of putting this, w/o constraints nor artifacts: Legal can be made responsible for either determining that a loop cannot be vectorizable, or else for providing initial VPlan(s) as constructive proof showing how it can be vectorized, albeit sub-optimally.

================
Comment at: docs/VectorizationPlan.rst:131
+
+One can compare with LLVM's existing SLP vectorizer, where TSLP [3]_ adds
+Step 2.b.
----------------
mkuper wrote:
> Could you give a more concrete example for what optimizing an LV VPlan would mean?
> 
> I haven't read the TSLP paper (although I should), but skimming it, it looks like it's basically constructing a lot of subtrees (each of which could be considered, in our context, its own VPlan) and choose the best among those. So this isn't exactly "optimizing" a plan, it's just recursively generating plans and choosing the cheapest one.
> Or should I just go and read the paper properly?
An example of optimizing an LV VPlan is LoopVectorizationPlanner::optimizePredicatedInstructions(). Another should be the recognition of interleave groups.

The reference to TSLP is by analogy, which you've summarized quite well. The point was that there could be various candidates worth considering, in this case all with the same VF. Furthermore, constructing TSLP subtrees is somewhat analogous to LV's isProfitableToScalarize().

(BTW, you can also watch the TSLP video recording ;-)

================
Comment at: docs/VectorizationPlan.rst:140
+OpenMP [6]_. This design to extend the Loop Vectorizer to outer loops supports
+and raises the importance of explicit vectorization beyond the current
+capabilities of Clang and LLVM. Namely, from currently forcing the
----------------
mkuper wrote:
> rengolin wrote:
> > mkuper wrote:
> > > I don't believe anything in this design raises the importance of explicit outer loop vectorization.
> > > I think what you're trying to say is that if we want to support explicit outer loop vectorization (which I believe we do), the best way to do it would be using this design. Am I right?
> > From the original RFC (linked in the description), the idea is that outer loops will add a lot more complexity to the vectoriser, which is already getting side-tracked by the multiple strategies. This effort is the first step in joining the strategies, which will be the sanest way forward to adding more functionality to the loop vectoriser, including outer loops and better interactions with Polly.
> All of that is correct, I didn't mean to imply we should be handling outer loops *now*. :-)
> I was just confused about the "raising the importance" part.
Comment tries to say that it'll be harder to continue and make "right" decisions as the scope of vectorization increases to outerloops, with the proposed design (or any other). Explicit vectorization tries to mitigate this expected hardship.

================
Comment at: docs/VectorizationPlan.rst:240
+   In particular, the transition can start with an NFC patch.
+   In particular, VPlan must support efficient selection of VF and/or UF.
+
----------------
rengolin wrote:
> I believe this is not possible, but that's ok.
> 
> What I think you mean is that we can start with a single VPlan that does **exactly** what the current cost model does, in which case, it will have no noticeable impact to the users (this is not the same as NFC, as Michael said). But just doing that is just replacing six of one by half-a-dozen of the other.
> 
> This is only worth doing if we can add more (different) VPlans, in which case, there will be a noticeable impact in higher optimisation levels.
> 
> But again, that's ok. We currently change the behaviour of the vectoriser depending on the -O flag and we can continue doing it. -O2 means one VPlan, -O3 means more. We could even break the invisible 3 barrier and add an actual 4 which is exhaust all options.
> 
> My point is: this cannot be a design guideline.
The guideline requires VPlan to be designed such that it can support any decision taken by current LV; but of-course not be limited to it.

================
Comment at: docs/VectorizationPlan.rst:244
+   model and the code generation phases, where the cost estimation must
+   evaluate the to-be-generated code accurately.
+
----------------
mkuper wrote:
> rengolin wrote:
> > Aligning cost and codegen will always have to rely on heuristics, even if we codegen multiple variations, as this is modelling IR, not assembly.
> > 
> > We should respect the cost analysis and we should strive to do the best we can, yes, but we shouldn't try to estimate the generated code **accurately** over other improvements.
> I think "codegen" here means "the IR generated by the vectorizer", not "backend code generation". See line 123.
> Actually getting the cost model to accurately estimate the cost of complex IR constructs is an orthogonal problem.
Right, sorry for the confusion. "CodeGen" indeed means here "the IR code generated by the vectorizer".

================
Comment at: docs/VectorizationPlan.rst:251
+      vectorized loop which may include multiple basic-blocks and nested loops.
+   b. SLP vectorization.
+   c. Combinations of the above, including nested vectorization: vectorizing
----------------
mkuper wrote:
> If this happens, it would be fantastic. 
> How would this design help support this?
The original thought of a Recipe should also match an SLP 'bundle'. An interleave-group recipe for instance takes care of a group of related loads or stores, albeit doing more than stitching them together to produce a single vector load or store. Modeling SLP requires more attention.

================
Comment at: docs/VectorizationPlan.rst:271
+:VPlan:
+  A recipe for generating a vectorized version from a given IR code.
+  Takes a "scenario-based approach" to vectorization planning.
----------------
mkuper wrote:
> So, given the text below, by "a given IR code" you mean "a given IR SESE region", right?
Right.

================
Comment at: docs/VectorizationPlan.rst:281
+  basic-blocks as in Sharir [9]_, promoting the Tile encapsulation. We use the
+  terms Region and Block rather than Tile [8]_ to avoid confusion with loop
+  tiling.
----------------
mkuper wrote:
> Each Region is still required to be SESE, right?
Right.

================
Comment at: docs/VectorizationPlan.rst:289
+  VPBasicBlocks.
+  The VPBasicBlock takes care of the control-flow
+  relations with other VPBasicBlock's and Regions.
----------------
mkuper wrote:
> Something here is garbled.
How so?

================
Comment at: docs/VectorizationPlan.rst:295
+  may happen, e.g., to support disjoint Regions and to ensure Regions have a
+  single exit, possibly an empty one.
+
----------------
rengolin wrote:
> mkuper wrote:
> > What do you mean by supporting disjoint Regions?
> If this is related to detecting in-loop disjoint regions, wouldn't it be easier to do that transformation before the vectoriser and generate two loops, so that they're always SESE or not worth looking?
Two VPRegions are disjoint (or else contain one another) in the sense that they share no common VP(Basic)Block. Say we have two if-the-else hammocks one after the other, and we wish to wrap each in a VPRegion. To avoid having the Exit of the first be the Entry of the second, it should be split somehow, possibly resulting in an empty block.

================
Comment at: docs/VectorizationPlan.rst:307
+  **Design principle:** in order to reason about how to vectorize an Instruction
+  or how much it would cost, one has to consult the VPRecipe holding it.
+
----------------
rengolin wrote:
> This makes sense. And VRecipes consult the target information based on instructions and sizes.
> 
> It should be straightforward to have cross-target VRecipes just based on isLegal() and friends, which would prune all unsupported functionality really early.
VPRecipes should indeed be cross-target. If something in the loop is illegal we avoid building any VPlans for it in the first place, as a form of early pruning. OTOH, it should conceptually be possible, and arguably profitable, to scalarize any instruction.

https://reviews.llvm.org/D28975