[llvm] [VPlan] Delay adding canonical IV increment and exit branches. (PR #82270)

Florian Hahn via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 12 14:01:08 PDT 2024


fhahn wrote:


> Can you please share a VPlan design doc which has been discussed with the community? Perhaps I'm missing something, but as of right now my view of VPlan is that it is self-contained, i.e. it should be possible to get all required information to make a transform without accessing the underlying IR (except for live-ins and live-outs), and at any point be able to generate code out of it without any extra manipulations on the VPlan.

I put up PRs to update some parts of the documentation (which started out as the original design docs from when VPlan was first proposed) based on the recent Developers' Meeting talk: https://github.com/llvm/llvm-project/pull/85688, https://github.com/llvm/llvm-project/pull/85689. W.r.t. VPWidenIntOrFpInductionRecipe, the increment was IIRC intentionally not modeled explicitly, to make use of VPlan's ability to model higher-level concepts directly and thus keep plans simpler with more expressive recipes.

The ability to generate code at any point would become less true with this patch, but we already employ gradual lowering, so initial VPlans cannot be code-gen'd directly (e.g. predicated VPReplicateRecipes only get expanded to replicate regions later, which helps to simplify earlier transforms).

> Ignoring the cost model part, for a VPlan that represents just one loop this certainly simplifies xforms. But if we consider future work with peel/remainder (prolog/epilog/tail/cleanup) loops represented, that is going to complicate xforms (unrolling, interleaving), codegen and the cost model, am I wrong? That technically can be mitigated by https://reviews.llvm.org/D148581, but if we consider SEME loops, from my perspective, that becomes complicated too.
> 

For some transformations it may be helpful to have the fully expanded form, but in that case it could be introduced when needed (e.g. as we already do by expanding replicate regions before merging them).

I finally managed to wrap up the implementation of an explicit interleave transform on top of the current state: https://github.com/llvm/llvm-project/pull/94339

The current version adds a set of VPInstructions to compute the vector parts of inductions, which requires quite a bit of work to handle all cases. The expanded form is quite a bit more verbose, which is why I think it is desirable to keep things more abstract until needed. Expanding early would also force us to detect the induction increments (which are chained together across parts) and handle them explicitly. It might simplify things to further delay full expansion until after interleaving by introducing an intermediate opcode to compute the step vector for a given part.

> Eventually yes, to simplify codegen. But most importantly, for [EVL-based vectorization](https://github.com/llvm/llvm-project/pull/76172) that change helps to replace the `WidenVFxUF` that is used with `WidenEVL`. The same will be true for the remaining `VPWidenPointerInductionRecipe`, which would also better be decomposed. I also want to point out that #82021 moves the `trunc` optimization to an xform, so that the VPlan becomes more explicit, which IMO makes more sense. However, if VPlan is going to allow some implicitness, it's not so clear whether that is good.
> 

`VPWidenPointerInductionRecipe` has recently been updated to only generate the vector part; the scalar part is now modeled via scalar-iv-steps (since https://github.com/llvm/llvm-project/pull/83068).

VPWidenIntOrFpInductionRecipes are by definition tied to being an induction, which as a consequence also means their step is loop-invariant. Not modeling the increment explicitly has the benefit of ruling out any transform changing their increment to something that is not loop-invariant.

To support wide inductions with EVL, I think it would be better to introduce a generic wide phi recipe (which doesn’t have to be an induction) and lower to that + explicit increment. This can also be used to lower wide induction recipes for simpler code-gen late in the pipeline.
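
To make the lowered form I have in mind more concrete, here is a minimal standalone sketch (plain C++ emulating the lanes, not VPlan/LLVM code; VF, the trip count and the lane loop are made up for illustration): a generic wide phi with an explicit increment, so the amount added per iteration can be the per-iteration EVL rather than a fixed VF.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  const int VF = 4;              // illustrative vectorization factor
  const long TC = 10;            // illustrative trip count
  std::vector<long> WidePhi(VF); // generic wide phi, seeded like an IV
  for (int L = 0; L < VF; ++L)
    WidePhi[L] = L;

  long Processed = 0;
  while (Processed < TC) {
    // Explicit vector length for this iteration.
    long EVL = std::min<long>(VF, TC - Processed);
    for (long L = 0; L < EVL; ++L)
      printf("active lane value %ld\n", WidePhi[L]);
    // Explicit increment of the wide phi by the EVL, not by a fixed VF.
    for (int L = 0; L < VF; ++L)
      WidePhi[L] += EVL;
    Processed += EVL;
  }
  return 0;
}
```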

Another thing to note is that, at the moment, VPWidenIntOrFpInductionRecipe will create a step vector per part, using the step vector of the last part as the incoming value.
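
As a rough illustration of that behavior, here is a small standalone sketch (plain C++ emulating the lanes, not VPlan/LLVM code; VF, UF, start and step are made-up constants): each part adds a per-part step-vector offset, and the backedge value is the last part advanced once more.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int VF = 4, UF = 2;       // illustrative vectorization and unroll factors
  const int Start = 0, Step = 1;  // induction start and loop-invariant step

  // The phi holds part 0's vector value for the current vector iteration.
  std::vector<int> WideIV(VF);
  for (int L = 0; L < VF; ++L)
    WideIV[L] = Start + L * Step;

  for (int Iter = 0; Iter < 2; ++Iter) {
    for (int Part = 0; Part < UF; ++Part) {
      printf("iter %d part %d:", Iter, Part);
      // Part P's value is part 0's value plus a per-part offset of P * VF * Step.
      for (int L = 0; L < VF; ++L)
        printf(" %d", WideIV[L] + Part * VF * Step);
      printf("\n");
    }
    // Incoming value on the backedge: the last part's vector advanced by
    // VF * Step, i.e. part 0 advanced by VF * UF * Step.
    for (int L = 0; L < VF; ++L)
      WideIV[L] += VF * UF * Step;
  }
  return 0;
}
```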

There's a small benefit from explicitly adding the VF as an operand to VPWidenIntOrFpInductionRecipe (avoiding some vscale * VF.getKnownMinValue() computations), as in https://github.com/llvm/llvm-project/pull/95305. But that wouldn't help with EVL, as EVL computations in the vector loop region won't dominate the wide IV recipes.

> Right, and that's my point: now all places that need to be aware of the update will need extra logic to deal with it. Also, for `VPWidenIntOrFpRecipe` the heuristics become a bit more complicated / it is unclear how to estimate the broadcast of VF. For `VPWidenPointerInductionRecipe` it would also have to accommodate the `onlyScalarsGenerated` case. For these vectorized IVs, is there a plan to represent `broadcast(VF)` in the preheader?

`VPWidenIntOrFpRecipe`s already store their scalar step; it would probably also make sense to model the broadcast of the step scaled by VF explicitly in the preheader, to allow CSE once we support explicit broadcasts/extracts. Again, this kind of transformation would fit well late in the pipeline, once we have lowered the more complex constructs for code-gen. The legacy cost modeling for inductions doesn't consider any of this at the moment, but I think we are close to improving on this aspect, even though we might not be able to capture everything completely accurately, at least initially.
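
The CSE opportunity is easiest to see with a small standalone sketch (plain C++ emulating the lanes, not VPlan/LLVM code; VF, step and the two IVs are made up): the splat of Step * VF is materialized once in the "preheader" and reused by every widened IV with the same step.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int VF = 4;
  const int Step = 3;

  // "Preheader": broadcast Step * VF once.
  std::vector<int> StepTimesVF(VF, Step * VF);

  // Two widened IVs with the same step share the same splat.
  std::vector<int> IVA(VF), IVB(VF);
  for (int L = 0; L < VF; ++L) {
    IVA[L] = 0 + L * Step;
    IVB[L] = 100 + L * Step;
  }

  for (int Iter = 0; Iter < 2; ++Iter) {
    printf("iter %d: IVA[0]=%d IVB[0]=%d\n", Iter, IVA[0], IVB[0]);
    for (int L = 0; L < VF; ++L) {
      IVA[L] += StepTimesVF[L]; // shared vector increment
      IVB[L] += StepTimesVF[L];
    }
  }
  return 0;
}
```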

Finally, another concrete example where abstract recipes can be used to simplify transformations is modeling the vector loop's header mask as a general recipe initially, to be specialized later once we decide how to model it concretely (e.g. EVL, active.lane.mask, or a wide IV compare): https://github.com/llvm/llvm-project/pull/89603
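
To sketch why one abstract mask can stand in for several lowerings, here is a small standalone example (plain C++ emulating the lanes, not VPlan/LLVM code; VF and trip count are made up) computing the same header mask three ways, one per possible specialization:

```cpp
#include <cstdio>

int main() {
  const long VF = 4, TC = 10; // illustrative vectorization factor and trip count
  for (long IV = 0; IV < TC; IV += VF) {
    for (long L = 0; L < VF; ++L) {
      bool ByIVCompare = (IV + L) < TC;         // wide IV compare against the trip count
      bool ByLaneMask  = L < (TC - IV);         // active.lane.mask-style: lane < remaining
      long EVL = (TC - IV) < VF ? TC - IV : VF; // EVL-based: lane below the explicit VL
      bool ByEVL = L < EVL;
      printf("base %ld lane %ld: %d %d %d\n", IV, L,
             (int)ByIVCompare, (int)ByLaneMask, (int)ByEVL);
    }
  }
  return 0;
}
```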

https://github.com/llvm/llvm-project/pull/82270

