[llvm] [VPlan] Add new VPUniformPerUFRecipe, use for step truncation. (PR #78113)

Sun Jan 21 10:19:57 PST 2024

ayalz wrote:

> > In order to cast a scalar uniform step (placed in the preheader) could we adapt VPWidenCastRecipe - which is a per-part recipe practically widening the resulting type (only) - by having it extend/truncate a given scalar **or** the elements of a given vector, per part - depending on whether its operand is a vector or scalar? (Could possibly rename it, and/or potentially also get the element count of the latter instead of State.VF, if preferred.)
> > Being uniform across VF lanes of each part, or across VF*UF, is more accurately associated with each VPValue than with recipes. The latter may propagate uniformity from their operands to the values they define, as in cast instructions. In some cases, such as integer divide (and floor) by VF or a multiple thereof, a non-uniform (consecutive, aligned on VF) operand may yield a uniform result.
> 
> We could adjust VPWidenCastRecipe to support the uniform-per-UF case by using the existing uniformity checks, but having recipes that support generating both scalar and vector code will make the individual recipes (slightly) more complicated (need extra checks and handle multiple cases in `::execute` or cost estimates). Having separate recipes for the widen (vectorizing) and uniform cases would also allow to encode info explicitly and make it easier when reading the textual representation of a VPlan.
> 
> It also depends what we want to do with `VPReplicateRecipe` long-term, which currently models codegen for both replicating and uniform scalar codegen. Breaking up the recipe into 2 separate ones would help improve clarity and simplify codegen. As part of doing so we should try to make sure the new recipes can be created without an underlying IR instruction (which `VPReplicateRecipe` requires at the moment).
> 
> Note that after explicit unrolling lands, the `PerUF` part will be dropped.
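
To make the divide-by-VF remark in the quoted text concrete, here is a minimal standalone sketch. It is not VPlan code: `divideByVFIsUniform` is a hypothetical helper that models the lanes of one vector part as the consecutive values `Base, Base+1, ..., Base+VF-1` and checks whether the integer divide by VF gives the same result in every lane.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of LLVM: models one vector part whose lanes
// hold the consecutive values Base, Base+1, ..., Base+VF-1.
// Returns true if (Base + Lane) / VF is the same for every lane, i.e. the
// divide result is uniform across the part even though its operand is not.
bool divideByVFIsUniform(uint64_t Base, unsigned VF) {
  assert(VF != 0 && "VF must be non-zero");
  uint64_t First = Base / VF;
  for (unsigned Lane = 1; Lane < VF; ++Lane)
    if ((Base + Lane) / VF != First)
      return false;
  return true;
}
```

With `VF = 4`, a part starting at an aligned base such as 8 yields a uniform quotient, whereas a part starting at 6 straddles a multiple of VF and does not.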

The `VPUniformPerUFRecipe` in the current patch deals only with casts, so it is best named as such, at least for now, and defined comparably to `VPWidenCastRecipe`. Perhaps `VPScalarCastRecipe` would work: it effectively serves **Invariant** values placed outside the loop, which are uniform across all trip-count iterations, not only across the loop's VF and/or UF. Down the road, it should presumably be possible to query whether a recipe's VPValue operand is scalar or vector, along with its uniformity across VF and/or UF or the entire trip count, and use that for cost and code generation, or, if distinct recipes are kept, for asserting that each one sees the expected case.
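
As a rough illustration of associating uniformity with values and letting recipes propagate it, here is a toy model. Everything in it is hypothetical (the `Uniformity` enum and the helper functions are not existing VPlan API); it only sketches the idea that a cast preserves its operand's uniformity and that an invariant operand allows the cast to be placed in the preheader.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical classification, ordered from least to most uniform.
// This is an illustration only, not an LLVM/VPlan type.
enum class Uniformity {
  None,        // may differ per lane
  AcrossVF,    // same value in all VF lanes of each part
  AcrossVFxUF, // same value in all lanes of all UF parts
  Invariant    // same value for every iteration of the loop
};

// A cast operates lane-wise on a single operand, so the value it defines is
// exactly as uniform as that operand.
Uniformity propagateThroughCast(Uniformity Op) { return Op; }

// A lane-wise binary operation is only as uniform as its least uniform
// operand (a conservative rule; it ignores special cases such as dividing
// by VF).
Uniformity propagateThroughBinaryOp(Uniformity A, Uniformity B) {
  return std::min(A, B);
}

// Only a loop-invariant operand lets the cast be placed in the preheader.
bool castCanLiveInPreheader(Uniformity Op) {
  return Op == Uniformity::Invariant;
}
```

Under this model, a recipe handling both scalar and vector operands would branch on the queried `Uniformity`, while distinct recipes would merely assert it.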

How to refactor `VPReplicateRecipe`, relieving it of its reliance on an underlying IR instruction and handling the uniform cases, deserves further thought. Renaming may be appropriate when the recipe(s) generate scalar instruction(s) in general, rather than "replicate" an existing underlying Instruction, i.e., in contrast to a name such as `VPReplicateCastRecipe`.

One aspect of `VPWidenCastRecipe` is its destination type. The vector type, currently set during `::execute`, could be set as soon as VF is known, i.e., once the VPlan's range of VFs holds a single value, either initially or eventually during `optimizeForVFAndUF`.

Yes, explicit unrolling by UF should definitely drop the per-UF logic, thereby simplifying each recipe's `::execute` overall. A similar unrolling by VF could simplify the Replicating Region logic.

https://github.com/llvm/llvm-project/pull/78113