[llvm-dev] LoopVectorizer: Should the cost-model be used for legalisation?

Tue Jun 15 04:48:58 PDT 2021

Hi,

Thanks for bringing this up!

> On Jun 10, 2021, at 21:50, Sander De Smalen <Sander.DeSmalen at arm.com> wrote:
> 
> Hi,
> 
> Last year we added the InstructionCost class which adds the ability to
> represent that an operation cannot be costed, i.e. operations that cannot
> be expanded by the code-generator will have an invalid cost.
> 
> We started using this information in the Loop Vectorizer for scalable
> auto-vectorization. The LV has a legality- and a cost-model stage, which are 
> conceptually separate concepts with different purposes. But with the 
> introduction of having valid/invalid costs it's more inviting to use the 
> cost-model as 'legalisation', which leads us to the following question:
> 
>   Should we be using the cost-model to do legalisation?
> 
> 'Legalisation' in this context means asking the question beforehand if the 
> code-generator can handle the IR emitted from the LV. Examples of
> operations that need such legalisation are predicated divides (at least
> until we can use the llvm.vp intrinsics), or intrinsic calls that have no
> scalable-vector equivalent. For fixed-width vectors this legalisation issue
> is mostly moot, since operations on fixed-width vectors can be scalarised.
> For scalable vectors this is neither supported nor feasible [1].

I think this is one of the key points. LoopVectorLegality at the moment is mostly concerned whether the loop is not vectorizable due to conceptual issues (not sure sure what the best term would be) preventing vectorization (like unvectorizable dependences or FP constraints), not target specific instruction legality constraints like whether there is a vector version of a given instruction. It is also independent of any given concrete VF. IIUC this is also not limited to predication, e.g. we also need to check whether intrinsic calls are supported natively. Are there any others?

As you mentioned, predicated instructions that are not supported by a target can be scalarized and predicated. Even for fixed vectors, this is quite expensive in general, but it allows LoopVectorLegality to work mostly independently of other target constraints and focus on general legality checks. IIUC conceptually there’s nothing preventing us from scalarizing operations on scalable vectors in a similar fashion, but it requires an explicit loop dependent on the vector width at runtime and deciding the cost depends on an unknown (the target vector width).

I don’t have any insight in hardware-specific costs for scalable vector operations, so I can’t comment on any specifics there. I am wondering if the following hypothetical scenario is feasible/realistic: consider a loop where all operations expect one can be widened directly and a single operations needs scalarizing. I would expect there to be a point where the number of widenable operations gets large enough to offset the cost of emitting a loop to scalarize the single operation needing predication. Granted, depending on the maximum vector width of the hardware this number might be quite large, but I could imagine a scenario where we want to optimize given a tighter upper bound on the maximum vector width than allowed by the hardware spec. E.g. considering the an upper bound of 256 for the vector width, we’d only need to execute 2 iterations of such a loop on AArch64. It would also be interesting to get addition perspective on the cost question from people familiar with other hardware supporting scalable vectors.

As a side-note, there is a use of TTI in LVL, but it is rather problematic. At the moment, LVL considers loops with non-temporal memory operations as unvectorizable up-front, if the vector version for an arbitrary VF (in that case 2 was chosen) is illegal on the target. This has the unfortunate side effect that using non-temporal stores flat out blocks vectorization if there’s no legal non-temporal load/store for VF = 2, which can be very surprising to our users, especially on AArch64 where a single element non-temporal memory operation may not be legal to start with and non-temporal stores may be legal at higher VFs (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp#L788 <https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp#L788>). It’s not a perfect example, illustrates one of the issues of bailing out too early.

> This means there's the option to do one of two things:
> 
> 
> [Option 1]
> 
> Add checks to the LV legalisation to see if scalable-vectorisation is
> feasible. If so, assert the cost must be valid. Otherwise discard scalable
> VFs as possible candidates.
> * This has the benefit that the compiler can avoid
>   calculating/considering VPlans that we know cannot be costed.
> * Legalisation and cost-model keep each other in check. If something
>   cannot be costed then either the cost-model or legalisation was
>   incomplete.
> 
> 
> [Option 2]
> 
> Leave the question about legalisation to the CostModel, i.e. if the
> CostModel says that <operation> for `VF=vscale x N` is Invalid, then avoid
> selecting that VF.
> * This has the benefit that we don't need to do work up-front to
>   discard scalable VFs, keeping the LV design simpler.
> * This makes gaps in the cost-model more difficult to spot.
> 

I think if we would support scalarization for scalable vectors via a loop, then the cost of predicating any given instruction would not be invalid, just possibly quite high (depending on the upper bound for the vector), right? So we should  still be able to assert that each result != Invalid.

> Note that it's not useful to combine Option 1 and Option 2, because having
> two ways to choose from takes away the need to do legalisation beforehand,
> and so that's basically a choice for Option 2.
> 
> Both approaches lead to the same end-result, but we currently have a few
> patches in flight that have taken Option 1, and this led to some questions
> about the approach from both Florian and David Green. So we're looking to
> reach to a consensus and decision on what way to move forward.

I think one concern that came up during the discussion was that option 1 means that we need to add multiple isLegalToXXX helpers to TTI, which need to be kept in sync with the already existing cost functions. It also more closely couples legality checks and cost-modeling. I’m not sure if/how much native predication support will complicate things and there may be more places that need to be taught about native predication support. I’m not saying this is a blocker, just a trade-off to consider. Also, if we missed a case or add a new case that may require scalarization we would need to update/add additional checks.

Cheers,
Florian

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210615/62bf3d39/attachment.html>