[llvm-dev] LoopVectorizer: Should the cost-model be used for legalisation?

Thu Jun 24 08:12:49 PDT 2021

Hi Florian, Vineet,

Thanks both for your input!

From the comments here and from conversations off-list, it seems there is no objection -and perhaps even a slight preference- to use the 'Invalid' state of InstructionCost as a feature. That's probably also the easiest way forward for now, since it avoids us adding all sorts of TTI methods for legalisation that we may need to delete again later when we have a scalarization mechanism for scalable VFs.

Assuming there is no further objection to this, we'll change our patches that are currently on Phabricator to pursue this approach instead.

[just sharing my thoughts/understanding here]

*  Calculating a more accurate scalarization cost probably requires extending `TTI::getScalarizationOverhead` with an extra operand to tell whether the pass which calls the cost-function can create a scalarisation loop. If the flag is false the cost would be Invalid, otherwise it could return some high-but-valid cost. That's because the cost-model doesn't know whether a scalable loop can/will be created.

*  Having a scalarization mechanism is orthogonal to the question how we interpret Invalid costs, although it would significantly reduce the cases where an Invalid cost occurs. If we could emit an opt-remark each time an Invalid cost prevents vectorization, that would help identify any such cases.

* If we interpret Invalid as 'infinitely' expensive, we could remove the calls to TTI->isLegalNTLoad/Store from LVL in favour of them having invalid costs. Otherwise, we'd still need to fix/improve the case for the NT loads/stores. We come across something similar for the VFDatabase, where for some VFs a vectorised function may (not) be available, and so we'd otherwise need to filter out + maintain a set/bitmask of feasible VFs.

> Florian Hahn wrote:
>> I don’t have any insight in hardware-specific costs for scalable vector operations, so I can’t comment on any specifics there. I am wondering if the following hypothetical scenario is feasible/realistic: consider a loop where all operations expect one can be widened directly and a single operations needs scalarizing. I would expect there to be a point where the number of widenable operations gets large enough to offset the cost of emitting a loop to scalarize the single operation needing predication. Granted, depending on the maximum vector width of the hardware this number might be quite large, but I could imagine a scenario where we want to optimize given a tighter upper bound on the maximum vector width than allowed by the hardware spec. E.g. considering the an upper bound of 256 for the vector width, we’d only need to execute 2 iterations of such a loop on AArch64. It would also be interesting to get addition perspective on the cost question from people familiar with other hardware supporting scalable vectors.
You're right that there could be loops big enough for the scalarised sub-loop to be negligible. Our experiments haven't really found any where scalarisation was still beneficial, so this seems very much a corner-case. This may be different for other architectures, although I expect that extracting/inserting arbitrary elements from/to scalable vectors are likely to be more expensive than for fixed-width vectors.

When we have knowledge about the bounds of vscale, we should indeed be able to calculate a more accurate cost. The min_vscale/max_vscale IR attributes can be used for this.

> Vineet Kumar wrote:
>>  Perhaps, the assertion we want in the legalizer is that if an instruction is illegal then it cannot have a valid cost. A legal instruction might still have an invalid/prohibitively expensive cost. This way we get a model where the legalizer does a quick weed-out of illegal instructions and cost-model can do a more fine-grained analysis.
I'm not sure if such an assert would add much value, because the cases we try to guard against are the cases where the LV lets something through (Legal == true), but the code-generator can't handle them (Cost = invalid), since the operations are considered legal unless explicitly stated otherwise.

>> When you say "at runtime no redundant work is being done", does it mean that at the implementation level there might be some duplication of logic?
Yes.

>> Looking at the code, the `InstructionCost` class does not yet model a separate prohibitively-high-but-valid cost, right? IIRC, some places use a hard-coded value to represent a high cost.
That's right. That would require saturation support in InstructionCost (i.e. infinitely-expensive + 10 = (still) infinitely-expensive). This is something on we haven't started on yet, but is on our wish-list.

>> Hmmm, to be honest I can't really imagine a concrete example here. Just thinking out loud here, wondering if there be a case where a non-predicated instruction is invalid but a VP version of it is valid, in which case the VPlan can transform into using a VP version. May be a far-fetched example would be a VP instruction with a fixed EVL that makes it feasible to compute its cost.
I think the other way around would be more problematic, because you could generate a 'all true' predicate for the predicated instruction. When the loop requires predication and there is only an unpredicated instruction available, that might actually prevent vectorization. The VFDatabase case I mentioned above would be such a case. If there are function calls in the scalar loop and the VFDatabase has corresponding vector variants but only unpredicated ones, then the predicated style of vectorization would lead to having an invalid cost, whereas the unpredicated vector body would result in a valid cost.

Thanks,

Sander

> On 15 Jun 2021, at 12:48, Florian Hahn <florian_hahn at apple.com> wrote:
> 
> Hi,
> 
> Thanks for bringing this up!
> 
>> On Jun 10, 2021, at 21:50, Sander De Smalen <Sander.DeSmalen at arm.com> wrote:
>> 
>> Hi,
>> 
>> Last year we added the InstructionCost class which adds the ability to
>> represent that an operation cannot be costed, i.e. operations that cannot
>> be expanded by the code-generator will have an invalid cost.
>> 
>> We started using this information in the Loop Vectorizer for scalable
>> auto-vectorization. The LV has a legality- and a cost-model stage, which are 
>> conceptually separate concepts with different purposes. But with the 
>> introduction of having valid/invalid costs it's more inviting to use the 
>> cost-model as 'legalisation', which leads us to the following question:
>> 
>>   Should we be using the cost-model to do legalisation?
>> 
>> 'Legalisation' in this context means asking the question beforehand if the 
>> code-generator can handle the IR emitted from the LV. Examples of
>> operations that need such legalisation are predicated divides (at least
>> until we can use the llvm.vp intrinsics), or intrinsic calls that have no
>> scalable-vector equivalent. For fixed-width vectors this legalisation issue
>> is mostly moot, since operations on fixed-width vectors can be scalarised.
>> For scalable vectors this is neither supported nor feasible [1].
> 
> I think this is one of the key points. LoopVectorLegality at the moment is mostly concerned whether the loop is not vectorizable due to conceptual issues (not sure sure what the best term would be) preventing vectorization (like unvectorizable dependences or FP constraints), not target specific instruction legality constraints like whether there is a vector version of a given instruction. It is also independent of any given concrete VF. IIUC this is also not limited to predication, e.g. we also need to check whether intrinsic calls are supported natively. Are there any others?
> 
> As you mentioned, predicated instructions that are not supported by a target can be scalarized and predicated. Even for fixed vectors, this is quite expensive in general, but it allows LoopVectorLegality to work mostly independently of other target constraints and focus on general legality checks. IIUC conceptually there’s nothing preventing us from scalarizing operations on scalable vectors in a similar fashion, but it requires an explicit loop dependent on the vector width at runtime and deciding the cost depends on an unknown (the target vector width).
> 
> I don’t have any insight in hardware-specific costs for scalable vector operations, so I can’t comment on any specifics there. I am wondering if the following hypothetical scenario is feasible/realistic: consider a loop where all operations expect one can be widened directly and a single operations needs scalarizing. I would expect there to be a point where the number of widenable operations gets large enough to offset the cost of emitting a loop to scalarize the single operation needing predication. Granted, depending on the maximum vector width of the hardware this number might be quite large, but I could imagine a scenario where we want to optimize given a tighter upper bound on the maximum vector width than allowed by the hardware spec. E.g. considering the an upper bound of 256 for the vector width, we’d only need to execute 2 iterations of such a loop on AArch64. It would also be interesting to get addition perspective on the cost question from people familiar with other hardware supporting scalable vectors.
> 
> As a side-note, there is a use of TTI in LVL, but it is rather problematic. At the moment, LVL considers loops with non-temporal memory operations as unvectorizable up-front, if the vector version for an arbitrary VF (in that case 2 was chosen) is illegal on the target. This has the unfortunate side effect that using non-temporal stores flat out blocks vectorization if there’s no legal non-temporal load/store for VF = 2, which can be very surprising to our users, especially on AArch64 where a single element non-temporal memory operation may not be legal to start with and non-temporal stores may be legal at higher VFs (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp#L788). It’s not a perfect example, illustrates one of the issues of bailing out too early.
> 
>> This means there's the option to do one of two things:
>> 
>> 
>> [Option 1]
>> 
>> Add checks to the LV legalisation to see if scalable-vectorisation is
>> feasible. If so, assert the cost must be valid. Otherwise discard scalable
>> VFs as possible candidates.
>> * This has the benefit that the compiler can avoid
>>   calculating/considering VPlans that we know cannot be costed.
>> * Legalisation and cost-model keep each other in check. If something
>>   cannot be costed then either the cost-model or legalisation was
>>   incomplete.
>> 
>> 
>> [Option 2]
>> 
>> Leave the question about legalisation to the CostModel, i.e. if the
>> CostModel says that <operation> for `VF=vscale x N` is Invalid, then avoid
>> selecting that VF.
>> * This has the benefit that we don't need to do work up-front to
>>   discard scalable VFs, keeping the LV design simpler.
>> * This makes gaps in the cost-model more difficult to spot.
>> 
> 
> I think if we would support scalarization for scalable vectors via a loop, then the cost of predicating any given instruction would not be invalid, just possibly quite high (depending on the upper bound for the vector), right? So we should  still be able to assert that each result != Invalid.
> 
>> Note that it's not useful to combine Option 1 and Option 2, because having
>> two ways to choose from takes away the need to do legalisation beforehand,
>> and so that's basically a choice for Option 2.
>> 
>> Both approaches lead to the same end-result, but we currently have a few
>> patches in flight that have taken Option 1, and this led to some questions
>> about the approach from both Florian and David Green. So we're looking to
>> reach to a consensus and decision on what way to move forward.
> 
> I think one concern that came up during the discussion was that option 1 means that we need to add multiple isLegalToXXX helpers to TTI, which need to be kept in sync with the already existing cost functions. It also more closely couples legality checks and cost-modeling. I’m not sure if/how much native predication support will complicate things and there may be more places that need to be taught about native predication support. I’m not saying this is a blocker, just a trade-off to consider. Also, if we missed a case or add a new case that may require scalarization we would need to update/add additional checks.
> 
> Cheers,
> Florian