[PATCH] D132477: Improve cost model for some 128-bit vector operations that use SVE

Tue Sep 20 07:26:15 PDT 2022

david-arm added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2055
     InstructionCost Cost = BaseT::getArithmeticInstrCost(
         Opcode, Ty, CostKind, Op1Info, Op2Info);
     if (Ty->isVectorTy()) {
----------------
hassnaa-arm wrote:
> sdesmalen wrote:
> > For the case of using SVE, this probably shouldn't be looking at the original type, but rather the type that is legal for the *operation*.
> > 
> > There are two separate things going on here:
> > 1. The *type* legalization cost, which is the cost of getting a legal type.
> > 2. The *operation* legalization cost, which currently assumes a legal type. As your code already points out, operations are only legal on 32- and 64-bit element types.
> > 
> > The way the code-generator handles this is that for 8 and 16-bit inputs, the *type* legalizer first tries to legalize them to something the code-generator can handle. After that, the code-generator will try to promote it to a type that the *operation* can handle.
> > 
> > e.g.
> > * if the type is <2 x i8> that will be promoted to <2 x i32>. This is all fine, we can use the cost of LT.first (The cost of a DIV for v2i32) and combine it with the legalisation cost for going from v2i8 -> v2i32.
> > * if the type is <4 x i16> that is a legal type, but for the DIV operation v4i16 is _not_ legal, so we'll need to consider legalisation for v4i32.
> > 
> > So for i8 and i16 element types, it should call `getTypeLegalizationCost();` again for the type that we know it will be promoted to. That type is a vector with the same number of lanes, but i32 element types.
> > 
> I understood the 2 points of legalisation, I agree with you.
> But I don't agree about using the function of getTypeLegalizationCost(); to get the cost of *operation* legalisation, because that function is related to *type* legalisation only.
> 
> I suggest that, after getting the cost of *type* legalisation,
> if the legal type is not 32/64-bit element type, then append the cost for promoting it to the nearest legal element type (32). * the splitting cost.
> and then consider the cost of the operation * splitting.
> 
So BaseT::getArithmeticInstrCost already takes legalisation cost into account:

    std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);

    bool IsFloat = Ty->isFPOrFPVectorTy();
    // Assume that floating point arithmetic operations cost twice as much as
    // integer operations.
    InstructionCost OpCost = (IsFloat ? 2 : 1);

    if (TLI->isOperationLegalOrPromote(ISD, LT.second)) {
      // The operation is legal. Assume it costs 1.
      // TODO: Once we have extract/insert subvector cost we need to use them.
      return LT.first * OpCost;
    }

    if (!TLI->isOperationExpand(ISD, LT.second)) {
      // If the operation is custom lowered, then assume that the code is twice
      // as expensive.
      return LT.first * 2 * OpCost;
    }

Esentially it is mostly already doing what @sdesmalen is asking for I believe. If the type is <2 x i8> then the code above will get the type legalisation cost of promoting to <2 x i32> and since the operation is custom lowered we return `LT.first * 2 * OpCost`. This is behaving exactly as we want - combining the legalisation cost with the operation cost. We know that <2 x i32> divides will be lowered simply to SVE's sdiv/udiv where the predicate is generated using `ptrue p0.s, vl2`. It's not quite correct for <4 x i16> where the legalisation cost is too low (1) since it's a legal type and therefore we return `1 * 2 * OpCost`. Essentially during lowering for i8 and i18 types we do manual legalisation (see LowerDIV), which neither BaseT::getArithmeticInstrCost nor getTypeLegalizationCost knows anything about. However, @hassnaa-arm's code already tries to take this unknown behaviour into account with the multiplier, i.e.

  else if (LT.second.getScalarType() == MVT::i16)
    Cost *= 4;

Essentially I believe that for all cases except the case where the legalised element type is i8 or i16 BaseT::getArithmeticInstrCost is doing exactly what @sdesmalen is asking for. Consider the different scenarios:

1. <2 x i8> -> <2 x i32>. We return the overall cost as `LT.first * 2 * 1`, i.e. 2 because LT.first = 1.
2. <4 x i8> -> <4 x i16>. Overall cost is `4 x LT.first * 2 * 1`, i.e. 8 because LT.first = 1. @hassnaa-arm has used the multipler '4' as an estimate for the cost of manual type legalisation in `AArch64TargetLowering::LowerDIV`.
3. <2 x i16> -> <2 x i32>. We return the overall cost as `LT.first * 2 * 1`, i.e. 2 because LT.first = 1.
4. <4 x i16>. The type is already legal so we return `4 x LT.first * 2 * 1`, i.e. 8 because LT.first = 1.
5. <8 x i8>. The type is already legal so we return `8 x LT.first * 2 * 1`, i.e. 16 because LT.first = 1.

It seems to me that the real problem here is that getTypeLegalizationCost returns the wrong cost for the legalisation of types such as <2 x i8>, i.e. it only returns 1 even though we need a SIGN_EXTEND_INREG operation for promotion of each input operand.

If we want more accurate costs for < 128 bits how about a compromise here? If the type is < 128 bits we can just manually create a table of costs, since the list is quite small and look up the opcode cost from the table? It feels like it's going to be too difficult trying to model this using generic TTI interfaces. If the vector width >= 128 bits then @hassnaa-arm can use the existing code with the hard-coded multipliers?

You can look for existing calls to CostTableLookup in this file for examples. You'd end up with a table something like this:

  static const CostTblEntry DivTbl[]{
      {ISD::SDIV,  MVT::v2i8, 5},
      {ISD::SDIV,  MVT::v4i8, 8},
      {ISD::SDIV,  MVT::v8i8, 8},
      {ISD::SDIV,  MVT::v2i16, 5},
      {ISD::SDIV,  MVT::v4i16, 5},
      {ISD::SDIV,  MVT::v2i32, 1},
      ... plus entries for ISD::UDIV ...

  if (... size of Ty is < 128 bits ...) {
    MVT OpTy = TLI->getValueType(DL, Ty);
    const auto *Entry =
        CostTableLookup(DivTbl, ISD::SDIV, OpTy);
    return Entry->Cost;
  } else if (LT.second.getScalarType() == MVT::i8)
    Cost *= 8;
  else if (LT.second.getScalarType() == MVT::i16)
    Cost *= 4;

where here I have estimated the cost based on the number of instructions generated for these operations. I treated the ptrue instruction as zero cost.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2100
     // need to scalarize so the cost can be cheaper (smull or umull).
+    // so the cost can be cheaper (smull or umull).
     if (LT.second != MVT::v2i64 || isWideningInstruction(Ty, Opcode, Args))
----------------
nit: This line looks unnecessary.

================
Comment at: llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll:144
+
+; expected cost: 32
+; CHECK: cost of [[#mul(mul(div(64-1, VBITS)+1, 2), 8)]] for instruction:  %udiv64.i8   = udiv <8 x i8> undef, undef
----------------
I think these examples of "expected costs" need updating.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132477/new/

https://reviews.llvm.org/D132477