[PATCH] D135991: [AArch64] Fix cost model for `udiv` instruction when one of the operands is a uniform constant

Wed Nov 2 03:28:23 PDT 2022

zjaffal added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2109
+            (Op2Info.isConstant() && Op2Info.isUniform())) {
+          InstructionCost InsertCost = getArithmeticInstrCost(
+              Instruction::InsertElement, Ty, CostKind, Op1Info, Op2Info);
----------------
dmgreen wrote:
> zjaffal wrote:
> > dmgreen wrote:
> > > getArithmeticInstrCost should not be used with InsertElement or ExtractElement. It is used for arithmetic instructions like Add and Div. getVectorInstrCost should be used with InsertElement and ExtractElement.
> > > 
> > > For the i64 Mul case below we hard-coded the value 2 instead, due to the more regular nature of the scalarization. (It was originally 1, but we had to increase it as it was not quite high enough in some cases.)
> > The code before my patch used `getArithmeticInstrCost`. Should we only use `getVectorInstrCost` then?
> I think that using getArithmeticInstrCost in the old code was wrong, yeah. But all the old cost model code is pretty odd, like the way it does `Cost += Cost;`. Using getVectorInstrCost might end up giving too high a cost; if it does, I would not be against using a cost of 2, like we do for Mul.
> 
> I would expect the basic cost of a vector divide that we expand to be:
> `(getArithmeticInstrCost(Div, ScalarTy) + 2 * getVectorInstrCost(Extract) + getVectorInstrCost(Insert)) * VF`. The `2*ExtractCost` could be reduced to `1*ExtractCost` if the operand is Uniform.
> 
> I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".
I have now tried the following for uniform constants:
```
          FixedVectorType *VTy = cast<FixedVectorType>(Ty);
          // Sum the per-lane insert and extract costs.
          InstructionCost InsertExtractCost;
          for (unsigned Idx = 0; Idx < VTy->getNumElements(); ++Idx) {
            InsertExtractCost +=
                getVectorInstrCost(Instruction::InsertElement, VTy, Idx);
            InsertExtractCost +=
                getVectorInstrCost(Instruction::ExtractElement, VTy, Idx);
          }
          // One scalar divide per lane.
          InstructionCost DivCost =
              getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind);
          return InsertExtractCost + DivCost * VTy->getNumElements();
```
It overestimates the cost of div instructions. Take the following example:
```
  %V64i8 = udiv <64 x i8> undef, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>
```

The original estimate was: **880**
My current patch estimate: **80**
Using `getArithmeticInstrCost`: **424**

> I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".
I will put up a microbenchmark patch to test the cost result. I think this might give a clearer indication of how accurate the cost estimate is.
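
For reference, the expansion formula quoted above can be sketched as a small standalone model. This is only an illustration, not LLVM code: `LaneCosts` and `scalarizedDivCost` are hypothetical names, and the unit costs in the checks are placeholders rather than values returned by the AArch64 cost model, so the results are not expected to match any of the estimates listed above.
```
#include <cstdint>

// Hypothetical per-lane costs; the real values would come from
// getArithmeticInstrCost and getVectorInstrCost.
struct LaneCosts {
  uint64_t Div;     // cost of one scalar divide
  uint64_t Extract; // cost of one ExtractElement
  uint64_t Insert;  // cost of one InsertElement
};

// (Div + N * Extract + Insert) * VF, where N is 2 in general and 1 when
// one operand is uniform and so does not need a per-lane extract.
constexpr uint64_t scalarizedDivCost(const LaneCosts &C, uint64_t VF,
                                     bool UniformOperand) {
  return (C.Div + (UniformOperand ? 1 : 2) * C.Extract + C.Insert) * VF;
}

// Placeholder unit costs for a 64-lane vector such as <64 x i8>.
static_assert(scalarizedDivCost(LaneCosts{1, 1, 1}, 64, false) == 256,
              "(1 + 2*1 + 1) * 64");
static_assert(scalarizedDivCost(LaneCosts{1, 1, 1}, 64, true) == 192,
              "(1 + 1*1 + 1) * 64");
```
With any fixed per-lane costs the result grows linearly in VF, which is why wide vectors like the `<64 x i8>` example get large estimates under this scheme.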

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135991/new/

https://reviews.llvm.org/D135991