[PATCH] D135991: [AArch64] Fix cost model for `udiv` instruction when one of the operands is a uniform constant

Mon Oct 31 04:00:01 PDT 2022

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2109
+            (Op2Info.isConstant() && Op2Info.isUniform())) {
+          InstructionCost InsertCost = getArithmeticInstrCost(
+              Instruction::InsertElement, Ty, CostKind, Op1Info, Op2Info);
----------------
zjaffal wrote:
> dmgreen wrote:
> > getArithmeticInstrCost should not be used with InsertElement or ExtractElement. It is used for arithmetic instructions like Add and Div. getVectorInstrCost should be used with InsertElement and ExtractElement.
> > 
> > For the i64 Mul case below we hard-coded the value 2 instead, due to the more regular nature of the scalarization. (It was originally 1, but we had to increase it as it was not quite high enough in some cases).
> The code before my patch used `getArithmeticInstrCost`. Should we only use `getVectorInstrCost` then?
I think that using getArithmeticInstrCost in the old code was wrong, yeah. But all the old cost model code is pretty odd, like the way it does `Cost += Cost;`. Using getVectorInstrCost might end up giving too high a cost. If it does, I would not be against using a cost of 2, like we do for Mul.

I would expect the basic cost of a vector divide that we expand to be:
`(getArithmeticInstrCost(Div, ScalarTy) + 2 * getVectorInstrCost(Extract) + getVectorInstrCost(Insert)) * VF`. The `2 * ExtractCost` could be reduced to `1 * ExtractCost` if the operand is Uniform.

I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".
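
As a rough sketch of the formula above, in standalone C++ rather than LLVM code: `ScalarizationCosts` and `expandedVectorDivCost` are hypothetical names, and the three cost fields are stand-ins for whatever `getArithmeticInstrCost` and `getVectorInstrCost` would return for the scalar divide, extract and insert. The numbers used with it are made up for illustration and are not measured costs.

```cpp
// Hypothetical per-operation costs. In the real cost model these would come
// from getArithmeticInstrCost(Div, ScalarTy) and getVectorInstrCost(...).
struct ScalarizationCosts {
  unsigned ScalarDiv; // cost of one scalar divide
  unsigned Extract;   // cost of one ExtractElement
  unsigned Insert;    // cost of one InsertElement
};

// (Div + N * Extract + Insert) * VF, where N is 2 in general and 1 when one
// operand is uniform and so does not need extracting per lane.
constexpr unsigned expandedVectorDivCost(const ScalarizationCosts &C,
                                         unsigned VF, bool OperandIsUniform) {
  return (C.ScalarDiv + (OperandIsUniform ? 1u : 2u) * C.Extract + C.Insert) *
         VF;
}
```

With made-up costs of 1 for the divide and 2 for each extract and insert, a 4-lane divide would come out at `(1 + 4 + 2) * 4 = 28`, or `(1 + 2 + 2) * 4 = 20` with a uniform operand. Whether either number is reasonable is exactly what the benchmarks would need to confirm.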

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2116
+                            << " Extract Cost: " << ExtractCost << "\n");
+          return InsertCost + ExtractCost + VTy->getNumElements();
+        }
----------------
zjaffal wrote:
> dmgreen wrote:
> > This is assuming that the cost of the actual divide is 1? That sounds too low in some cases, compared to scalar. Should it be `(InsertCost + ExtractCost + Cost) * VTy->getNumElements()`? 
> The problem is that `Cost` here already counts the insert and extract costs, so we would end up counting them twice.
> ```
>     InstructionCost Cost = BaseT::getArithmeticInstrCost(
>         Opcode, Ty, CostKind, Op1Info, Op2Info);
> ```
> This call already takes the scalarization overhead into account. For the following example:
> 
> ```
> define <8 x i32> @foo(<8 x i32> %v) {
>   %res = udiv <8 x i32> %v, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
>   ret <8 x i32> %res
> }
> ```
> The cost was 44. I don't think that is correct.
I see, because we convert the divide into a series of multiplies and shifts. That would apply to scalar divide too, right?

Do we need some code like this?
https://github.com/llvm/llvm-project/blob/72e9447e29abf111c742da59afe4152150a2f8e7/llvm/lib/Target/X86/X86TargetTransformInfo.cpp#L302
It only handles powers-of-2 though, so it might need extending for your use case. Having it somewhere generic would be good too.
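
To illustrate the multiply-and-shift point, here is a standalone C++ sketch (not LLVM code). `udivBy255` shows the arithmetic for the 32-bit `udiv` by 255 from the example, using the multiplier `0x80808081` and a total shift of 39. `OpCosts` and `udivCost` are hypothetical names showing how a cost query might pick between a shift, a multiply plus shift, and a real divide. The actual lowering can need extra fix-up operations for some divisors, so treat the non-power-of-2 branch as a lower bound, not as what the backend does.

```cpp
#include <cstdint>

constexpr bool isPowerOf2(uint64_t V) { return V != 0 && (V & (V - 1)) == 0; }

// x / 255 for 32-bit unsigned x, computed as a widening multiply by
// ceil(2^39 / 255) = 0x80808081 followed by a right shift of 39.
constexpr uint32_t udivBy255(uint32_t X) {
  return static_cast<uint32_t>((static_cast<uint64_t>(X) * 0x80808081ULL) >> 39);
}

// Hypothetical per-operation costs, stand-ins for real cost-model queries.
struct OpCosts {
  unsigned Shift;
  unsigned MulHigh;
  unsigned Div;
};

// Cost of a udiv under the assumptions above:
//   divisor not a uniform constant -> a real divide
//   uniform power-of-2 constant    -> a single shift
//   other uniform constant         -> multiply-high plus shift
constexpr unsigned udivCost(const OpCosts &C, bool UniformConstDivisor,
                            uint64_t Divisor) {
  return !UniformConstDivisor  ? C.Div
         : isPowerOf2(Divisor) ? C.Shift
                               : C.MulHigh + C.Shift;
}
```

The same classification would apply to the scalar divide, which is why a shared helper in generic code seems preferable to a per-target copy.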

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135991/new/

https://reviews.llvm.org/D135991