[PATCH] D135991: [AArch64] Fix cost model for `udiv` instruction when one of the operands is a uniform constant

Wed Nov 2 08:46:04 PDT 2022

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2109
+            (Op2Info.isConstant() && Op2Info.isUniform())) {
+          InstructionCost InsertCost = getArithmeticInstrCost(
+              Instruction::InsertElement, Ty, CostKind, Op1Info, Op2Info);
----------------
zjaffal wrote:
> dmgreen wrote:
> > zjaffal wrote:
> > > dmgreen wrote:
> > > > getArithmeticInstrCost should not be used with InsertElement or ExtractElement. It is used for arithmetic instructions like Add and Div. getVectorInstrCost should be used with InsertElement and ExtractElement.
> > > > 
> > > > For the i64 Mul case below we hard-coded the values 2 instead, due to the more regular nature of the scalarization. (It was originally 1, but we had to increase it as it was not quite high enough in cases).
> > > The code before my patch used `getArithmeticInstrCost` should we only use `getVectorInstrCost` then?
> > I think that using getArithmeticInstrCost in the old code was wrong, yeah. But all the old costmode code is pretty odd, like the way it does `Cost += Cost;`. Using getVectorInstrCost might end up giving too high a cost, if it does I would not be against using a cost of 2, like we do for Mul.
> > 
> > I would expect the basic cost of a vector divide that we expand to be:
> > `(getArithmeticInstrCost(Div, ScalarTy) + 2 * getVectorInstrCost(Extract) + getVectorInstrCost(Insert)) * VF`. The `2*ExractCost` could be reduced to `1*ExtractCost` if the operand is Uniform.
> > 
> > I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".
> I tried doing the following now for uniform constants
> ```
>           FixedVectorType *VTy = cast<FixedVectorType>(Ty);
>           InstructionCost InsertExtractCost;
>           for(unsigned Idx = 0; Idx < VTy->getNumElements() ; Idx ++){
>             InsertExtractCost += getVectorInstrCost(Instruction::InsertElement, VTy, Idx);
>             InsertExtractCost += getVectorInstrCost(Instruction::ExtractElement, VTy, Idx);
>           }
>           InstructionCost DivCost = getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind);
>           return InsertExtractCost + DivCost * VTy->getNumElements();
> ```
> It overly estimates the cost of div instructions, take the following example:
> ```
>   %V64i8 = udiv <64 x i8> undef, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>
> ```
> 
> The original estimate was: **880**
> My current patch estimate: **80**
> Using `getArithmeticInstrCost`: **424**
> 
> >I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".
> I will put a microbenchmark patch to test the cost result. I think this might give a clear indicator of the cost estimate
> 
424 seems like it might it might not be too far off, just counting instructions: https://godbolt.org/z/x8fsYP5Pb. Any 64x vector is going to fairly expensive if we need to scalarize it.

If it was me, I might be tempted to just use `(DivCost + 4) * VF`, to match the mul variant below. 4 is 2 for insert, 2 for extract. It would be 6 for an instruction with variable operands. It might work better for regular scalarization of integers.

That would rely on getArithmeticInstrCost(Div, ScalarTy) getting the correct cost for non-power2 constants to be precise for the example above, which it might not at the moment.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135991/new/

https://reviews.llvm.org/D135991