[llvm] [AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant (PR #123552)
Sushant Gokhale via llvm-commits
llvm-commits at lists.llvm.org
Thu Mar 6 11:14:39 PST 2025
================
@@ -3526,23 +3526,103 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
default:
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
Op2Info);
+ case ISD::SREM:
case ISD::SDIV:
- if (Op2Info.isConstant() && Op2Info.isUniform() && Op2Info.isPowerOf2()) {
- // On AArch64, scalar signed division by constants power-of-two are
- // normally expanded to the sequence ADD + CMP + SELECT + SRA.
- // The OperandValue properties many not be same as that of previous
- // operation; conservatively assume OP_None.
- InstructionCost Cost = getArithmeticInstrCost(
- Instruction::Add, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- Cost += getArithmeticInstrCost(Instruction::Sub, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- Cost += getArithmeticInstrCost(
- Instruction::Select, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- Cost += getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- return Cost;
+ /*
+ Notes for sdiv/srem specific costs:
+ 1. This only considers the cases where the divisor is constant, uniform and
+ (pow-of-2/non-pow-of-2). Other cases are not important since they either
+ result in some form of (ldr + adrp), corresponding to constant vectors, or
+ scalarization of the division operation.
+ 2. Constant divisors, either negative in whole or partially, don't result in
+ significantly different codegen as compared to positive constant divisors.
+ So, we don't consider negative divisors separately.
+ 3. If the codegen is significantly different with SVE, it has been indicated
+ using comments at appropriate places.
+
+ sdiv specific cases:
+ -----------------------------------------------------------------------
+ codegen | pow-of-2 | Type
+ -----------------------------------------------------------------------
+ add + cmp + csel + asr | Y | i64
+ add + cmp + csel + asr | Y | i32
+ -----------------------------------------------------------------------
+
+ srem specific cases:
+ -----------------------------------------------------------------------
+ codegen | pow-of-2 | Type
+ -----------------------------------------------------------------------
+ negs + and + and + csneg | Y | i64
+ negs + and + and + csneg | Y | i32
+ -----------------------------------------------------------------------
+
+ other sdiv/srem cases:
+ -------------------------------------------------------------------------
+ common codegen | + srem | + sdiv | pow-of-2 | Type
+ -------------------------------------------------------------------------
+ smulh + asr + add + add | - | - | N | i64
+ smull + lsr + add + add | - | - | N | i32
+ usra | and + sub | sshr | Y | <2 x i64>
+ 2 * (scalar code) | - | - | N | <2 x i64>
+ usra | bic + sub | sshr + neg | Y | <4 x i32>
+ smull2 + smull + uzp2 | mls | - | N | <4 x i32>
+ + sshr + usra | | | |
+ -------------------------------------------------------------------------
+ */
+ if (Op2Info.isConstant() && Op2Info.isUniform()) {
+ InstructionCost AddCost =
+ getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost AsrCost =
+ getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost MulCost =
+ getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ // add/cmp/csel/csneg should have similar cost while asr/negs/and should
+ // have similar cost.
+ if (LT.second.isScalarInteger()) {
+ if (Op2Info.isPowerOf2()) {
+ return ISD == ISD::SDIV ? (3 * AddCost + AsrCost)
+ : (3 * AsrCost + AddCost);
+ } else {
+ return MulCost + AsrCost + 2 * AddCost;
+ }
+ } else {
+ InstructionCost UsraCost = 2 * AsrCost;
+ if (Op2Info.isPowerOf2()) {
+ // Division with scalable types corresponds to native 'asrd'
+ // instruction when SVE is available.
+ // e.g. %1 = sdiv <vscale x 4 x i32> %a, splat (i32 8)
+ if (Ty->isScalableTy() && ST->hasSVE())
+ return 2 * AsrCost;
----------------
sushgokh wrote:
>The A725 has a throughput of 1 for these, as opposed to 2 for most vector operations. So there is precedence for it.
I assume fdiv is one of the examples here, and that is fine. But for a CPU like Neoverse-V2, where the throughput is >= 1 for most instructions, recip_tput becomes approximately 1 for all of them. There is then no way to tell how costly one instruction is relative to another.
Ideally, we would always like to know the number of cycles consumed, and that is what we look at when using tools like llvm-mca. We never go on to calculate recip_tput. Also, in articles like [this](https://chadaustin.me/2009/02/latency-vs-throughput/), the unit of recip_tput is cycles/instr, which is nothing but latency (under certain conditions, though).
Having cost=1 (with recip_tput as the cost metric) for most instructions is problematic, I think, for the same reason: e.g. a load from the constant pool would be costed the same as a normal mul/add.
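
To make the concern concrete, here is a rough sketch. All names and numbers in it are hypothetical (they are not taken from any optimization guide or from llvm-mca), and rounding the cost up to a whole number with a floor of 1 is my simplifying assumption, not how any particular cost table is built.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical per-instruction figures, purely illustrative.
struct InstrInfo {
  const char *Name;
  double ThroughputPerCycle; // instructions that can issue per cycle
  unsigned LatencyCycles;    // cycles until the result is available
};

// Cost derived from reciprocal throughput. Assumption: costs are whole
// numbers and never drop below 1.
unsigned recipThroughputCost(const InstrInfo &I) {
  double Recip = 1.0 / I.ThroughputPerCycle;
  return std::max(1u, static_cast<unsigned>(std::ceil(Recip)));
}

// Cost derived from latency, for comparison.
unsigned latencyCost(const InstrInfo &I) { return I.LatencyCycles; }
```

With made-up entries such as `{"add", 4.0, 1}`, `{"mul", 2.0, 3}` and `{"ldr", 3.0, 4}`, `recipThroughputCost` returns 1 for all three, while `latencyCost` still separates them. Only something with throughput below 1, e.g. `{"fdiv", 0.5, 7}`, stands out under the first metric.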
>What do you mean by the groups?
I mean some sort of equivalence groups.
e.g. a MemoryOps group consisting of load/store, where an instruction is compared only with the others in its group and then assigned a cost relative to them. If two different groups need to be compared, they can be coalesced and the costs revised.
This is just my thinking, though, and there may be flaws in it.
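
One possible reading of the grouping idea, as a minimal sketch. Nothing here exists in LLVM: `Instr`, `Group`, `relativeCosts` and `coalesce` are hypothetical names, the cycle counts are invented, and taking "cost relative to the cheapest member, rounded up" as the relative measure is an assumption.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical instruction record: a name and some raw cycle figure.
struct Instr {
  std::string Name;
  unsigned Cycles;
};
using Group = std::vector<Instr>;

// Cost of each instruction relative to the cheapest member of its group,
// rounded up so that the cheapest member always gets cost 1.
std::map<std::string, unsigned> relativeCosts(const Group &G) {
  std::map<std::string, unsigned> Costs;
  if (G.empty())
    return Costs;
  unsigned Min = std::min_element(G.begin(), G.end(),
                                  [](const Instr &A, const Instr &B) {
                                    return A.Cycles < B.Cycles;
                                  })
                     ->Cycles;
  Min = std::max(1u, Min); // guard against division by zero
  for (const Instr &I : G)
    Costs[I.Name] = (I.Cycles + Min - 1) / Min; // ceiling division
  return Costs;
}

// Coalescing two groups: merge them so costs are recomputed against the
// cheapest instruction of the combined set.
Group coalesce(const Group &A, const Group &B) {
  Group Merged = A;
  Merged.insert(Merged.end(), B.begin(), B.end());
  return Merged;
}
```

For example, with a MemoryOps group `{ldr: 4, str: 2}` the in-group costs are ldr=2 and str=1. Coalescing it with an arithmetic group `{add: 1, mul: 3}` revises them to ldr=4, str=2, add=1, mul=3.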
https://github.com/llvm/llvm-project/pull/123552