[llvm] [RISCV][CostModel] Add cost for fabs/fsqrt of type bf16/f16 (PR #118608)
Craig Topper via llvm-commits
llvm-commits at lists.llvm.org
Wed Dec 11 23:41:52 PST 2024
================
@@ -1035,21 +1035,40 @@ RISCVTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
}
break;
}
- case Intrinsic::fabs:
+ case Intrinsic::fabs: {
+ auto LT = getTypeLegalizationCost(RetTy);
+ if (ST->hasVInstructions() && LT.second.isVector()) {
+ // lui a0, 8
+ // addi a0, a0, -1
+ // vsetvli a1, zero, e16, m1, ta, ma
+ // vand.vx v8, v8, a0
+ // f16 with zvfhmin and bf16 with zvfhbmin
+ if (LT.second.getVectorElementType() == MVT::bf16 ||
+ (LT.second.getVectorElementType() == MVT::f16 &&
+ !ST->hasVInstructionsF16()))
+ return LT.first * getRISCVInstructionCost(RISCV::VAND_VX, LT.second,
+ CostKind) +
+ 2;
+ else
+ return LT.first *
+ getRISCVInstructionCost(RISCV::VFSGNJX_VV, LT.second, CostKind);
+ }
+ break;
+ }
case Intrinsic::sqrt: {
auto LT = getTypeLegalizationCost(RetTy);
- // TODO: add f16/bf16, bf16 with zvfbfmin && f16 with zvfhmin
if (ST->hasVInstructions() && LT.second.isVector()) {
- unsigned Op;
- switch (ICA.getID()) {
- case Intrinsic::fabs:
- Op = RISCV::VFSGNJX_VV;
- break;
- case Intrinsic::sqrt:
- Op = RISCV::VFSQRT_V;
- break;
- }
- return LT.first * getRISCVInstructionCost(Op, LT.second, CostKind);
+ SmallVector<unsigned, 3> Opcodes;
+ // f16 with zvfhmin and bf16 with zvfbfmin
+ if (LT.second.getVectorElementType() == MVT::bf16)
----------------
topperc wrote:
I don't think 12 is the correct cost for <vscale x 16 x f16>. That cost was not computed using f32 for the vfsqrt.v cost.
The cost for <vscale x 4 x bfloat> should be at least 4 and possibly 5 or 6, not 3. There's a vfwcvt.f.f.v from EEW=16 EMUL=m1 to EEW=32 EMUL=m2. That's a cost of 1 if the hardware can produce an EMUL=2 result in 1 cycle, or 2 if it takes 2 cycles to produce a wide result.
Then there's an EEW=32 EMUL=2 vfsqrt.v. That should be a cost of 2 since the EMUL is 2.
Then there's a vfncvt.f.f.w from EEW=32 EMUL=2 to EEW=16 EMUL=1. That's a cost of 1 if the hardware can narrow EMUL=2 in 1 cycle; otherwise it's a cost of 2.
So the total cost is one of (1 + 2 + 1) = 4, (2 + 2 + 1) = 5, (1 + 2 + 2) = 5, or (2 + 2 + 2) = 6.
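
For illustration only, the arithmetic above can be written as a small standalone sketch. The names `ConvertCosts` and `sqrtViaF32Cost` are hypothetical and not part of LLVM or this patch; the per-instruction costs are the ones assumed in this comment, not values returned by getRISCVInstructionCost.

```cpp
#include <cassert>

// Hypothetical model (not LLVM code) of the sequence used for sqrt on
// <vscale x 4 x bfloat>: widen to f32, take the square root, narrow back.
struct ConvertCosts {
  unsigned Widen;  // vfwcvt.f.f.v, EEW=16 EMUL=m1 -> EEW=32 EMUL=m2: 1 or 2
  unsigned Narrow; // vfncvt.f.f.w, EEW=32 EMUL=m2 -> EEW=16 EMUL=m1: 1 or 2
};

// The vfsqrt.v runs at EEW=32, EMUL=2, so it is assumed to cost 2.
constexpr unsigned sqrtViaF32Cost(ConvertCosts C) {
  return C.Widen + 2 + C.Narrow;
}

// The four combinations listed in the review comment.
static_assert(sqrtViaF32Cost({1, 1}) == 4, "fast widen, fast narrow");
static_assert(sqrtViaF32Cost({2, 1}) == 5, "slow widen, fast narrow");
static_assert(sqrtViaF32Cost({1, 2}) == 5, "fast widen, slow narrow");
static_assert(sqrtViaF32Cost({2, 2}) == 6, "slow widen, slow narrow");
```

Under these assumptions the lowest possible total is 4, which is why a cost of 3 cannot be right for this type.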
https://github.com/llvm/llvm-project/pull/118608