[llvm] [RISCV][CostModel] Add cost for fabs/fsqrt of type bf16/f16 (PR #118608)

Wed Dec 11 23:41:52 PST 2024

================
@@ -1035,21 +1035,40 @@ RISCVTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     }
     break;
   }
-  case Intrinsic::fabs:
+  case Intrinsic::fabs: {
+    auto LT = getTypeLegalizationCost(RetTy);
+    if (ST->hasVInstructions() && LT.second.isVector()) {
+      // lui a0, 8
+      // addi a0, a0, -1
+      // vsetvli a1, zero, e16, m1, ta, ma
+      // vand.vx v8, v8, a0
+      // f16 with zvfhmin and bf16 with zvfbfmin
+      if (LT.second.getVectorElementType() == MVT::bf16 ||
+          (LT.second.getVectorElementType() == MVT::f16 &&
+           !ST->hasVInstructionsF16()))
+        return LT.first * getRISCVInstructionCost(RISCV::VAND_VX, LT.second,
+                                                  CostKind) +
+               2;
+      else
+        return LT.first *
+               getRISCVInstructionCost(RISCV::VFSGNJX_VV, LT.second, CostKind);
+    }
+    break;
+  }
   case Intrinsic::sqrt: {
     auto LT = getTypeLegalizationCost(RetTy);
-    // TODO: add f16/bf16, bf16 with zvfbfmin && f16 with zvfhmin
     if (ST->hasVInstructions() && LT.second.isVector()) {
-      unsigned Op;
-      switch (ICA.getID()) {
-      case Intrinsic::fabs:
-        Op = RISCV::VFSGNJX_VV;
-        break;
-      case Intrinsic::sqrt:
-        Op = RISCV::VFSQRT_V;
-        break;
-      }
-      return LT.first * getRISCVInstructionCost(Op, LT.second, CostKind);
+      SmallVector<unsigned, 3> Opcodes;
+      // f16 with zvfhmin and bf16 with zvfbfmin
+      if (LT.second.getVectorElementType() == MVT::bf16)
----------------
topperc wrote:

I don't think 12 is the correct cost for <vscale x 16 x f16>. That cost was not computed using f32 as the type for the vfsqrt.v cost.

The cost for <vscale x 4 x bfloat> should be at least 4 and possibly 5 or 6, not 3. There's a vfwcvt.f.f.v from EEW=16 EMUL=m1 to EEW=32 EMUL=m2. That's a cost of 1 if the hardware can produce an EMUL=2 result in 1 cycle, or a cost of 2 if it takes 2 cycles to produce a wide result.

Then there's an EEW=32 EMUL=2 vfsqrt.v. That should be a cost of 2 since the EMUL is 2.

Then there's a vfncvt.f.f.w from EEW=32 EMUL=2 to EEW=16 EMUL=1. That's a cost of 1 if the hardware can narrow EMUL=2 in 1 cycle, or a cost of 2 otherwise.

So the total cost is one of (1 + 2 + 1) = 4, (2 + 2 + 1) = 5, (1 + 2 + 2) = 5, or (2 + 2 + 2) = 6, depending on the widening and narrowing convert costs.
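
The arithmetic above can be sketched as a small standalone helper. This is only an illustration of the reasoning, not code from the patch: `WideSqrtCost` and `promotedSqrtCost` are hypothetical names, and the per-instruction costs of 1 or 2 are the assumed values from the comment, not numbers returned by `getRISCVInstructionCost`.

```cpp
// Hypothetical sketch of the promoted-sqrt cost for <vscale x 4 x bfloat>.
// The operation is modeled as three steps:
//   1. widening convert   (EEW=16, EMUL=m1 -> EEW=32, EMUL=m2)
//   2. vfsqrt.v           (EEW=32, EMUL=m2)
//   3. narrowing convert  (EEW=32, EMUL=m2 -> EEW=16, EMUL=m1)

// Assumed cost of vfsqrt.v at EMUL=m2: proportional to EMUL, so 2.
constexpr unsigned WideSqrtCost = 2;

// WidenCost and NarrowCost are each assumed to be 1 or 2, depending on
// whether the hardware handles an EMUL=m2 operand or result in one cycle.
constexpr unsigned promotedSqrtCost(unsigned WidenCost, unsigned NarrowCost) {
  return WidenCost + WideSqrtCost + NarrowCost;
}
```

Under these assumptions, `promotedSqrtCost(1, 1)` is the lower bound of 4 and `promotedSqrtCost(2, 2)` is the upper bound of 6, so a reported cost of 3 cannot account for all three instructions.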

https://github.com/llvm/llvm-project/pull/118608