[llvm] [RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 (PR #101051)

Philip Reames via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 30 07:45:26 PDT 2024


================
@@ -1108,60 +1108,60 @@ define void @trunc() {
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv1i64_nxv1i1 = trunc <vscale x 1 x i64> undef to <vscale x 1 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i16_nxv2i8 = trunc <vscale x 2 x i16> undef to <vscale x 2 x i8>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i32_nxv2i8 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i8 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv2i64_nxv2i8 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i8>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i32_nxv2i16 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
----------------
preames wrote:

I went and ran a couple of microbenchmarks on the bp3.

```
Running vnsrl-mf2.out
  ~3.934785 cycles-per-inst
  ~4053.832000 cycles-per-iteration
  ~1030.255150 insts-per-iteration

Running vnsrl-m1.out
  ~3.971956 cycles-per-inst
  ~4020.713650 cycles-per-iteration
  ~1012.275450 insts-per-iteration

Running vnsrl-m2.out
  ~7.453559 cycles-per-inst
  ~8139.421000 cycles-per-iteration
  ~1092.018050 insts-per-iteration

Running vnsrl-m4.out
  ~14.008862 cycles-per-inst
  ~16234.939100 cycles-per-iteration
  ~1158.904950 insts-per-iteration

```
For comparison, here's a vadd.vv at m1:
```
  ~3.970621 cycles-per-inst
  ~4017.622100 cycles-per-iteration
  ~1011.837100 insts-per-iteration
```
So, at least on this board, it looks like you're right that the cost is scaling with the destination LMUL, not the source LMUL.  Interesting!
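As an aside, the updated expectations in the hunk above are consistent with summing a per-step cost over the chain of vnsrl narrowing ops (i64 -> i32 -> i16 -> i8), where each step is charged its *source* LMUL, clamped to a minimum of 1. This is just my reading of the pattern in the test diffs, sketched in Python; the helper name and shape are mine, not the actual RISCVTargetTransformInfo code:

```python
def trunc_cost(src_bits: int, dst_bits: int, src_lmul: float) -> float:
    """Hypothetical cost of an integer trunc lowered as a chain of
    vnsrl steps, each halving the element width. Each step is charged
    max(1, LMUL of its source operand)."""
    cost = 0.0
    bits, lmul = src_bits, src_lmul
    while bits > dst_bits:
        cost += max(1.0, lmul)  # charge this vnsrl at its source LMUL
        bits //= 2              # element width halves per step
        lmul /= 2.0             # so does the register-group size
    return cost

# Matches the updated expectations above for an m2 source (nxv2i64):
#   i64->i8  : 2 + 1 + 1 = 4
#   i64->i16 : 2 + 1     = 3
#   i64->i32 : 2         = 2
```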


https://github.com/llvm/llvm-project/pull/101051

