[llvm] [RISCV] Account for factor in interleave memory op costs (PR #111511)
Philip Reames via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 8 12:35:11 PDT 2024
preames wrote:
At a conceptual level, this doesn't seem like the right costing. I suspect a better model is a wide load followed by some number of additional shuffle uops.
Here's some preliminary throughput data from the banana pi3:
```
perf stat ./vlseg-nf4.out
~23.188822 cycles-per-inst
~24061.448450 cycles-per-iteration
~1037.631350 insts-per-iteration
perf stat ./vlseg-nf2.out
~11.796910 cycles-per-inst
~12026.720450 cycles-per-iteration
~1019.480600 insts-per-iteration
```
These are fairly expensive operations. Both above are LMUL=m1.
Some reference points (focusing on the nf2 case):
```
perf stat ./vnsrl-m1.out
~3.935553 cycles-per-inst
~20508.467900 cycles-per-iteration
~5211.076900 insts-per-iteration
perf stat ./vle-m2.out
~3.972999 cycles-per-inst
~4020.006750 cycles-per-iteration
~1011.831750 insts-per-iteration
perf stat ./vlseg-nf2-emulated.out
~9039.639900 cycles-per-iteration
~5019.285100 insts-per-iteration
```
From this data, it looks like the right cost model for the segment load instruction might be load cost + NF*shuffle cost.
Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that the BP3 had fast segmented loads and stores. It's possible I've got an error in my tests; does anyone else have data on this question?
https://github.com/llvm/llvm-project/pull/111511