[llvm] [RISCV] Account for factor in interleave memory op costs (PR #111511)
Philip Reames via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 8 12:35:11 PDT 2024
preames wrote:
At a conceptual level, this doesn't seem like the right costing. I suspect a better model is a wide load followed by some number of additional shuffle uops.
Here's some preliminary throughput data from the banana pi3:
```
perf stat ./vlseg-nf4.out
~23.188822 cycles-per-inst
~24061.448450 cycles-per-iteration
~1037.631350 insts-per-iteration
perf stat ./vlseg-nf2.out
~11.796910 cycles-per-inst
~12026.720450 cycles-per-iteration
~1019.480600 insts-per-iteration
```
These are fairly expensive operations. Both above are LMUL=m1.
Some reference points (focusing on the nf2 case):
```
perf stat ./vnsrl-m1.out
~3.935553 cycles-per-inst
~20508.467900 cycles-per-iteration
~5211.076900 insts-per-iteration
perf stat ./vle-m2.out
~3.972999 cycles-per-inst
~4020.006750 cycles-per-iteration
~1011.831750 insts-per-iteration
perf stat ./vlseg-nf2-emulated.out
~9039.639900 cycles-per-iteration
~5019.285100 insts-per-iteration
```
From this data, it looks like the right cost model for the segment load instruction might be load cost + NF*shuffle cost.
Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that the BP3 had fast segmented loads and stores. It's possible I've got an error in my tests; does anyone else have data on this question?
https://github.com/llvm/llvm-project/pull/111511