[llvm] [RISCV] Decompose LMUL > 1 reverses into LMUL * M1 vrgather.vv (PR #104574)

Fri Aug 16 09:59:20 PDT 2024

camel-cdr wrote:

> Is known hardware still quadratic in LMUL for this case?

Yes, at least it behaves that way: the C906/C908/C910 and the X60 seem to perform the same regardless of SEW.

> SiFive p470 and p670 are **quadratic in the worst case**, but will skip reading input registers when they aren't used.

That isn't reflected in the scheduling models, though, unless I misunderstand how they work. The P670 scheduling model reports 1/2/4/8 scaling, and the P470 model reports 1/12/16/24. See [RISCVSchedSiFiveP600.td](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/RISCV/RISCVSchedSiFiveP600.td#L737) and the [P470 PR](https://github.com/llvm/llvm-project/pull/102155).

I think it's unlikely that unrolling to LMUL=1, when possible, will perform worse by more than a few percent on any processor, while it has already been shown that there can be huge benefits:

* [SpacemiT X60](https://camel-cdr.github.io/rvv-bench-results/bpi_f3/byteswap.html)
* [XuanTie C908](https://camel-cdr.github.io/rvv-bench-results/canmv_k230/byteswap.html)
* [XuanTie C910](https://camel-cdr.github.io/rvv-bench-results/milkv_pioneer/byteswap.html) (the measurements are all over the place, I'm not sure what happened there)
* [XiangShanV3](https://camel-cdr.github.io/rvv-bench-results/xiangshanv3/byteswap.html) (there were some performance problems with the vsetvli implementation, which should be fixed now; however, I wasn't able to find a commit on which I could run the benchmark again)

There was no difference on [saturn-vector](https://camel-cdr.github.io/rvv-bench-results/saturn/byteswap.html). It's probably close to the X280 implementations, since it's also one element per cycle and uses a lot of chaining.
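
To make the decomposition concrete, here is a rough Python model of the idea, not the actual lowering in the PR. It assumes a full register group (VL equal to LMUL * VLMAX) and models `vrgather.vv` as a plain indexed read. The function names and the two cost functions are hypothetical; the cost functions only encode the assumption discussed above, that a single LMUL > 1 gather costs about LMUL^2 while LMUL separate M1 gathers cost about LMUL.

```python
def vrgather_vv(src, idx):
    # Simplified model of vrgather.vv: out[i] = src[idx[i]],
    # with out-of-range indices producing 0.
    return [src[j] if 0 <= j < len(src) else 0 for j in idx]


def reverse_whole_group(group, vlmax):
    # One gather over the whole register group (LMUL > 1).
    flat = [x for reg in group for x in reg]
    n = len(flat)
    out = vrgather_vv(flat, [n - 1 - i for i in range(n)])
    return [out[k * vlmax:(k + 1) * vlmax] for k in range(len(group))]


def reverse_decomposed(group, vlmax):
    # LMUL separate M1 gathers: reverse each register on its own and
    # write the results in reversed register order.
    idx = [vlmax - 1 - i for i in range(vlmax)]
    lmul = len(group)
    return [vrgather_vv(group[lmul - 1 - k], idx) for k in range(lmul)]


def assumed_cost_whole(lmul):
    # ASSUMPTION: quadratic scaling in LMUL for a single gather.
    return lmul * lmul


def assumed_cost_decomposed(lmul):
    # ASSUMPTION: each M1 gather costs 1, so LMUL gathers cost LMUL.
    return lmul


if __name__ == "__main__":
    vlmax = 4
    group = [list(range(k * vlmax, (k + 1) * vlmax)) for k in range(4)]
    assert reverse_whole_group(group, vlmax) == reverse_decomposed(group, vlmax)
    for lmul in (1, 2, 4, 8):
        print(lmul, assumed_cost_whole(lmul), assumed_cost_decomposed(lmul))
```

Both functions produce the same reversed group, but under the assumed model the decomposed form grows linearly in LMUL rather than quadratically. On hardware that already skips unused input registers the gap would be smaller.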

https://github.com/llvm/llvm-project/pull/104574