[llvm] [RISCV] Decompose LMUL > 1 reverses into LMUL * M1 vrgather.vv (PR #104574)

Fri Aug 16 09:33:45 PDT 2024

preames wrote:

> > As far as I'm aware, vrgather.vv is quadratic in LMUL on most microarchitectures today due to each output register needing to read from each input register in the group.
> 
> SiFive p470 and p670 are quadratic in the worst case, but will skip reading input registers when they aren't used.
> 
> Earlier versions of x280 were one element per cycle, but newer generations will improve.

For these processors, how does the lowering here compare to the default vrgather.vv?  Is it at least neutral?  Or do we need a tuning flag?

> For smaller VLEN and large EEW it's impossible to read all sources at high LMUL. For example, VLEN=128 SEW=64 has only 2 elements per register so can only depend on 2 source registers in the worst case. Is known hardware still quadratic in LMUL for this case?

It can only depend on two source registers, but which two is not known until runtime.  Depending on where and how vector operations are split, I could see reasonable implementations that are able to exploit this and reasonable ones that aren't.
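
To make the arithmetic concrete, here is a rough Python model of the bound being discussed.  It is only a sketch: the helper names are made up for illustration, and it assumes vl = VLMAX and in-range indices.  It says nothing about what any particular core actually does.

```python
# Illustrative model only; these helpers are hypothetical, not LLVM or
# hardware APIs. Assumes vl == VLMAX and every index is in range.

def elements_per_register(vlen_bits, sew_bits):
    """Number of SEW-wide elements held by one vector register."""
    return vlen_bits // sew_bits

def max_sources_per_output_register(vlen_bits, sew_bits, lmul):
    """Worst-case count of distinct source registers one output register
    of a vrgather.vv can read: one per element, capped at LMUL."""
    return min(lmul, elements_per_register(vlen_bits, sew_bits))

def sources_touched(indices, vlen_bits, sew_bits):
    """For each output register, the set of source registers (numbered
    within the group) that its elements actually read."""
    epr = elements_per_register(vlen_bits, sew_bits)
    return [
        {idx // epr for idx in indices[start:start + epr]}
        for start in range(0, len(indices), epr)
    ]

# VLEN=128, SEW=64: at most 2 sources per output register, even at LMUL=8.
bound_128 = max_sources_per_output_register(128, 64, 8)
# VLEN=256, SEW=64: potentially 4 sources per output register.
bound_256 = max_sources_per_output_register(256, 64, 8)

# A full reverse at VLEN=128, SEW=64, LMUL=4 (8 elements): each output
# register reads exactly one source register, which is what makes the
# per-register (M1) decomposition possible.
reverse_sources = sources_touched(list(range(7, -1, -1)), 128, 64)
```

The sets in `reverse_sources` depend on the index values, which is the runtime-only information mentioned above; the bound itself is static.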

We should definitely run this test on the BP3.  Given that Luke saw an improvement on the BP3, I'd guess it either doesn't implement this optimization, or VLEN=256 implying potentially four sources is enough to decrease the impact.

https://github.com/llvm/llvm-project/pull/104574