[llvm] [RISCV] Use vsetvli instead of vlenb in Prologue/Epilogue (PR #113756)

Sat Oct 26 17:42:13 PDT 2024

lukel97 wrote:

> > Is this still profitable even for cases when there's no shift needed, e.g. would a vsetvli be better than a single csrr?
> 
> > CSR reads in general are serializing. Vlenb needs to be special cased in the microarchitecture. Some versions of SiFive cores missed this optimization. Have we checked BananaPi?

I did some experimenting, and it looks like the Banana Pi F3 is also missing the optimisation.

csrr.S:
```asm
.global _start
_start:
        li a0, 0
        li a1, 10485760
loop:
        csrr t0, COUNTER
        addi a0, a0, 1
        blt a0, a1, loop
exit:
        li a7, 93
        ecall
```

vsetvli.s:
```asm
.global _start
_start:
        li a0, 0
        li a1, 10485760
loop:
        vsetvli t0, zero, e8, m1, ta, ma
        addi a0, a0, 1
        blt a0, a1, loop
exit:
        li a7, 93
        ecall
```

```
$ clang csrr.S -nostdlib -DCOUNTER=vlenb
$ perf stat ./a.out
...
105,898,187      cycles:u
$ clang csrr.S -nostdlib -DCOUNTER=fflags
$ perf stat ./a.out
...
105,894,862      cycles:u
$ clang vsetvli.s -nostdlib
$ perf stat ./a.out
...
53,484,388      cycles:u
```
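
To put those totals in per-iteration terms, here is a rough sketch that divides each `cycles:u` figure by the loop count from `li a1, 10485760`. The names `ITERATIONS`, `runs` and `cycles_per_iteration` are just illustrative, and the result slightly overstates the loop cost because the totals also include the few instructions outside the loop.

```python
# Iteration count taken from `li a1, 10485760` in both test programs.
ITERATIONS = 10_485_760

# cycles:u totals copied from the perf stat runs above.
runs = {
    "csrr vlenb": 105_898_187,
    "csrr fflags": 105_894_862,
    "vsetvli": 53_484_388,
}


def cycles_per_iteration(total_cycles: int, iterations: int = ITERATIONS) -> float:
    """Average user-mode cycles per loop iteration (setup/exit code included)."""
    return total_cycles / iterations


for name, cycles in runs.items():
    print(f"{name:12s} {cycles_per_iteration(cycles):.2f} cycles/iteration")
```

On these numbers both CSR reads come out at roughly 10 cycles per iteration and the vsetvli loop at roughly 5, so reading vlenb appears to cost about the same as reading fflags on this core.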

I guess this could also introduce VL/VTYPE toggles, but I don't have any data as to whether or not that would be an issue in practice given that this is restricted to prologues and epilogues. I also don't know how expensive a VL/VTYPE toggle is in comparison to a CSR read.

https://github.com/llvm/llvm-project/pull/113756