[llvm] [RISCV] Account for factor in interleave memory op costs (PR #111511)
Luke Lau via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 8 22:05:31 PDT 2024
lukel97 wrote:
> Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that BP3 had fast segmented loads and stores. It's possible I've got an error in my tests, anyone else have data on this question?
I tried this out, and I think I can reproduce your observation that the emulated `vlseg2e8.v` is faster: ~0.94B cycles for the emulated version vs ~1.26B cycles for the native one.
<details><summary>Sources</summary>
<p>
```asm
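.global start
start:
la a0, foo                      # source buffer
li a1, 0                        # iteration counter
li a2, 104857600                # iteration count
la a3, dst1                     # unused in this variant (no stores)
la a4, dst2
loop:
vsetvli t0, zero, e16, m2, ta, ma
vle16.v v8, (a0)                # load byte pairs as 16-bit elements (v8-v9)
vsetvli t0, zero, e8, m1, ta, ma
vnsrl.wi v10, v8, 0             # low byte of each pair  -> field 0 (even bytes)
vnsrl.wi v11, v8, 8             # high byte of each pair -> field 1 (odd bytes)
addi a1, a1, 1
blt a1, a2, loop
exit:
li a0, 0                        # exit status
li a7, 93                       # exit syscall number on RISC-V Linux
ecall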
.data
foo:
.zero 512
dst1:
.zero 256
dst2:
.zero 256
```
```asm
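/* Assemble as a .S file; the default below is an assumption, override with -D */
#ifndef LMUL
#define LMUL m1
#endif
.global start
start:
la a0, foo                      # source buffer
li a1, 0                        # iteration counter
li a2, 104857600                # iteration count
la a3, dst1                     # unused in this variant (no stores)
la a4, dst2
vsetvli t0, zero, e8, LMUL, ta, ma
loop:
vlseg2e8.v v8, (a0)             # field 0 (even bytes) -> v8, field 1 (odd bytes) -> v9 at m1
addi a1, a1, 1
blt a1, a2, loop
exit:
li a0, 0                        # exit status
li a7, 93                       # exit syscall number on RISC-V Linux
ecall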
.data
foo:
.zero 512
dst1:
.zero 256
dst2:
.zero 256
```
</p>
</details>
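The emulated version works because, on little-endian RISC-V, each 16-bit element loaded by `vle16.v` holds one (even, odd) byte pair, so narrowing shifts by 0 and 8 split it into the two segment fields that `vlseg2e8.v` would produce. Below is a minimal scalar reference model that checks this equivalence on an arbitrary buffer. It is a hypothetical illustration only: the helper names mirror the instructions but are not a real API, and `vl` is just a parameter.

```python
# Hypothetical scalar reference model for illustration only; the helper names
# mirror the RVV instructions but are not a real API. Memory is little-endian,
# as on RISC-V.

def vlseg2e8(mem: bytes, vl: int) -> tuple[list[int], list[int]]:
    """Two-field segment load: field 0 takes even bytes, field 1 odd bytes."""
    field0 = [mem[2 * i] for i in range(vl)]
    field1 = [mem[2 * i + 1] for i in range(vl)]
    return field0, field1


def vle16(mem: bytes, vl: int) -> list[int]:
    """Unit-stride load of vl little-endian 16-bit elements."""
    return [int.from_bytes(mem[2 * i:2 * i + 2], "little") for i in range(vl)]


def vnsrl_wi(wide: list[int], shamt: int) -> list[int]:
    """Narrowing logical right shift: 16-bit elements to 8-bit elements."""
    return [(x >> shamt) & 0xFF for x in wide]


def emulated_vlseg2e8(mem: bytes, vl: int) -> tuple[list[int], list[int]]:
    """The expansion from the benchmark: vle16.v, then vnsrl.wi by 0 and 8."""
    wide = vle16(mem, vl)
    return vnsrl_wi(wide, 0), vnsrl_wi(wide, 8)


if __name__ == "__main__":
    data = bytes(range(64))  # 32 (even, odd) byte pairs; contents are arbitrary
    assert vlseg2e8(data, 32) == emulated_vlseg2e8(data, 32)
    print(vlseg2e8(data, 4))  # ([0, 2, 4, 6], [1, 3, 5, 7])
```

The model only captures the data movement. It says nothing about how either sequence is scheduled on real hardware, which is what the cycle counts measure.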
However, were you storing the deinterleaved results back somewhere? Once I added the stores, the native `vlseg2e8.v` slightly overtook the emulated version: ~1.58B cycles (native) vs ~1.68B cycles (emulated).
<details><summary>Sources with stores</summary>
<p>
```asm
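/* Assemble as a .S file; the default below is an assumption, override with -D */
#ifndef ITERS
#define ITERS 104857600
#endif
.global start
start:
la a0, foo                      # source buffer
li a1, 0                        # iteration counter
li a2, ITERS                    # iteration count
la a3, dst1                     # destination for field 0
la a4, dst2                     # destination for field 1
loop:
vsetvli t0, zero, e16, m2, ta, ma
vle16.v v8, (a0)                # load byte pairs as 16-bit elements (v8-v9)
vsetvli t0, zero, e8, m1, ta, ma
vnsrl.wi v10, v8, 0             # low byte of each pair  -> field 0 (even bytes)
vnsrl.wi v11, v8, 8             # high byte of each pair -> field 1 (odd bytes)
vse8.v v10, (a3)                # store field 0
vse8.v v11, (a4)                # store field 1
addi a1, a1, 1
blt a1, a2, loop
exit:
li a0, 0                        # exit status
li a7, 93                       # exit syscall number on RISC-V Linux
ecall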
.data
foo:
.zero 512
dst1:
.zero 256
dst2:
.zero 256
```
```asm
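/* Assemble as a .S file; the defaults below are assumptions, override with -D.
   The stores take v8 and v9 as the two fields, which only holds at m1:
   at m2 the second field would start at v10. */
#ifndef ITERS
#define ITERS 104857600
#endif
#ifndef LMUL
#define LMUL m1
#endif
#ifndef VLSEG
#define VLSEG vlseg2e8.v
#endif
.global start
start:
la a0, foo                      # source buffer
li a1, 0                        # iteration counter
li a2, ITERS                    # iteration count
la a3, dst1                     # destination for field 0
la a4, dst2                     # destination for field 1
vsetvli t0, zero, e8, LMUL, ta, ma
loop:
VLSEG v8, (a0)                  # segment load: field 0 -> v8, field 1 -> v9 (at m1)
vse8.v v8, (a3)                 # store field 0
vse8.v v9, (a4)                 # store field 1
addi a1, a1, 1
blt a1, a2, loop
exit:
li a0, 0                        # exit status
li a7, 93                       # exit syscall number on RISC-V Linux
ecall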
.data
foo:
.zero 512
dst1:
.zero 256
dst2:
.zero 256
```
</p>
</details>
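The totals are easier to compare per loop iteration. The sketch below divides the approximate totals quoted above by the trip count. It assumes every variant ran 104857600 iterations, which is the literal used in the first pair of sources. The store variants only show an `ITERS` macro, so that value is an assumption for them.

```python
# Convert the approximate totals quoted above into cycles per loop iteration.
# Assumption: every variant ran ITERS = 104857600 iterations (the literal in the
# first pair of sources; the store variants only show the ITERS macro).
ITERS = 104_857_600

RESULTS = {
    "emulated, no stores": 0.94e9,
    "native vlseg2e8.v, no stores": 1.26e9,
    "emulated, with stores": 1.68e9,
    "native vlseg2e8.v, with stores": 1.58e9,
}


def cycles_per_iter(total_cycles: float, iters: int = ITERS) -> float:
    """Average cycles per loop iteration, including the addi/blt overhead."""
    return total_cycles / iters


if __name__ == "__main__":
    for name, total in RESULTS.items():
        print(f"{name:32s} {cycles_per_iter(total):6.2f} cycles/iter")
```

Under that assumption, the results come to roughly 9 vs 12 cycles per iteration without stores. With stores, they come to roughly 16 (emulated) vs 15 (native). The two `vse8.v` stores therefore add about 7 cycles per iteration to the emulated loop but only about 3 to the native one. Because the inputs are rounded, these figures are approximate too.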
https://github.com/llvm/llvm-project/pull/111511