[llvm] [RISCV] Account for factor in interleave memory op costs (PR #111511)

Wed Oct 9 01:17:30 PDT 2024

lukel97 wrote:

> I suspect a better model is performing a wide load and then performing some kind of additional shuffle uop.

I think you're right. I did some more benchmarking on the Banana Pi, and the cycle count looks proportional to something like `Wide load + Factor * LMUL`. These are the cycle counts for various segmented loads, without any stores:

```
vlseg2e8 M1: 1.26B      2 + 2 * 1 = 4 ops   (0.315 cycles/op)
vlseg2e8 M2: 2.52B      4 + 2 * 2 = 8 ops   (0.315 cycles/op)
vlseg2e8 M4: 5.04B      8 + 2 * 4 = 16 ops  (0.315 cycles/op)

vlseg3e8 M1: 2.10B      4 + 3 * 1 = 7 ops   (0.30 cycles/op)
vlseg3e8 M2: 4.20B      8 + 3 * 2 = 14 ops  (0.30 cycles/op)

vlseg4e8 M1: 2.52B      4 + 4 * 1 = 8 ops   (0.315 cycles/op)
vlseg4e8 M2: 5.04B      8 + 4 * 2 = 16 ops  (0.315 cycles/op)
```

I'm not sure what the exact formula for the NF=3 loads is, but this seems close enough. I'll update this PR anyway.
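For reference, the op counts in the table can be reproduced with a short script. This is only a sketch under one assumption that is not confirmed anywhere: the "wide load" term is taken to be `NF * LMUL` rounded up to the next power of two, which happens to match every row above (including the NF=3 ones). The names `next_pow2`, `estimated_ops` and `MEASURED_BCYCLES` are made up for illustration and are not part of the patch.

```python
# Measured cycle counts (the "B" column) copied from the table above,
# keyed by (NF, LMUL) for the e8 segmented loads.
MEASURED_BCYCLES = {
    (2, 1): 1.26, (2, 2): 2.52, (2, 4): 5.04,
    (3, 1): 2.10, (3, 2): 4.20,
    (4, 1): 2.52, (4, 2): 5.04,
}


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def estimated_ops(nf: int, lmul: int) -> int:
    """Hypothetical op count: wide load + Factor * LMUL.

    ASSUMPTION: the wide-load term is NF * LMUL rounded up to a power
    of two. This is inferred from the table, not from any documentation.
    """
    wide_load = next_pow2(nf * lmul)
    shuffle = nf * lmul
    return wide_load + shuffle


if __name__ == "__main__":
    for (nf, lmul), bcycles in MEASURED_BCYCLES.items():
        ops = estimated_ops(nf, lmul)
        # Ratio of measured cycles (in the table's "B" units) to estimated ops.
        print(f"vlseg{nf}e8 M{lmul}: {bcycles:.2f}B / {ops} ops "
              f"= {bcycles / ops:.3f} per op")
```

If the model is roughly right, the per-op ratio should come out nearly constant across rows, which is what the table shows (0.315 for NF=2 and NF=4, 0.30 for NF=3).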

https://github.com/llvm/llvm-project/pull/111511