[llvm] [RISCV] Account for factor in interleave memory op costs (PR #111511)

Tue Oct 15 06:47:24 PDT 2024

lukel97 wrote:

> BP3 appears to have two implementations - one used for factors 2,3,4 and the other for factors 5,6,7,8. The first appears to be a wide load followed by some kind of shuffle operation, whereas the second appears to scale with the number of 128 b loads required + some kind of adjustment term. I'm having trouble fitting a good formula for either to be honest.

For factors 2, 3 and 4, the formula `1.5 * (VLEN/DLEN) * 2 * LMUL * NF` seems to fit the results in https://github.com/preames/bp3-microarch/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5. It models `(wide load) + NF * LMUL shuffles`.
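To make the fit easier to eyeball, here is a minimal sketch that evaluates the formula above as written. The function name and the VLEN/DLEN values in the example call are assumptions chosen for illustration; they are not taken from the measurements.

```python
def fitted_cost(vlen: int, dlen: int, lmul: int, nf: int) -> float:
    """Evaluate 1.5 * (VLEN/DLEN) * 2 * LMUL * NF exactly as written above.

    Intended only for the factors the formula was fitted to (NF = 2, 3, 4).
    """
    return 1.5 * (vlen / dlen) * 2 * lmul * nf


if __name__ == "__main__":
    # Assumed values for illustration only: VLEN=256, DLEN=128, LMUL=1.
    for nf in (2, 3, 4):
        print(nf, fitted_cost(vlen=256, dlen=128, lmul=1, nf=nf))
```

The value scales linearly in both LMUL and NF, which is what an `NF * LMUL` shuffle term would predict.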

> Given these results, and what Craig has noted above about the x280, I think we need to have distinct costing models for different factors. Annoyingly, it looks like that threshold may need to differ by processor as well.

Agreed. But whatever the uarch, today we're still under-costing all segmented accesses, whether or not that specific NF is optimized as a wide load + shuffle.

Might I propose that we land this as an incremental improvement over what's in-tree today, using `(wide load) + NF * LMUL shuffles` as a generic cost model? That way, generically compiled code without -mcpu/-mtune will still benefit.
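As a rough sketch of the shape of that generic model (not the code in the PR), the cost is one wide memory op plus `NF * LMUL` shuffles. The function name and the unit costs `wide_load_cost` and `shuffle_cost` are hypothetical placeholders; the real per-op costs would come from the target's cost tables.

```python
def generic_interleave_cost(nf: int, lmul: int,
                            wide_load_cost: int = 1,
                            shuffle_cost: int = 1) -> int:
    """(wide load) + NF * LMUL shuffles.

    wide_load_cost and shuffle_cost are assumed unit costs, not LLVM values.
    """
    return wide_load_cost + nf * lmul * shuffle_cost


if __name__ == "__main__":
    # The cost grows with the factor instead of staying flat.
    for nf in range(2, 9):
        print(nf, generic_interleave_cost(nf=nf, lmul=2))
```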

Then, as a follow-up, we can add a tuning feature like `TuneHasSlow{2,3,4,5,6,7,8}SegmentedLoadStore` that further increases the cost to be proportional to VLMAX or `VL*ceil(DLEN/(SEW*factor))`, which we would add to the spacemit-x60 and x280 processor definitions.
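For reference, the two candidate quantities can be written down as below. This is only a literal transcription under stated assumptions: VLMAX uses the standard definition `LMUL * VLEN / SEW`, the second function evaluates the expression above exactly as written, and both function names are hypothetical. The tuning feature itself does not exist yet.

```python
def vlmax(vlen: int, sew: int, lmul: int) -> int:
    """VLMAX = LMUL * VLEN / SEW (integer LMUL assumed for simplicity)."""
    return lmul * vlen // sew


def slow_segment_scale(vl: int, dlen: int, sew: int, factor: int) -> int:
    """VL * ceil(DLEN / (SEW * factor)), transcribed as written above."""
    per_element = -(-dlen // (sew * factor))  # integer ceiling division
    return vl * per_element


if __name__ == "__main__":
    # Assumed example values, not measurements.
    print(vlmax(vlen=256, sew=32, lmul=1))
    print(slow_segment_scale(vl=8, dlen=128, sew=32, factor=2))
```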

https://github.com/llvm/llvm-project/pull/111511